This week I have made progress on the php tree-crdt code. In particular, I implemented the suggested execution optimization from the paper (lookup child_id in a hashmap). This sped up move operations but did not help listing child nodes, which is necessary for efficiently traversing the tree. So I added a parent --> [child,…] index as well. With this in place, tree walks become quite fast. I also added lamport+actor timestamps, which actually work in a multi-process replicated situation… ie the impl doesn’t rely on any global knowledge crutch now. I also fixed an issue with dup ops entering the log if they have the same timestamp, added several test cases, etc. intend to make a little wiki writeup about it soon. The code is up on github if you’d like to check it out, and both the set-based (tree-set.php) and hashmap based (tree.php) versions can be run/compared.
The only major remaining work on the php code (today) is to implement basic log truncation (as per paper) and get some hands on experience with it. After that, I should be ready to start translating into rust next week.
I am not nearly at that point yet. There are still several unresolved issues in my mind, as well as lack of hands-on experience.
Just to name a few:
How exactly do we implement inode support?
How do we truncate logs, given that the complete set of replicas is not well known, and old replicas may come online at any time? ( today I am working on log truncation as specified in the paper, with known set of replicas.)
How exactly do we serialize metadata such as inode name, size, type, and so on? How can we do this most compactly/efficiently? Should we support extended/extensible attributes?
Ok, I have a couple thoughts to move us forward:
I would like to understand better some things that I haven’t had time yet to look into in depth. If you have time, that could be helpful:
a) how does a fuse filesystem come into being the first time? ie, how does init() work?
b) using the low-level API, how does a path get translated to an inode? eg, if I say ls /tmp/foo, when/how does /tmp/foo get translated to a u64? (which calls?)
c) when I create a file touch /tmp/bar, an inode record is created and a u64 to identify it. when/how is the u64 created?
The crdt paper does give some further hope about using inodes in section 3.6, but I don’t fully grok it yet, perhaps because I haven’t coded a filesystem from scratch before and have the above uncertainties regarding inodes.
Unix filesystems support hardlinks, which allow the same file inode to be referenced from multiple locations in a tree. Our tree data structure can easily be extended to support this: rather than placing file data directly in the leaf nodes of the tree, the leaf node must reference the file inode. Thus, references to the same file inode can appear in multiple leaf nodes of the tree. Symlinks are also easy to support, since they are just leaf nodes containing a path (not a reference to an inode).
So from that description, if leaf nodes only contain references to inode records, then I still don’t get: (a) where are the inode records are actually stored and (b) this doesn’t seem to resolve the u64 to UUID issue at all. @happybeing if you have any insights on this, let me know. Otherwise, I’m thinking to contact the paper authors after I understand a bit more about inode usage.
All the SAFE+network aspects and layers complicate things a lot. I’m thinking let’s start simple instead. crawl, walk, run. prove things out. We can embed the rust crdt tree type directly into the fuse stub that you build. It doesn’t need to do any communication with another peer/replica… just operating on its own. No other SAFE libs involved at all. This exercise will prove that the data type is workable (or not) for a local filesystem. We can identify/resolve the inode mapping issue. We can quickly have a functioning local fuse filesystem and begin to evaluate its performance characteristics as well as logging/caching, or any problem areas. Basically, if we can make this work well standalone, then that is a solid proof of concept. A networked version of it should then be doable and a logical extension, given the crdt properties.
Overall, looks a bit like the PHP filesystem api to me. Coming from a C background, I’ve always found those APIs very easy to use/remember as most are either thin wrappers around C functions or convenience functions that wrapper a few together, so that’s a plus in my book, though some may disagree. It’s probably best that we not make things too rust-specific, as there will be bindings for lots of other langs. My personal preference would be to keep func names the same as in C, rather than adding an _, ie getattr not get_attr.
Regarding the simplified APIs, I’m not really thinking that high level yet, but in general I’d suggest taking a look at what various languages do, eg PHP, python, rust, c#, etc and I think we can identify a common set of useful routines.
yes, I’d like to plugin the crdt-tree data type directly to test out a local-only filesystem, as I proposed in the preceding message. I’m hoping to have some prototype rust code working by end of next week, barring difficulties. Regarding file contents, it could be stored in metadata initially, or possibly in another mounted filesystem.
@danda just to say I’m working away both to improve my understanding of FUSE and filesystems, and to try and come up with some thoughts - maybe even answers - to the questions we have. I’ve covered a lot of ground and clarified some things by writing about filesystems and how to implement them on the FileTree with CDRT etc. I have some more to do and will then respond to your posts above.
Great to see you making progress with the Tree CDRT.
Ditto. I find it useful to write things down even if they get torn up so don’t take anything I come up with as the way I think it should be.
I have been mixing research about filesystems with thinking about implementation of SAFE FS for FUSE and marrying these with how I imagine the FileTree will need to work, and have made notes about possible implementations. But rather than force a lot of reading on anyone I’ll respond to your questions and drop bit in where relevant.
The billion dollar question. I’ve been thinking about this and have more investigation to do to establish how closely SAFE FS should try to emulate typical inode features, or whether for example, it can create and destroy them more like file handles. I don’t know if that’s a useful idea or not, or whether to place that in SAFE FS or FileTree. For example, it may help us by decoupling the needs of the FileTree/TreeCDRT node identifiers from those exposed by SAFE FS.
I’ve summarised my understanding of Unix/Linux filesystem, including inodes (here). There are some things I probably need to look into further, so I made that a Wiki topic for this which can be updated and discussed separately.
I’d also like to know how the idea of a u64 node identifier might work (or not) with a TreeCDRT if you have any thoughts about this. More on that in a bit.
I don’t know about the CDRT side of this but see you’ve made progress. I have a couple of thoughts of how to manage the truncation to keep it trim without losing functionality:
I think it’s desirable to be able to access the full log rather than ever lose it completely. I think each replica can do this, by moving any pruned entries to a Sequence owned by the replica so that the full log can be be reconstructed, while only a shorter one is needed for most operations. This will preserve the ability to roll any FileTree back through all its previous states, or to merge any two FileTree replicas regardless of age and pruning. Maybe this is a ‘for later’ feature, but we could make preserving pruned entries the priority and implement features to make use of this later.
it might be helpful for users to be able to control how and when the log is pruned to suit different use cases. This could either be set when creating the first replica, or perhaps something that can be modified at any time.
EDIT: Another thought as I think truncation has a second aspect. The first is trimming the log so it doesn’t become so large that performance degrades unacceptably.
The second is deciding to have a cut off, before which no replica is merged if it has unmerged entries proper to the cut off. Why might that be useful? I’m not sure, but it occurred to me while I was thinking about the inode problem, trying to avoid clashes for numbers allocated independently. I came up with the following solution which I think would require us to limit how far back mergers should be allowed to go. The downside of that is that all topics will have to sync to stay ahead of the cut-off, or face their changes being blocked. That is recoverable though through a user managed re-sync, and I think for most use cases unlikely to be an issue, so how might having a cut off help as described in: Allocating inode numbers using pre-allocated blocks
I’ve not thought about this so idiot mode activated… The FileTree will be responsible for this because it will be a part of the CDRT operations, and so it will need to keep track of the metadata for each node, serialise it and retrieve it etc.
As for how to serialise it I’m not sure. Does it make sense to have a column in the log for each attribute or to have a single column for all metadata? In the latter case perhaps the metadata is serialised to a Sequence, in the former to the log itself but using run-length-encoding to avoid having to store values which rarely change, or maybe a bit of both?
Assuming this is just more metadata that’s something we could bolt on later so not an urgent question, but in general I’d say we try to implement everything we can in the long term in ways that don’t hurt performance until they are made use of by an application.
The init() operation is used to inform us about the characteristics and parameters of the FUSE kernel, and allows us to tweak some things if we want. Nothing exciting, but will need attention at some point (see docs).
You may be more interested in how we start up when a user tries to create or mount a SAFE FS filesystem. I expect we’ll have an API for creating a new FileTree object and from that object gain access to a SAFE FS API which SAFE FUSE can use to provide the filesystem operations on the FileTree.
So for example a safe-fuse CLI will be invoked with parameters telling it to either create a brand new filesystem (i.e. create an empty FileTree object for it), or an existing replica (i.e. fetch an existing FileTree from the network), and mount it at a given path. Once safe-fuse has a FileTree it obtains a SAFE FS API from it, then passes its FUSE operations to libfuse along with the requested mount point.
(Not necessarily in that order! As in, it might be better to create the mount first but block any operations until we have obtained the FileTree rather than keep the user waiting for that before failing the operation because they gave an invalid mount point.)
Another ‘I think’: from the application side you must always start with a path (you can’t just have an inode number and begin using that, is my understanding - except perhaps for the ‘root’). From the path you can obtain an inode which identifies the thing for subsequent operations on the same thing (typically a file or directory). This is hidden from view when using high-level FUSE. With low-level FUSE, the you call lookup() to get an inode number, which you then use for subsequent operations on the object it represents (e.g. a file or directory). The lookup() increments a hard-link count for the inode, so your process now effectively has a hard-link to the object, which is released when you call forget() which decrements the hard-link count. I see these hard-links as different (ephemeral) from hard-links from a directory entry to a file (enduring), and may be handled differently in our implementation. For example, if a process crashes before calling forget() it is probably better for the ‘link’ to die with it than to leave the inode in a state where its hard-link count will not reach zero when all its other hard-links have been removed.
Part of my to-do list is to look into this: how important it is that SAFE FS inodes correspond to typical filesystem inodes as it may be useful to deviate from this if it makes life easier for the FileTree/TreeCDRT implementation and its interface to SAFE FS.
I imagine it typical filesystem initialises an empty fixed sized table of inode references which has entries filled in or erased as file system objects are created and destroyed.
In terms of the APIs, the filesystem will call our FUSE API with a path in order to create an inode. We respond with the inode number, and it will use that for subsequent operations. So we would probably create an entry in the tree at this point, allocate it a u64 and respond with that.
OS-FUSE obtains the inode number for an entry in the root (’/’) using an inode number of 1, so to get the inode for ‘/tmp’ it will call lookup(req_handle, 1, 'tmp')
If ‘/tmp’ exists we respond with a value, say I’ll call tmp_inode (which might be 2).
OS-FUSE then calls our create(req, tmp_inode, 'bar') to create the file entry.
We find the FileTree node which has an id of tmp_inode, create a child entry with name ‘bar’ and allocate an inode number for the entry (bar_inode) and initialise the metadata for the entry (e.g. creation time).
We respond with the inode number (bar_inode) of the new entry.
If that’s not complicated enough there’s a bit more to it than this so it is going to need a lot of reading and careful thought to ensure a clean, robust design.
Just read this part (possibly again), and it is what I had concluded after lots of thinking about how to implement hard-links. TL;DR: it works so long as we only need links to leaf nodes. This is ok IMO because only MacOS allows hard-links to directories.
Re a): Where the inode has only one reference, i.e. for directories, symlinks and files which don’t have extra hard-links, we could use the FileTree/TreeCDRT node to hold the metadata of the inode. In this case a file is just metadata, including an xor address (similar to a FilesContainer entry). So for most cases we just need a way to map between inode number and FileTree/TreeCDRT node in order to access the data of an inode.
In the case of a hard linked file, we will need a separate object to hold the inode data which can be referenced by every entry which hard-links to it from a directory. I have a scheme in mind where we store that in a Sequence but I won’t go into details here. If this passes scrutiny, the nice part is that we can start without support for hard-links and add it later.
Re b): I need a recap of this. Can you point me at anything or write a post explaining how the UUID is used, how the CDRT operates in this area? My head is filled with FUSE and I need a refresh of the CDRT side (or maybe a summary doc of what you envisage would help).
I agree with building in that way, and also find it useful to set out what I’m aiming at long term even if we end up somewhere else! I can also try to see the stepping stones we can use along that incremental path (e.g. the point about being able to add hard links later).
@happybeing good stuff. I’m headed out the door now. Will give it some thought and get back to you later.
re UUID, the tree is logically composed of a set of triples: (parent_id, metadata, child_id). There is a requirement that the parent_id and child_id must be globally unique, such that any actor (replica) can create one and it will not conflict with an ID created by another actor.
The metadata can contain things like name, size, etc. So we could stick a u64 in there, perhaps generated by some kind of hash of the uuid (simplest, though collisions possible). Or possibly it could be implemented as a g_counter, which is a crdt grow-only counter type. We probably would need to somehow index name and ino fields of the metadata for fast lookup in large-directory situations.
So, the challenge then is how to achieve the indirection in the second example, using our crdt-tree.
My first thought was that we could store the inode_entry data in an immutable object, but this is pretty terrible because its basically useless for working offline, would have super slow lookups, the XorName is based on changing metadata, etc, etc. I hate it.
So, the question becomes: how do we store the inode_entry in the crdt-tree itself? Well, I think there is a way.
The paper discusses that the data type supports multiple root nodes, not just one. The authors call this a forest, and suggest a method to implement deletions by creating a trash node as a sibling to filesystem root, and each delete (final unlink) is implemented as a move to the trash. I recently implemented a test case for this, and the output visually demonstrates:
$ php tree.php test_move_to_trash
After project moved to trash (deleted) on both replicas
Delete op is now causally stable, so we can empty trash:
So my idea now is that we can create another sibling of root called inodes and we store each inode_entry as a child of inodes, in the tree-node’s metadata, eg:
This is genius @danda and I’m kicking myself for not thinking of it. I can’t see any problems with this, and it seems very clean.
I need to look into a question I have about inode reference counts and a couple of other things that I don’t understand, and will get onto the test FUSE with the aim of having it ready by Friday.
I might try using the polyfuse which David pointed to as a way of understanding whether it’s a good option.
Does that sound sensible?
EDIT: I found a good description of POSIX filesystem implementation which explains how the things I’ve summarised in the POSIX topic fit together, including how inodes are created and manipulated (here) followed by the set of inode operations (here).
It’s well worth a read, and I’m wondering if it makes sense for us to try to mirror these APIs.
I suggest we try, because this will act is documentation for how things work without us having to write it, and we might also find a filesystem written in Rust that we can sit on top of this to implement our SAFE FS API, which could be a big win. Anyway, have a read and let’s try and figure out if we can do this on top of your node tree.
Just catching up on these which I’ve not answered so far…
Fair enough, I’ll just remove that suggestion, and I’ve added a note that we may implement this using existing traits such as fuse::Filesystem, and generate language bindings that match the FUSE APIs from this.
Consider all that just a placeholder (putting something there encourages readers to think about it).
As mentioned I’m starting to think about this. I keep finding holes in my knowledge and having to go look for info, so it is proving a slow start but getting there. There are several examples I could use for a framework.
What I need to do is define how it will interact with your FileTree which I’ll share as soon as I think I have somehting, or if you get there first please tell me what will work for you.
@danda from your last two replies can I take it that you have a workable plan (both re UUID as perhaps G-counter, and for hard-links using refs to a list of inodes)?
I’ll assume so for now. BTW, in my search for Rust crates related to filesystems I turned up a few interesting things. The following might be useful for testing:
assert_fs - Filesystem fixtures and assertions for testing (libs.rs)
test-generator - enumerating entries according to file-system pattern and generating a test function for each entry (libs.rs)
integrity-checker - for backups and filesystems (lib.rs)
If you can do so quickly, please can you summarise the operations we will have available on the TreeCDRT so I can look into how to implement the FS APIs using them. For example, is moving a parent node an atomic API or do with have the option to move it in steps such as: 1) remove inode-A from its parent inode-B, 2) add inode-A as child of inode-C.
Also, what would be the operations to mutate metadata?
If you don’t have time I assume I can get them from the paper.
@happybeing an interesting aspect of the paper we will eventually need to deal with is this:
One final type of conflict that we have not discussed so far is multiple child nodes with the same parent and the same metadata. For example in a filesystem, two users could concurrently create files with the same name in the same directory. Our algorithm does not prevent such a conflict, but simply retains both child nodes. In practice, the collision would be resolved by making the filenames distinct, e.g. by appending a replica identifier to the filenames.
Appending a replica identifier to each filename is a bit gross, but may be best option.
This also relates to storing the ino(u64) in the metadata. If each replica were to simply increment the last known ino(u64) when adding a node, then we would get conflicts. I further realized gcounter doesn’t really help with ino identifiers because it is just a shared counter, there is no global uniqueness guarantee. So maybe we are back to the idea of hashing the uuid (unless/until a better strategy presents itself).
Which corresponds to a file structure on the mounted device of:
Where both the above are links to the same file, ino<5>.
In the filesystem, the root is also an inode (directory ino<1>) with at least two directory entries (for . and ..) plus in this example two more for file1 and file2.
We don’t need to store entries for . and .., instead we can fill them out when needed using the self and parent ino numbers. We could also fake the root inode but I think it will simplify the code in both safe-fs and TreeCDRT to have it in the tree structure.
So, in the example (and for the the root directory only) both . and .. will resolve to the root inode ino<1>, whereas all in other cases .. resolves to the ino of the parent. And file1 and file2 both resolve to ino<5>.
Note that in filesystem implementations (eg the Linux VFS) ino<0> is used to initialise an empty directory entry, then the inode object is created and metadata initialised (eg ino, creation time, group, owner, permissions etc) and the ino can be assigned to the directory entry.
So I think you need to respect the conventions for ino<0> and ino<1>, but other than that allocating them can be left to the TreeCDRT.
Feels like it’s coming together!
EDIT: I got the idea that the root directory has ino<1> from the ‘hello’ FUSE examples, but it seems not always the case and I’ve not yet found out how you discover the ino of root! I’ll leave the above for now, but note:
…most contemporary Unix and Unix-like systems seem to use inode number 1 for tracking bad blocks, pushing the root up to inode number 2 (ref to)
ok, so i’ve started working on a little faux filesystem prototype/test case that will implements apis such as lookup(), mkdir, create, link, unlink(), etc.
The first api I started with is lookup(). It (primarily) returns an ino, but according to some docs I found, it is also supposed to increment the inode’s ref_cnt. Even though it is a purely local, read-only operation.
So that brings something into somewhat sharper focus. If I implement this as a move operation, it means that ops will flow across the network for purely local reads. That’s all fine and good for consistency, but it seems a bit overkill/inefficient.
I’m a bit tempted to implement it that way for now though, as I believe it was knuth who said premature optimization is the root of all evil.
Are you sure the ref count is the same as the link count? I thought it read like that at first reading but it doesn’t make sense. So I’m thinking there’s an in use count and a link count, but at this point that’s just my supposition.
One reason I rejected the first idea is that if a running process crashes after doing lookup, you risk using up inodes because they can never be deleted.
yeah, I wondered that too, and I believe you may be right. Will check the docs in more detail.
But it mostly boils down to the same thing. Either all of the changes in the crdt tree nodes are stored in the op log and (eventually) go across the network, or they don’t.
I/we need to think more about ramifications if some do not. For one thing, those local mods would presumably not be stored in operations log, so presumably would not not be recreated in the state after undo/redo apply cycle. Possibly a 2nd log for local ops must be kept. At which point I start to wonder if data type would still meet crdt requirements.
Another thought: fuse has a bunch of flags for which features a filesystem supports or not. So maybe there is a way to say we don’t support ref_cnt for lookups and other read ops…
I suppose we could have a fully separate data structure outside the crdt-tree that stores local-only info about each inode, since time of mount, such as use_cnt. Possibly this could live in memory, especially if we only need it for use_cnt, then only nodes that are actually in use need be stored.
Please keep and share your list of useful reference docs. It seems hard to find some stuff.
I’m hopeful we can use the sync operation as the boundary, because the filesystem operations must have been designed to cope with a similar (hopefully equivalent) problem. So we can try and match the filesystem API with what we can do in a FileTree, and see how well that works.
That’s what I’ve been thinking, but is yet to be established.