Way too early in the process for an rfc, imho. So far this all falls in the realm of early experimentation.
If you’d like to share what you have on hackmd or somewhere for collaboration, that might be good.
Way too early in the process for an rfc, imho. So far this all falls in the realm of early experimentation.
If you’d like to share what you have on hackmd or somewhere for collaboration, that might be good.
So we have an issue with the u64 as identifier in the crdt-tree.
The tree algo requires that node identifiers are globally unique across all replicas, ie something like a UUID, which is 128 bit. This is necessary for the log operations to merge without conflict, else would not be crdt. Any replica can create an ID. Eg, we could think of a laptop, desktop, tablet, and mobile phone all operating on the same FileTree mount-point.
A random u64 is not sufficiently unique to really qualify as a globally unique identifier.
Actually, this raises a question: in the Fuse API, who actually creates new inode IDs: fuse/kernel or the filesystem implementation? @happybeing do you know?
Assuming we use a UUID internally, for fuse we are back to the issue that inode numbers must somehow be mapped to the UUID and back.
I’m pretty sure it’s created by the filesystem and only applies for the low level API, but will confirm this.
Doesn’t this only apply if we use the lower level FUSE API? If we use the path API, we could map paths to whatever internal ID we choose our use paths directly. I think this mapping will live in the FileTree, and probably not need to be exposed.
I think there’s only reason to use the low level API if we can support hard links, which is TBD.
It’s late so I’ll look again tomorrow.
I’ll publish the draft API doc once I’ve reviewed it.
Indeed, this was my original thinking, as I was suggesting we start with the path API before. But then I forgot about the UUID requirement temporarily.
I think that’s the main reason. Also there may not be a solid path based fuse rust implementation right now for windows.
Neither is necessarily a showstopper. We can defer the decision for now.
I’ve taken a detour around the FUSE implementations while learning about Windows support and here’s my summary wrt high and low level approaches.
Based on the summary below I’m inclined towards the low-level FUSE route if we can map paths to inodes because we have more consistent Rust based options across Linux/MacOS/Windows, and can support hard links on all but Windows. I’m not so familiar with the low-level API though, so would need to look at that in more detail if the inode issue can be handled.
On the last point, one option would be to create a u64 hash of the path for use as the inode. This would avoid the need for a map as all FS access will begin with a path to get to the inode of a file, directory or symlink and the FileTree doesn’t need to know about the path once it knows the inode.
So I did some calculations. Chances of a collision appear very small, one in 2^64 per path, so for a filesystem with:
a billion files (10^9 ~ 2^30) would be one in 2^34 or one in 17,179,869,184 = very safe
ten billion files (10^10 ~ 2^34) would be one in 2^30 or one in 1,073,741,824 = safe enough?
a thousand billion files (10^12 ~ 2^40) would be one in 2^24 or one in 16,777,216 = unsafe but still useful for archiving as some files would be trampled eventually but on SAFE old data is never overwritten so you could always get it back.
Caveats: I plucked all the ‘typical’ figures out of the air and have been known to get maths badly wrong, so both need checking! (ref: table of powers of two on Wikipedia)
Alternatively we can just look up the path in the FileTree to get the inode as this is not a particularly frequent operation, not that expensive (one lookup per level to find the directory, get the directory contents which could be a hash-map, and get the inode from that). If we’re keeping the FileTree in client memory these operations are all in local memory.
I think we can start with one and easily switch to another as well. So I’m voting for the low-level API at this point. I’ve made a summary of both and how they could map to the SAFE FS API here:
Linux / MacOS
Linux / MacOS
The outline of a possible SAFE FS API based on the FUSE APIs is now available:
@danda I think it would do no harm to get developer feedback on any SAFE FS proposal early on especially from the core team (i.e. before we have the implementation fleshed out). The best way to involve core devs is probably on github via an issue or as a draft RFC, even if we end up throwing it away. But first obviously we need to get it to the point where you think it is sensible!
Still waiting on this, also in case we can make a provisional decision on which FUSE API to use:
First, a little update from my side.
This week I have made progress on the php tree-crdt code. In particular, I implemented the suggested execution optimization from the paper (lookup child_id in a hashmap). This sped up move operations but did not help listing child nodes, which is necessary for efficiently traversing the tree. So I added a parent --> [child,…] index as well. With this in place, tree walks become quite fast. I also added lamport+actor timestamps, which actually work in a multi-process replicated situation… ie the impl doesn’t rely on any global knowledge crutch now. I also fixed an issue with dup ops entering the log if they have the same timestamp, added several test cases, etc. intend to make a little wiki writeup about it soon. The code is up on github if you’d like to check it out, and both the set-based (tree-set.php) and hashmap based (tree.php) versions can be run/compared.
The only major remaining work on the php code (today) is to implement basic log truncation (as per paper) and get some hands on experience with it. After that, I should be ready to start translating into rust next week.
I am not nearly at that point yet. There are still several unresolved issues in my mind, as well as lack of hands-on experience.
Just to name a few:
How exactly do we implement inode support?
How do we truncate logs, given that the complete set of replicas is not well known, and old replicas may come online at any time? ( today I am working on log truncation as specified in the paper, with known set of replicas.)
How exactly do we serialize metadata such as inode name, size, type, and so on? How can we do this most compactly/efficiently? Should we support extended/extensible attributes?
Ok, I have a couple thoughts to move us forward:
I would like to understand better some things that I haven’t had time yet to look into in depth. If you have time, that could be helpful:
a) how does a fuse filesystem come into being the first time? ie, how does init() work?
b) using the low-level API, how does a path get translated to an inode? eg, if I say
ls /tmp/foo, when/how does /tmp/foo get translated to a u64? (which calls?)
c) when I create a file
touch /tmp/bar, an inode record is created and a u64 to identify it. when/how is the u64 created?
The crdt paper does give some further hope about using inodes in section 3.6, but I don’t fully grok it yet, perhaps because I haven’t coded a filesystem from scratch before and have the above uncertainties regarding inodes.
Unix filesystems support hardlinks, which allow the same file inode to be referenced from multiple locations in a tree. Our tree data structure can easily be extended to support this: rather than placing file data directly in the leaf nodes of the tree, the leaf node must reference the file inode. Thus, references to the same file inode can appear in multiple leaf nodes of the tree. Symlinks are also easy to support, since they are just leaf nodes containing a path (not a reference to an inode).
So from that description, if leaf nodes only contain references to inode records, then I still don’t get: (a) where are the inode records are actually stored and (b) this doesn’t seem to resolve the u64 to UUID issue at all. @happybeing if you have any insights on this, let me know. Otherwise, I’m thinking to contact the paper authors after I understand a bit more about inode usage.
Overall, looks a bit like the PHP filesystem api to me. Coming from a C background, I’ve always found those APIs very easy to use/remember as most are either thin wrappers around C functions or convenience functions that wrapper a few together, so that’s a plus in my book, though some may disagree. It’s probably best that we not make things too rust-specific, as there will be bindings for lots of other langs. My personal preference would be to keep func names the same as in C, rather than adding an
Regarding the simplified APIs, I’m not really thinking that high level yet, but in general I’d suggest taking a look at what various languages do, eg PHP, python, rust, c#, etc and I think we can identify a common set of useful routines.
yes, I’d like to plugin the crdt-tree data type directly to test out a local-only filesystem, as I proposed in the preceding message. I’m hoping to have some prototype rust code working by end of next week, barring difficulties. Regarding file contents, it could be stored in metadata initially, or possibly in another mounted filesystem.
@danda just to say I’m working away both to improve my understanding of FUSE and filesystems, and to try and come up with some thoughts - maybe even answers - to the questions we have. I’ve covered a lot of ground and clarified some things by writing about filesystems and how to implement them on the FileTree with CDRT etc. I have some more to do and will then respond to your posts above.
Great to see you making progress with the Tree CDRT.
Ditto. I find it useful to write things down even if they get torn up so don’t take anything I come up with as the way I think it should be.
I have been mixing research about filesystems with thinking about implementation of SAFE FS for FUSE and marrying these with how I imagine the FileTree will need to work, and have made notes about possible implementations. But rather than force a lot of reading on anyone I’ll respond to your questions and drop bit in where relevant.
The billion dollar question. I’ve been thinking about this and have more investigation to do to establish how closely SAFE FS should try to emulate typical inode features, or whether for example, it can create and destroy them more like file handles. I don’t know if that’s a useful idea or not, or whether to place that in SAFE FS or FileTree. For example, it may help us by decoupling the needs of the
FileTree/TreeCDRT node identifiers from those exposed by SAFE FS.
I’ve summarised my understanding of Unix/Linux filesystem, including inodes (here). There are some things I probably need to look into further, so I made that a Wiki topic for this which can be updated and discussed separately.
I’d also like to know how the idea of a u64 node identifier might work (or not) with a
TreeCDRT if you have any thoughts about this. More on that in a bit.
I don’t know about the CDRT side of this but see you’ve made progress. I have a couple of thoughts of how to manage the truncation to keep it trim without losing functionality:
I think it’s desirable to be able to access the full log rather than ever lose it completely. I think each replica can do this, by moving any pruned entries to a
Sequence owned by the replica so that the full log can be be reconstructed, while only a shorter one is needed for most operations. This will preserve the ability to roll any
FileTree back through all its previous states, or to merge any two
FileTree replicas regardless of age and pruning. Maybe this is a ‘for later’ feature, but we could make preserving pruned entries the priority and implement features to make use of this later.
it might be helpful for users to be able to control how and when the log is pruned to suit different use cases. This could either be set when creating the first replica, or perhaps something that can be modified at any time.
EDIT: Another thought as I think truncation has a second aspect. The first is trimming the log so it doesn’t become so large that performance degrades unacceptably.
The second is deciding to have a cut off, before which no replica is merged if it has unmerged entries proper to the cut off. Why might that be useful? I’m not sure, but it occurred to me while I was thinking about the inode problem, trying to avoid clashes for numbers allocated independently. I came up with the following solution which I think would require us to limit how far back mergers should be allowed to go. The downside of that is that all topics will have to sync to stay ahead of the cut-off, or face their changes being blocked. That is recoverable though through a user managed re-sync, and I think for most use cases unlikely to be an issue, so how might having a cut off help as described in: Allocating inode numbers using pre-allocated blocks
I’ve not thought about this so idiot mode activated… The
FileTree will be responsible for this because it will be a part of the CDRT operations, and so it will need to keep track of the metadata for each node, serialise it and retrieve it etc.
As for how to serialise it I’m not sure. Does it make sense to have a column in the log for each attribute or to have a single column for all metadata? In the latter case perhaps the metadata is serialised to a
Sequence, in the former to the log itself but using run-length-encoding to avoid having to store values which rarely change, or maybe a bit of both?
Assuming this is just more metadata that’s something we could bolt on later so not an urgent question, but in general I’d say we try to implement everything we can in the long term in ways that don’t hurt performance until they are made use of by an application.
init() operation is used to inform us about the characteristics and parameters of the FUSE kernel, and allows us to tweak some things if we want. Nothing exciting, but will need attention at some point (see docs).
You may be more interested in how we start up when a user tries to create or mount a SAFE FS filesystem. I expect we’ll have an API for creating a new
FileTree object and from that object gain access to a SAFE FS API which SAFE FUSE can use to provide the filesystem operations on the
So for example a safe-fuse CLI will be invoked with parameters telling it to either create a brand new filesystem (i.e. create an empty
FileTree object for it), or an existing replica (i.e. fetch an existing
FileTree from the network), and mount it at a given path. Once safe-fuse has a
FileTree it obtains a SAFE FS API from it, then passes its FUSE operations to libfuse along with the requested mount point.
(Not necessarily in that order! As in, it might be better to create the mount first but block any operations until we have obtained the
FileTree rather than keep the user waiting for that before failing the operation because they gave an invalid mount point.)
Another ‘I think’: from the application side you must always start with a path (you can’t just have an inode number and begin using that, is my understanding - except perhaps for the ‘root’). From the path you can obtain an inode which identifies the thing for subsequent operations on the same thing (typically a file or directory). This is hidden from view when using high-level FUSE. With low-level FUSE, the you call
lookup() to get an inode number, which you then use for subsequent operations on the object it represents (e.g. a file or directory). The
lookup() increments a hard-link count for the inode, so your process now effectively has a hard-link to the object, which is released when you call
forget() which decrements the hard-link count. I see these hard-links as different (ephemeral) from hard-links from a directory entry to a file (enduring), and may be handled differently in our implementation. For example, if a process crashes before calling
forget() it is probably better for the ‘link’ to die with it than to leave the inode in a state where its hard-link count will not reach zero when all its other hard-links have been removed.
Part of my to-do list is to look into this: how important it is that SAFE FS inodes correspond to typical filesystem inodes as it may be useful to deviate from this if it makes life easier for the
FileTree/TreeCDRT implementation and its interface to SAFE FS.
I imagine it typical filesystem initialises an empty fixed sized table of inode references which has entries filled in or erased as file system objects are created and destroyed.
In terms of the APIs, the filesystem will call our FUSE API with a path in order to create an inode. We respond with the inode number, and it will use that for subsequent operations. So we would probably create an entry in the tree at this point, allocate it a u64 and respond with that.
lookup(req_handle, 1, 'tmp')
tmp_inode(which might be 2).
create(req, tmp_inode, 'bar')to create the file entry.
FileTreenode which has an id of
tmp_inode, create a child entry with name ‘bar’ and allocate an inode number for the entry (
bar_inode) and initialise the metadata for the entry (e.g. creation time).
bar_inode) of the new entry.
If that’s not complicated enough there’s a bit more to it than this so it is going to need a lot of reading and careful thought to ensure a clean, robust design.
Just read this part (possibly again), and it is what I had concluded after lots of thinking about how to implement hard-links. TL;DR: it works so long as we only need links to leaf nodes. This is ok IMO because only MacOS allows hard-links to directories.
Re a): Where the inode has only one reference, i.e. for directories, symlinks and files which don’t have extra hard-links, we could use the
FileTree/TreeCDRT node to hold the metadata of the inode. In this case a file is just metadata, including an xor address (similar to a
FilesContainer entry). So for most cases we just need a way to map between inode number and
FileTree/TreeCDRT node in order to access the data of an inode.
In the case of a hard linked file, we will need a separate object to hold the inode data which can be referenced by every entry which hard-links to it from a directory. I have a scheme in mind where we store that in a
Sequence but I won’t go into details here. If this passes scrutiny, the nice part is that we can start without support for hard-links and add it later.
Re b): I need a recap of this. Can you point me at anything or write a post explaining how the UUID is used, how the CDRT operates in this area? My head is filled with FUSE and I need a refresh of the CDRT side (or maybe a summary doc of what you envisage would help).
I agree with building in that way, and also find it useful to set out what I’m aiming at long term even if we end up somewhere else! I can also try to see the stepping stones we can use along that incremental path (e.g. the point about being able to add hard links later).
@happybeing good stuff. I’m headed out the door now. Will give it some thought and get back to you later.
re UUID, the tree is logically composed of a set of triples: (parent_id, metadata, child_id). There is a requirement that the parent_id and child_id must be globally unique, such that any actor (replica) can create one and it will not conflict with an ID created by another actor.
The metadata can contain things like name, size, etc. So we could stick a u64 in there, perhaps generated by some kind of hash of the uuid (simplest, though collisions possible). Or possibly it could be implemented as a g_counter, which is a crdt grow-only counter type. We probably would need to somehow index name and ino fields of the metadata for fast lookup in large-directory situations.
Ok, I’ve been thinking more about how to implement hard-links, and I have an idea.
first, the issue is that:
rather than placing file data directly in the leaf nodes of the tree, the leaf node must reference the file inode. Thus references to the same file inode can appear in multiple leaf nodes of the tree
Now my original thinking was that:
So that looks like:
root - file1 (uuid<555>, ino<5>, meta...) -------> <XorName1> --------> file content - file2 (uuid(556>, ino<6>, meta...) -------> <XorName1> ----/
But that isn’t quite what is described in the above quote.
I believe “the unix way” and what is described in the quote above, looks like:
root - file1 (ino<5>) ---------> inode_entry(ino<5>, ref_cnt<2>, meta...) --> file contents - file2 (ino<5>) ----/
So, the challenge then is how to achieve the indirection in the second example, using our crdt-tree.
My first thought was that we could store the inode_entry data in an immutable object, but this is pretty terrible because its basically useless for working offline, would have super slow lookups, the XorName is based on changing metadata, etc, etc. I hate it.
So, the question becomes: how do we store the inode_entry in the crdt-tree itself? Well, I think there is a way.
The paper discusses that the data type supports multiple root nodes, not just one. The authors call this a
forest, and suggest a method to implement deletions by creating a trash node as a sibling to filesystem root, and each delete (final unlink) is implemented as a move to the trash. I recently implemented a test case for this, and the output visually demonstrates:
$ php tree.php test_move_to_trash Initial tree - forest - root - home - bob - project - a - a - b - b - a - b - trash After project moved to trash (deleted) on both replicas - forest - root - home - bob - trash - project - a - a - b - b - a - b Delete op is now causally stable, so we can empty trash: - forest - root - home - bob - trash
So my idea now is that we can create another sibling of root called
inodes and we store each inode_entry as a child of inodes, in the tree-node’s metadata, eg:
- forest - root - file1 - file2 - inodes - 125 - 126 - 127 - trash
So going back to our original examples, now we get:
- root - file1 (uuid<555>,ino<5>, inode_uuid<123>) - file2 (uuid<556>,ino<5>, inode_uuid<123>) - inodes - (uuid<123>, ino<5>, ref_cnt<2>, meta...) --> XorName ---> file content - trash
When unlinking file2, file2 node would be moved to trash and the inode ref_cnt decremented, and then we have:
- root - file1 (uuid<555>,ino<5>, inode_uuid<123>) - inodes - (uuid<123>, ino<5>, ref_cnt<1>) --> XorName ---> file content - trash - file2 (uuid<556>,ino<5>, inode_uuid<123>)
and when unlinking file1, then both file1 and the inode entry would be moved to trash, and only root is left.
- root - inodes - trash - file1 (uuid<555>,ino<5>, inode_uuid<123>) - file2 (uuid<556>,ino<5>, inode_uuid<123>) - (uuid<123>, ino<5>, ref_cnt<1>) --> XorName ---> file content
Hopefully that is clear enough for others to follow my train of thought. @happybeing please let me know what you think.
I think this works. At least I see a clear enough path that I should be able to try out a test case with the prototype, as I did for moves to trash and emptying trash.
This is genius @danda and I’m kicking myself for not thinking of it. I can’t see any problems with this, and it seems very clean.
I need to look into a question I have about inode reference counts and a couple of other things that I don’t understand, and will get onto the test FUSE with the aim of having it ready by Friday.
I might try using the polyfuse which David pointed to as a way of understanding whether it’s a good option.
Does that sound sensible?
EDIT: I found a good description of POSIX filesystem implementation which explains how the things I’ve summarised in the POSIX topic fit together, including how inodes are created and manipulated (here) followed by the set of inode operations (here).
It’s well worth a read, and I’m wondering if it makes sense for us to try to mirror these APIs.
I suggest we try, because this will act is documentation for how things work without us having to write it, and we might also find a filesystem written in Rust that we can sit on top of this to implement our SAFE FS API, which could be a big win. Anyway, have a read and let’s try and figure out if we can do this on top of your node tree.
Just catching up on these which I’ve not answered so far…
Fair enough, I’ll just remove that suggestion, and I’ve added a note that we may implement this using existing traits such as fuse::Filesystem, and generate language bindings that match the FUSE APIs from this.
Consider all that just a placeholder (putting something there encourages readers to think about it).
As mentioned I’m starting to think about this. I keep finding holes in my knowledge and having to go look for info, so it is proving a slow start but getting there. There are several examples I could use for a framework.
What I need to do is define how it will interact with your FileTree which I’ll share as soon as I think I have somehting, or if you get there first please tell me what will work for you.
@danda from your last two replies can I take it that you have a workable plan (both re UUID as perhaps G-counter, and for hard-links using refs to a list of inodes)?
I’ll assume so for now. BTW, in my search for Rust crates related to filesystems I turned up a few interesting things. The following might be useful for testing:
If you can do so quickly, please can you summarise the operations we will have available on the TreeCDRT so I can look into how to implement the FS APIs using them. For example, is moving a parent node an atomic API or do with have the option to move it in steps such as: 1) remove inode-A from its parent inode-B, 2) add inode-A as child of inode-C.
Also, what would be the operations to mutate metadata?
If you don’t have time I assume I can get them from the paper.
minor correction: ino(u64) as g_counter.
Currently both are untested ideas, so “hopefully workable plan” might be more accurate. I intend to experiment with them this week.
There is only one operation on a crdt-tree node. The move operation, which is atomic.
The tree-crdt move op looks like: move(parent_id, metadata, child_id). where:
Logical op to move op mapping:
Having only a single type of operation makes the crdt easier to reason about.
@happybeing an interesting aspect of the paper we will eventually need to deal with is this:
One final type of conflict that we have not discussed so far is multiple child nodes with the same parent and the same metadata. For example in a filesystem, two users could concurrently create files with the same name in the same directory. Our algorithm does not prevent such a conflict, but simply retains both child nodes. In practice, the collision would be resolved by making the filenames distinct, e.g. by appending a replica identifier to the filenames.
Appending a replica identifier to each filename is a bit gross, but may be best option.
This also relates to storing the ino(u64) in the metadata. If each replica were to simply increment the last known ino(u64) when adding a node, then we would get conflicts. I further realized gcounter doesn’t really help with ino identifiers because it is just a shared counter, there is no global uniqueness guarantee. So maybe we are back to the idea of hashing the uuid (unless/until a better strategy presents itself).
I made a post about solving the inode number allocation problem with a couple of new ideas. We can just use the UUID hash for now, but these might be options for improvement:
I think/hope this can be simplified to:
- root - file1 (uuid<555>,inode_uuid<123>) - file2 (uuid<556>,inode_uuid<123>) - inodes - (uuid<123>, ino<5>, ref_cnt<2>, meta...) --> XorName ---> file content - trash
Here we have removed the redundant ino<5> field from file1 and file2 metadata.
The inode_entry can still be looked up directly via 5.
The cost is that when looking up, ie lookup(), the inode number via path, eg /root/file1, an indirection (1 index lookup) is required to /inodes/123 to find ino<5>.
The benefit is that we save disk space and bandwidth because we are not duplicating the ino field everywhere.
That makes sense as far as I can tell at this stage.
Let me also answer by describing what I think this looks like from the filesystem API viewpoint.
The filesystem doesn’t need to know about what you’ve labelled nodes or trash, so we’re only concerned with this part of the
- root - file1 (uuid<555>,inode_uuid<123>) - file2 (uuid<556>,inode_uuid<123>)
Which corresponds to a file structure on the mounted device of:
Where both the above are links to the same file, ino<5>.
In the filesystem, the root is also an inode (directory ino<1>) with at least two directory entries (for
..) plus in this example two more for file1 and file2.
We don’t need to store entries for
.., instead we can fill them out when needed using the self and parent ino numbers. We could also fake the root inode but I think it will simplify the code in both safe-fs and
TreeCDRT to have it in the tree structure.
So, in the example (and for the the root directory only) both
.. will resolve to the root inode ino<1>, whereas all in other cases
.. resolves to the ino of the parent. And file1 and file2 both resolve to ino<5>.
Note that in filesystem implementations (eg the Linux VFS) ino<0> is used to initialise an empty directory entry, then the inode object is created and metadata initialised (eg ino, creation time, group, owner, permissions etc) and the ino can be assigned to the directory entry.
So I think you need to respect the conventions for ino<0> and ino<1>, but other than that allocating them can be left to the
Feels like it’s coming together!
EDIT: I got the idea that the root directory has ino<1> from the ‘hello’ FUSE examples, but it seems not always the case and I’ve not yet found out how you discover the ino of root! I’ll leave the above for now, but note:
…most contemporary Unix and Unix-like systems seem to use inode number 1 for tracking bad blocks, pushing the root up to inode number 2 (ref to)
…until I find something definitive.