FileTree CRDT for SAFE Network

happybeing · July 30, 2020, 5:34pm

@danda just to say I’m working away both to improve my understanding of FUSE and filesystems, and to try and come up with some thoughts - maybe even answers - to the questions we have. I’ve covered a lot of ground and clarified some things by writing about filesystems and how to implement them on the FileTree with CRDT etc. I have some more to do and will then respond to your posts above.

Great to see you making progress with the Tree CRDT.

happybeing · July 31, 2020, 5:06pm

Ditto. I find it useful to write things down even if they get torn up so don’t take anything I come up with as the way I think it should be.

I have been mixing research about filesystems with thinking about implementation of SAFE FS for FUSE and marrying these with how I imagine the FileTree will need to work, and have made notes about possible implementations. But rather than force a lot of reading on anyone I’ll respond to your questions and drop bits in where relevant.

The billion dollar question. I’ve been thinking about this and have more investigation to do to establish how closely SAFE FS should try to emulate typical inode features, or whether, for example, it can create and destroy them more like file handles. I don’t know if that’s a useful idea or not, or whether to place that in SAFE FS or FileTree. For example, it may help us by decoupling the needs of the FileTree/TreeCRDT node identifiers from those exposed by SAFE FS.

I’ve summarised my understanding of Unix/Linux filesystem, including inodes (here). There are some things I probably need to look into further, so I made that a Wiki topic for this which can be updated and discussed separately.

I’d also like to know how the idea of a u64 node identifier might work (or not) with a TreeCRDT if you have any thoughts about this. More on that in a bit.

I don’t know about the CRDT side of this but see you’ve made progress. I have a couple of thoughts on how to manage the truncation to keep it trim without losing functionality:

I think it’s desirable to be able to access the full log rather than ever lose it completely. I think each replica can do this, by moving any pruned entries to a Sequence owned by the replica so that the full log can be reconstructed, while only a shorter one is needed for most operations. This will preserve the ability to roll any FileTree back through all its previous states, or to merge any two FileTree replicas regardless of age and pruning. Maybe this is a ‘for later’ feature, but we could make preserving pruned entries the priority and implement features to make use of this later.
It might be helpful for users to be able to control how and when the log is pruned to suit different use cases. This could either be set when creating the first replica, or perhaps something that can be modified at any time.
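
To make the first point concrete, here is a minimal sketch of a log that is pruned without losing anything. It is only an illustration: `LogEntry`, `PrunableLog` and the plain `Vec` standing in for a replica-owned Sequence are all hypothetical names, and entries are assumed to be appended in timestamp order.

```rust
/// One logged operation; `ts` stands in for whatever timestamp the CRDT uses.
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    ts: u64,
    op: String,
}

/// A replica's log split into a short active part and an archive of pruned entries.
#[derive(Default)]
struct PrunableLog {
    archive: Vec<LogEntry>, // stand-in for a Sequence owned by the replica
    active: Vec<LogEntry>,  // assumed to be in timestamp order, oldest first
}

impl PrunableLog {
    /// Append a new entry to the active log.
    fn append(&mut self, entry: LogEntry) {
        self.active.push(entry);
    }

    /// Move every active entry older than `cutoff` into the archive.
    fn prune_before(&mut self, cutoff: u64) {
        let split = self.active.partition_point(|e| e.ts < cutoff);
        self.archive.extend(self.active.drain(..split));
    }

    /// Rebuild the complete history: archived entries followed by active ones.
    fn full_log(&self) -> Vec<LogEntry> {
        self.archive.iter().chain(self.active.iter()).cloned().collect()
    }
}
```

Most operations would only touch `active`, while rollback or a merge with a very old replica could fall back to `full_log()`. The cutoff passed to `prune_before` is where a user-controlled pruning policy could plug in.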
EDIT: Another thought as I think truncation has a second aspect. The first is trimming the log so it doesn’t become so large that performance degrades unacceptably.

The second is deciding to have a cut-off, before which no replica is merged if it has unmerged entries prior to the cut-off. Why might that be useful? I’m not sure, but it occurred to me while I was thinking about the inode problem, trying to avoid clashes for numbers allocated independently. I came up with the following solution which I think would require us to limit how far back merges should be allowed to go. The downside of that is that all replicas will have to sync to stay ahead of the cut-off, or face their changes being blocked. That is recoverable through a user-managed re-sync, though, and I think for most use cases it is unlikely to be an issue. How having a cut-off might help is described in: Allocating inode numbers using pre-allocated blocks

I’ve not thought about this so idiot mode activated… The FileTree will be responsible for this because it will be a part of the CRDT operations, and so it will need to keep track of the metadata for each node, serialise it and retrieve it etc.

As for how to serialise it I’m not sure. Does it make sense to have a column in the log for each attribute or to have a single column for all metadata? In the latter case perhaps the metadata is serialised to a Sequence, in the former to the log itself but using run-length-encoding to avoid having to store values which rarely change, or maybe a bit of both?

Assuming this is just more metadata that’s something we could bolt on later so not an urgent question, but in general I’d say we try to implement everything we can in the long term in ways that don’t hurt performance until they are made use of by an application.

The init() operation is used to inform us about the characteristics and parameters of the FUSE kernel, and allows us to tweak some things if we want. Nothing exciting, but will need attention at some point (see docs).

You may be more interested in how we start up when a user tries to create or mount a SAFE FS filesystem. I expect we’ll have an API for creating a new FileTree object and from that object gain access to a SAFE FS API which SAFE FUSE can use to provide the filesystem operations on the FileTree.

So for example a safe-fuse CLI will be invoked with parameters telling it to either create a brand new filesystem (i.e. create an empty FileTree object for it), or an existing replica (i.e. fetch an existing FileTree from the network), and mount it at a given path. Once safe-fuse has a FileTree it obtains a SAFE FS API from it, then passes its FUSE operations to libfuse along with the requested mount point.

(Not necessarily in that order! As in, it might be better to create the mount first but block any operations until we have obtained the FileTree rather than keep the user waiting for that before failing the operation because they gave an invalid mount point.)

Another ‘I think’: from the application side you must always start with a path (you can’t just have an inode number and begin using that, is my understanding - except perhaps for the ‘root’). From the path you can obtain an inode which identifies the thing for subsequent operations on the same thing (typically a file or directory). This is hidden from view when using high-level FUSE. With low-level FUSE, you call lookup() to get an inode number, which you then use for subsequent operations on the object it represents (e.g. a file or directory). The lookup() increments a hard-link count for the inode, so your process now effectively has a hard-link to the object, which is released when you call forget() which decrements the hard-link count. I see these hard-links as different (ephemeral) from hard-links from a directory entry to a file (enduring), and they may be handled differently in our implementation. For example, if a process crashes before calling forget() it is probably better for the ‘link’ to die with it than to leave the inode in a state where its hard-link count will not reach zero when all its other hard-links have been removed.

Part of my to-do list is to look into this: how important it is that SAFE FS inodes correspond to typical filesystem inodes, as it may be useful to deviate from this if it makes life easier for the FileTree/TreeCRDT implementation and its interface to SAFE FS.

I imagine a typical filesystem initialises an empty fixed-size table of inode references which has entries filled in or erased as file system objects are created and destroyed.

In terms of the APIs, the filesystem will call our FUSE API with a path in order to create an inode. We respond with the inode number, and it will use that for subsequent operations. So we would probably create an entry in the tree at this point, allocate it a u64 and respond with that.

For example:

OS-FUSE obtains the inode number for an entry in the root (‘/’) using an inode number of 1, so to get the inode for ‘/tmp’ it will call lookup(req_handle, 1, 'tmp')
If ‘/tmp’ exists we respond with a value, which I’ll call tmp_inode (which might be 2).
OS-FUSE then calls our create(req, tmp_inode, 'bar') to create the file entry.
We find the FileTree node which has an id of tmp_inode, create a child entry with name ‘bar’ and allocate an inode number for the entry (bar_inode) and initialise the metadata for the entry (e.g. creation time).
We respond with the inode number (bar_inode) of the new entry.
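
The sequence above can be sketched with a toy in-memory structure. This is not the FUSE API or the FileTree, just an illustration of the flow: `ToyFs`, `Node` and `ROOT_INO` are hypothetical names, the root is given inode number 1 as in the example, and request handles, metadata and error codes are left out.

```rust
use std::collections::HashMap;

const ROOT_INO: u64 = 1; // root inode number used in the example above

struct Node {
    parent: u64,
    name: String,
    is_dir: bool,
}

struct ToyFs {
    nodes: HashMap<u64, Node>,
    next_ino: u64,
}

impl ToyFs {
    fn new() -> Self {
        let mut nodes = HashMap::new();
        // The root is its own parent and has an empty name.
        nodes.insert(ROOT_INO, Node { parent: ROOT_INO, name: String::new(), is_dir: true });
        ToyFs { nodes, next_ino: ROOT_INO + 1 }
    }

    /// Resolve `name` inside directory `parent`, as lookup(req, parent, name) would.
    fn lookup(&self, parent: u64, name: &str) -> Option<u64> {
        self.nodes
            .iter()
            .find(|(ino, n)| **ino != ROOT_INO && n.parent == parent && n.name == name)
            .map(|(ino, _)| *ino)
    }

    /// Add a child entry under `parent` and allocate an inode number for it.
    /// Returns None if the parent is not a directory or the name is taken.
    fn create(&mut self, parent: u64, name: &str, is_dir: bool) -> Option<u64> {
        let parent_is_dir = self.nodes.get(&parent).map_or(false, |n| n.is_dir);
        if !parent_is_dir || self.lookup(parent, name).is_some() {
            return None;
        }
        let ino = self.next_ino;
        self.next_ino += 1;
        self.nodes.insert(ino, Node { parent, name: name.to_string(), is_dir });
        Some(ino)
    }
}
```

With this, `lookup(ROOT_INO, "tmp")` yields tmp_inode and `create(tmp_inode, "bar", false)` yields bar_inode. The linear scan in `lookup` is only for brevity; a real implementation would index children by parent and name.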

If that’s not complicated enough there’s a bit more to it than this so it is going to need a lot of reading and careful thought to ensure a clean, robust design.

Just read this part (possibly again), and it is what I had concluded after lots of thinking about how to implement hard-links. TL;DR: it works so long as we only need links to leaf nodes. This is ok IMO because only MacOS allows hard-links to directories.

Re a): Where the inode has only one reference, i.e. for directories, symlinks and files which don’t have extra hard-links, we could use the FileTree/TreeCRDT node to hold the metadata of the inode. In this case a file is just metadata, including an xor address (similar to a FilesContainer entry). So for most cases we just need a way to map between inode number and FileTree/TreeCRDT node in order to access the data of an inode.

In the case of a hard linked file, we will need a separate object to hold the inode data which can be referenced by every entry which hard-links to it from a directory. I have a scheme in mind where we store that in a Sequence but I won’t go into details here. If this passes scrutiny, the nice part is that we can start without support for hard-links and add it later.

Re b): I need a recap of this. Can you point me at anything or write a post explaining how the UUID is used and how the CRDT operates in this area? My head is filled with FUSE and I need a refresh of the CRDT side (or maybe a summary doc of what you envisage would help).

I agree with building in that way, and also find it useful to set out what I’m aiming at long term even if we end up somewhere else! I can also try to see the stepping stones we can use along that incremental path (e.g. the point about being able to add hard links later).

danda · July 31, 2020, 5:49pm

@happybeing good stuff. I’m headed out the door now. Will give it some thought and get back to you later.

re UUID, the tree is logically composed of a set of triples: (parent_id, metadata, child_id). There is a requirement that the parent_id and child_id must be globally unique, such that any actor (replica) can create one and it will not conflict with an ID created by another actor.

The metadata can contain things like name, size, etc. So we could stick a u64 in there, perhaps generated by some kind of hash of the uuid (simplest, though collisions possible). Or possibly it could be implemented as a g_counter, which is a crdt grow-only counter type. We probably would need to somehow index name and ino fields of the metadata for fast lookup in large-directory situations.
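
As a rough sketch of the "hash of the uuid" option: the code below derives a u64 from a 128-bit id and detects, locally, when two ids land on the same number. The function names are hypothetical, and `DefaultHasher` is used only because it ships with the standard library. Its algorithm is not guaranteed to stay the same across Rust releases, so a real implementation would need a fixed, specified hash so that every replica derives the same number.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Derive a 64-bit inode number from a 128-bit node id.
fn ino_from_uuid(uuid: u128) -> u64 {
    let mut hasher = DefaultHasher::new();
    uuid.hash(&mut hasher);
    hasher.finish()
}

/// Record ino -> uuid in `index`. Returns Err(other_uuid) if a different
/// uuid has already been assigned the same inode number (a collision).
fn assign(index: &mut HashMap<u64, u128>, uuid: u128) -> Result<u64, u128> {
    let ino = ino_from_uuid(uuid);
    match index.get(&ino) {
        Some(&other) if other != uuid => Err(other),
        _ => {
            index.insert(ino, uuid);
            Ok(ino)
        }
    }
}
```

The `index` map doubles as the ino lookup mentioned above. What to do on `Err` (rehash, probe, reject) is left open here.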

danda · August 1, 2020, 1:28am

Ok, I’ve been thinking more about how to implement hard-links, and I have an idea.

first, the issue is that:

rather than placing file data directly in the leaf nodes of the tree, the leaf node must reference the file inode. Thus references to the same file inode can appear in multiple leaf nodes of the tree

Now my original thinking was that:

1. Each tree node is an inode entry.
2. Our file content is stored as an ImmutableData and referenced by XorName from leaf nodes.
3. Given 1 and 2, multiple leaf nodes may point at the same XorName, which effectively implements hard-links (for files only).

So that looks like:

root
  - file1 (uuid<555>, ino<5>, meta...)  ------->  <XorName1> --------> file content
  - file2 (uuid<556>, ino<6>, meta...)  ------->  <XorName1> ----/

But that isn’t quite what is described in the above quote.

I believe “the unix way” and what is described in the quote above, looks like:

root
  - file1 (ino<5>)   --------->  inode_entry(ino<5>, ref_cnt<2>, meta...) --> file contents
  - file2 (ino<5>)   ----/

So, the challenge then is how to achieve the indirection in the second example, using our crdt-tree.

My first thought was that we could store the inode_entry data in an immutable object, but this is pretty terrible because it’s basically useless for working offline, would have super slow lookups, the XorName is based on changing metadata, etc, etc. I hate it.

So, the question becomes: how do we store the inode_entry in the crdt-tree itself? Well, I think there is a way.

The paper discusses that the data type supports multiple root nodes, not just one. The authors call this a forest, and suggest a method to implement deletions by creating a trash node as a sibling to filesystem root, and each delete (final unlink) is implemented as a move to the trash. I recently implemented a test case for this, and the output visually demonstrates:

$ php tree.php test_move_to_trash
Initial tree
- forest
  - root
    - home
      - bob
        - project
          - a
            - a
            - b
          - b
            - a
            - b
  - trash

After project moved to trash (deleted) on both replicas
- forest
  - root
    - home
      - bob
  - trash
    - project
      - a
        - a
        - b
      - b
        - a
        - b

Delete op is now causally stable, so we can empty trash:
- forest
  - root
    - home
      - bob
  - trash

So my idea now is that we can create another sibling of root called inodes and we store each inode_entry as a child of inodes, in the tree-node’s metadata, eg:

- forest
  - root
    - file1
    - file2
  - inodes
    - 125
    - 126
    - 127
  - trash

So going back to our original examples, now we get:

- root
  - file1 (uuid<555>,ino<5>, inode_uuid<123>)
  - file2 (uuid<556>,ino<5>, inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<2>, meta...)  --> XorName ---> file content
- trash

When unlinking file2, file2 node would be moved to trash and the inode ref_cnt decremented, and then we have:

- root
  - file1 (uuid<555>,ino<5>, inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<1>)  --> XorName ---> file content
- trash
  - file2 (uuid<556>,ino<5>, inode_uuid<123>)

and when unlinking file1, then both file1 and the inode entry would be moved to trash, and only root is left.

- root
- inodes
- trash
  - file1 (uuid<555>,ino<5>, inode_uuid<123>)
  - file2 (uuid<556>,ino<5>, inode_uuid<123>)
  - (uuid<123>, ino<5>, ref_cnt<0>)  --> XorName ---> file content
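
The link/unlink bookkeeping in these examples can be sketched as follows. This is a simplification that uses plain maps rather than tree move operations, and `Forest`, `InodeEntry` and the method names are hypothetical. Directory entries hold only an inode_uuid, and the inode entry goes to trash when its ref_cnt reaches zero.

```rust
use std::collections::HashMap;

struct InodeEntry {
    ref_cnt: u32,
    xor_name: String, // placeholder for the XorName of the file content
}

#[derive(Default)]
struct Forest {
    root: HashMap<String, u128>,       // directory entry name -> inode_uuid
    inodes: HashMap<u128, InodeEntry>, // children of the `inodes` node
    trash_entries: Vec<String>,        // unlinked directory entries
    trash_inodes: Vec<u128>,           // inode entries whose last link is gone
}

impl Forest {
    /// Create a file: one inode entry plus one directory entry pointing at it.
    fn create(&mut self, name: &str, inode_uuid: u128, xor_name: &str) {
        self.inodes.insert(inode_uuid, InodeEntry { ref_cnt: 1, xor_name: xor_name.to_string() });
        self.root.insert(name.to_string(), inode_uuid);
    }

    /// Hard-link: a second directory entry for an existing inode entry.
    fn link(&mut self, existing: &str, new_name: &str) -> bool {
        let uuid = match self.root.get(existing) {
            Some(&uuid) => uuid,
            None => return false,
        };
        if let Some(inode) = self.inodes.get_mut(&uuid) {
            inode.ref_cnt += 1;
            self.root.insert(new_name.to_string(), uuid);
            return true;
        }
        false
    }

    /// Unlink: trash the directory entry, decrement ref_cnt, and trash the
    /// inode entry too once no links remain.
    fn unlink(&mut self, name: &str) -> bool {
        let uuid = match self.root.remove(name) {
            Some(uuid) => uuid,
            None => return false,
        };
        self.trash_entries.push(name.to_string());
        let last_link_gone = match self.inodes.get_mut(&uuid) {
            Some(inode) => {
                inode.ref_cnt -= 1;
                inode.ref_cnt == 0
            }
            None => false,
        };
        if last_link_gone {
            self.inodes.remove(&uuid);
            self.trash_inodes.push(uuid);
        }
        true
    }

    /// Follow a directory entry through its inode entry to the content address.
    fn content_addr(&self, name: &str) -> Option<&str> {
        let uuid = self.root.get(name)?;
        self.inodes.get(uuid).map(|i| i.xor_name.as_str())
    }
}
```

In the crdt-tree itself each of these steps would presumably be one or more move operations, so that the change to ref_cnt and the move to trash replicate together.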

Hopefully that is clear enough for others to follow my train of thought. @happybeing please let me know what you think.

I think this works. At least I see a clear enough path that I should be able to try out a test case with the prototype, as I did for moves to trash and emptying trash.

happybeing · August 1, 2020, 9:24am

This is genius @danda and I’m kicking myself for not thinking of it. I can’t see any problems with this, and it seems very clean.

I need to look into a question I have about inode reference counts and a couple of other things that I don’t understand, and will get onto the test FUSE with the aim of having it ready by Friday.

I might try using the polyfuse which David pointed to as a way of understanding whether it’s a good option.

Does that sound sensible?

EDIT: I found a good description of POSIX filesystem implementation which explains how the things I’ve summarised in the POSIX topic fit together, including how inodes are created and manipulated (here) followed by the set of inode operations (here).

It’s well worth a read, and I’m wondering if it makes sense for us to try to mirror these APIs.

I suggest we try, because this will act as documentation for how things work without us having to write it, and we might also find a filesystem written in Rust that we can sit on top of this to implement our SAFE FS API, which could be a big win. Anyway, have a read and let’s try and figure out if we can do this on top of your node tree.

happybeing · August 1, 2020, 4:04pm

Just catching up on these which I’ve not answered so far…

Fair enough, I’ll just remove that suggestion, and I’ve added a note that we may implement this using existing traits such as fuse::Filesystem, and generate language bindings that match the FUSE APIs from this.

Consider all that just a placeholder (putting something there encourages readers to think about it).

As mentioned I’m starting to think about this. I keep finding holes in my knowledge and having to go look for info, so it is proving a slow start but getting there. There are several examples I could use for a framework.

What I need to do is define how it will interact with your FileTree, which I’ll share as soon as I think I have something, or if you get there first please tell me what will work for you.

happybeing · August 2, 2020, 10:01am

@danda from your last two replies can I take it that you have a workable plan (both re UUID as perhaps G-counter, and for hard-links using refs to a list of inodes)?

I’ll assume so for now. BTW, in my search for Rust crates related to filesystems I turned up a few interesting things. The following might be useful for testing:

assert_fs - Filesystem fixtures and assertions for testing (lib.rs)
test-generator - enumerating entries according to file-system pattern and generating a test function for each entry (lib.rs)
integrity-checker - for backups and filesystems (lib.rs)

If you can do so quickly, please can you summarise the operations we will have available on the TreeCRDT so I can look into how to implement the FS APIs using them. For example, is moving a parent node an atomic API or do we have the option to move it in steps such as: 1) remove inode-A from its parent inode-B, 2) add inode-A as child of inode-C.

Also, what would be the operations to mutate metadata?

If you don’t have time I assume I can get them from the paper.

danda · August 2, 2020, 2:52pm

minor correction: ino(u64) as g_counter.

Currently both are untested ideas, so “hopefully workable plan” might be more accurate. I intend to experiment with them this week.

There is only one operation on a crdt-tree node. The move operation, which is atomic.

The tree-crdt move op looks like: move(parent_id, metadata, child_id). where:

parent_id and child_id are uuid.
metadata is application-defined. (FileTree will define a structure)

Logical op to move op mapping:

creation of a new node is moving a child with never before seen uuid.
creation of a root node is moving a child to parent_id === null/none.
An unlink/delete is a move to trash.
A rename is a move to the same parent, with modified metadata. (as any metadata change)

Having only a single type of operation makes the crdt easier to reason about.
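
To illustrate the mapping, here is a small sketch in which every logical operation is built as the same `Move` value. It only tracks the local (parent, metadata) state per child. There is no timestamp, cycle check or undo/redo log, so it is not the crdt algorithm from the paper. The helper names and the use of a plain string for metadata are assumptions.

```rust
use std::collections::HashMap;

type Id = u128; // stand-in for a uuid

#[derive(Clone, Debug, PartialEq)]
struct Move {
    parent_id: Option<Id>, // None models parent_id === null/none (a root node)
    metadata: String,      // application-defined; just a name here for brevity
    child_id: Id,
}

/// Minimal local tree state: child_id -> (parent_id, metadata).
#[derive(Default)]
struct Tree {
    nodes: HashMap<Id, (Option<Id>, String)>,
}

impl Tree {
    /// Apply one move. A never-before-seen child_id creates the node.
    fn apply(&mut self, op: Move) {
        self.nodes.insert(op.child_id, (op.parent_id, op.metadata));
    }
}

/// Creating a root node is a move with no parent.
fn create_root(id: Id, name: &str) -> Move {
    Move { parent_id: None, metadata: name.to_string(), child_id: id }
}

/// Creating a node is a move of a new child_id under `parent`.
fn create(parent: Id, id: Id, name: &str) -> Move {
    Move { parent_id: Some(parent), metadata: name.to_string(), child_id: id }
}

/// A rename is a move to the same parent with modified metadata.
fn rename(tree: &Tree, id: Id, new_name: &str) -> Option<Move> {
    let (parent, _) = tree.nodes.get(&id)?;
    Some(Move { parent_id: *parent, metadata: new_name.to_string(), child_id: id })
}

/// An unlink/delete is a move to the trash node, metadata unchanged.
fn unlink(trash: Id, tree: &Tree, id: Id) -> Option<Move> {
    let (_, meta) = tree.nodes.get(&id)?;
    Some(Move { parent_id: Some(trash), metadata: meta.clone(), child_id: id })
}
```

Because every helper returns a `Move`, the same value can be applied locally and also appended to the op log for other replicas.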

danda · August 2, 2020, 3:28pm

@happybeing an interesting aspect of the paper we will eventually need to deal with is this:

One final type of conflict that we have not discussed so far is multiple child nodes with the same parent and the same metadata. For example in a filesystem, two users could concurrently create files with the same name in the same directory. Our algorithm does not prevent such a conflict, but simply retains both child nodes. In practice, the collision would be resolved by making the filenames distinct, e.g. by appending a replica identifier to the filenames.

Appending a replica identifier to each filename is a bit gross, but may be best option.
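
One way to keep that less gross might be to append the identifier only when presenting names, and only where a clash actually exists. A minimal sketch, assuming each child records which replica created it (`Child`, `replica_id` and the `name.replica` format are all hypothetical):

```rust
use std::collections::HashMap;

/// One child of a directory as seen after merging replicas.
struct Child {
    name: String,
    replica_id: u32, // hypothetical: id of the replica that created the entry
}

/// Return display names, suffixing the replica id only where names clash.
fn display_names(children: &[Child]) -> Vec<String> {
    // First pass: count how many children share each name.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for c in children {
        *counts.entry(c.name.as_str()).or_insert(0) += 1;
    }
    // Second pass: only clashing names get the replica suffix.
    children
        .iter()
        .map(|c| {
            if counts[c.name.as_str()] > 1 {
                format!("{}.{}", c.name, c.replica_id)
            } else {
                c.name.clone()
            }
        })
        .collect()
}
```

This leaves the stored metadata untouched. It does not handle the case where a suffixed name happens to equal another existing name, or where one replica creates the same name twice.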

This also relates to storing the ino(u64) in the metadata. If each replica were to simply increment the last known ino(u64) when adding a node, then we would get conflicts. I further realized gcounter doesn’t really help with ino identifiers because it is just a shared counter, there is no global uniqueness guarantee. So maybe we are back to the idea of hashing the uuid (unless/until a better strategy presents itself).

happybeing · August 3, 2020, 1:37pm

I made a post about solving the inode number allocation problem with a couple of new ideas. We can just use the UUID hash for now, but these might be options for improvement:

danda · August 3, 2020, 6:20pm

danda:

- root
  - file1 (uuid<555>,ino<5>, inode_uuid<123>)
  - file2 (uuid<556>,ino<5>, inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<2>, meta...)  --> XorName ---> file content
- trash

I think/hope this can be simplified to:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<2>, meta...)  --> XorName ---> file content
- trash

Here we have removed the redundant ino<5> field from file1 and file2 metadata.

The inode_entry can still be looked up directly via 5.

The cost is that when looking up, ie lookup(), the inode number via path, eg /root/file1, an indirection (1 index lookup) is required to /inodes/123 to find ino<5>.

The benefit is that we save disk space and bandwidth because we are not duplicating the ino field everywhere.

@happybeing thoughts?

happybeing · August 3, 2020, 7:25pm

That makes sense as far as I can tell at this stage.

Let me also answer by describing what I think this looks like from the filesystem API viewpoint.

The filesystem doesn’t need to know about what you’ve labelled inodes or trash, so we’re only concerned with this part of the TreeCRDT:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)

Which corresponds to a file structure on the mounted device of:

/file1
/file2

Where both the above are links to the same file, ino<5>.

In the filesystem, the root is also an inode (directory ino<1>) with at least two directory entries (for . and ..) plus in this example two more for file1 and file2.

We don’t need to store entries for . and .., instead we can fill them out when needed using the self and parent ino numbers. We could also fake the root inode but I think it will simplify the code in both safe-fs and TreeCRDT to have it in the tree structure.

So, in the example (and for the root directory only) both . and .. will resolve to the root inode ino<1>, whereas in all other cases .. resolves to the ino of the parent. And file1 and file2 both resolve to ino<5>.
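
Filling in . and .. on demand could look something like this. `Dir` is a hypothetical shape for whatever the tree hands back for a directory, not an existing API:

```rust
/// A directory as the tree might expose it (hypothetical shape).
struct Dir {
    ino: u64,
    parent_ino: Option<u64>, // None for the root directory
    children: Vec<(String, u64)>,
}

/// List directory entries, synthesising `.` and `..` instead of storing them.
fn readdir(dir: &Dir) -> Vec<(String, u64)> {
    let mut out = vec![
        (".".to_string(), dir.ino),
        // For the root, `..` resolves to the root itself.
        ("..".to_string(), dir.parent_ino.unwrap_or(dir.ino)),
    ];
    out.extend(dir.children.iter().cloned());
    out
}
```

For the example root (ino 1 with file1 and file2 both at ino 5) this produces four entries, with both . and .. pointing at ino 1.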

Note that in filesystem implementations (eg the Linux VFS) ino<0> is used to initialise an empty directory entry, then the inode object is created and metadata initialised (eg ino, creation time, group, owner, permissions etc) and the ino can be assigned to the directory entry.

So I think you need to respect the conventions for ino<0> and ino<1>, but other than that allocating them can be left to the TreeCRDT.

Feels like it’s coming together!

EDIT: I got the idea that the root directory has ino<1> from the ‘hello’ FUSE examples, but it seems not always the case and I’ve not yet found out how you discover the ino of root! I’ll leave the above for now, but note:

…most contemporary Unix and Unix-like systems seem to use inode number 1 for tracking bad blocks, pushing the root up to inode number 2 (ref to)

…until I find something definitive.

danda · August 3, 2020, 10:24pm

Just a note about this…

ok, so i’ve started working on a little faux filesystem prototype/test case that will implement apis such as lookup(), mkdir(), create(), link(), unlink(), etc.

The first api I started with is lookup(). It (primarily) returns an ino, but according to some docs I found, it is also supposed to increment the inode’s ref_cnt. Even though it is a purely local, read-only operation.

So that brings something into somewhat sharper focus. If I implement this as a move operation, it means that ops will flow across the network for purely local reads. That’s all fine and good for consistency, but it seems a bit overkill/inefficient.

I’m a bit tempted to implement it that way for now though, as I believe it was Knuth who said premature optimization is the root of all evil.

happybeing · August 3, 2020, 10:34pm

Are you sure the ref count is the same as the link count? I thought it read like that at first reading but it doesn’t make sense. So I’m thinking there’s an in use count and a link count, but at this point that’s just my supposition.

One reason I rejected the first idea is that if a running process crashes after doing lookup, you risk using up inodes because they can never be deleted.

danda · August 3, 2020, 10:51pm

yeah, I wondered that too, and I believe you may be right. Will check the docs in more detail.

But it mostly boils down to the same thing. Either all of the changes in the crdt tree nodes are stored in the op log and (eventually) go across the network, or they don’t.

I/we need to think more about ramifications if some do not. For one thing, those local mods would presumably not be stored in the operations log, so presumably would not be recreated in the state after an undo/redo apply cycle. Possibly a 2nd log for local ops must be kept. At which point I start to wonder if the data type would still meet crdt requirements.

Another thought: fuse has a bunch of flags for which features a filesystem supports or not. So maybe there is a way to say we don’t support ref_cnt for lookups and other read ops…

danda · August 3, 2020, 10:57pm

ok, thinking about it some more…

I suppose we could have a fully separate data structure outside the crdt-tree that stores local-only info about each inode, since time of mount, such as use_cnt. Possibly this could live in memory, especially if we only need it for use_cnt, then only nodes that are actually in use need be stored.
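
A minimal sketch of such a local-only table, kept entirely outside the crdt-tree. The names are hypothetical; the count goes up each time an ino is handed out and comes down when the kernel says it no longer needs it, and entries are dropped once unused so only in-use inodes occupy memory.

```rust
use std::collections::HashMap;

/// Local-only, in-memory table of use counts, keyed by ino.
/// Nothing here is written to the op log or sent over the network.
#[derive(Default)]
struct LookupCounts {
    counts: HashMap<u64, u64>,
}

impl LookupCounts {
    /// Called whenever an ino is handed out (e.g. by lookup or create).
    fn on_lookup(&mut self, ino: u64) {
        *self.counts.entry(ino).or_insert(0) += 1;
    }

    /// Release `n` references to `ino`. Returns true once the count has
    /// reached zero, at which point the entry is dropped from the table.
    fn on_forget(&mut self, ino: u64, n: u64) -> bool {
        let remaining = match self.counts.get_mut(&ino) {
            Some(count) => {
                *count = count.saturating_sub(n);
                *count
            }
            None => 0,
        };
        if remaining == 0 {
            self.counts.remove(&ino);
        }
        remaining == 0
    }
}
```

Since the table is rebuilt from empty at each mount, a process or mount that dies without releasing its references cannot leave a stale count behind in the replicated tree.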

happybeing · August 3, 2020, 11:02pm

Please keep and share your list of useful reference docs. It seems hard to find some stuff.

I’m hopeful we can use the sync operation as the boundary, because the filesystem operations must have been designed to cope with a similar (hopefully equivalent) problem. So we can try and match the filesystem API with what we can do in a FileTree, and see how well that works.

That’s what I’ve been thinking, but is yet to be established.

danda · August 3, 2020, 11:18pm

basically these:

C

Rust

danda · August 3, 2020, 11:57pm

Seems that the lookup refcount is optional. from the fuse-rs docs:

fn forget(&mut self, _req: &Request, _ino: u64, _nlookup: u64)

Forget about an inode. The nlookup parameter indicates the number of lookups previously performed on this inode. If the filesystem implements inode lifetimes, it is recommended that inodes acquire a single reference on each lookup, and lose nlookup references on each forget. The filesystem may ignore forget calls, if the inodes don’t need to have a limited lifetime. On unmount it is not guaranteed, that all referenced inodes will receive a forget message.

more detail, from http://fuse.996288.n3.nabble.com/forget-inodes-td9599.html

The VFS-fuse will lookup() based on a parent inode and a file name, resulting in an inode with a lookup count incremented by 1. The lookup count is also set to 1 for creation of a file. It is possible that you will receive multiple lookup requests, in which case you are to increase your lookup counter.

While the inode is looked-up, the file-system should keep a record in it’s cache because it can expect operations on that inode directly.

The kernel may forget inodes under three circumstances:

The inode is deleted and it is not open.

Pressure on the cache causes the kernel to free some cached information including that for the inode.

The file-system is unmounted (similar to 2, but freeing everything about the file-system) - although perhaps the default fuse implementation here is to suppress forget operations in this case (in the multi-threaded fuse loop function).

Essentially you should be keeping track of the inode while you have a lookup count that is larger than 0 so that you can do operations on it instantly without having to lookup the inode yourself.

Unrelated, but interesting… it seems the ino can be ephemeral. So it could also be stored in the local-only data, mapped to/from the inode_entry.

The ‘generation’ field is important if your inode number generator may generate different inode numbers at a different times for the same object. This is uncommon for on-disk file systems, but it may happen for network file systems (like NFS, see 1).

It is mentioned in 1 that a server may use a different set of (fuse) inode numbers/(nfs) file handles after a restart. If that happens, it is possible that the new inode numbers map to objects in a different way then the inode numbers which were given out before the server restart. A client could use a different generation number for the set of inode before the restart and for the set of inodes after the restart to make clear which inode is meant.

If your file system has a static generation scheme for inodes (where a inode number always points to the same object), there is no need to use the generation number and it may be used to extend the inode number.

danda · August 4, 2020, 6:49am

Ok, so the generation (ephemeral) inode thing from my previous comment seems to solve our inode collision problem… because it means that ino can remain a strictly local thing, which is great, and makes a lot of sense for any kind of networked file system.

So our previous example becomes:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)
- inodes
  - (uuid<123>, meta...)  --> XorName ---> file content
- trash

plus, we have a purely local hashmap/dictionary/index of uuid => local_inode_entry:

123 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

and another that is ino => local_inode_entry:

5 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

This way, we can quickly lookup by the ino to find uuid (eg for stat) or by the uuid to find the ino (eg for lookup).
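
A sketch of the two local indexes, with one small variation: the entry is stored once under its ino and the second map only goes from uuid to ino, which avoids keeping two copies of the entry in sync. All names are hypothetical, and the allocation rule (start above 1, leaving 0 and 1 alone) is just an assumption for illustration.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct LocalInodeEntry {
    ino: u64,
    uuid: u128,
    lookup_cnt: u64,
}

/// Purely local state, rebuilt from scratch at each mount.
#[derive(Default)]
struct LocalInodes {
    by_ino: HashMap<u64, LocalInodeEntry>, // ino => local_inode_entry
    ino_by_uuid: HashMap<u128, u64>,       // uuid => ino
    last_ino: u64,
}

impl LocalInodes {
    /// Return the local ino for `uuid`, allocating one on first sight.
    fn ino_for(&mut self, uuid: u128) -> u64 {
        if let Some(&ino) = self.ino_by_uuid.get(&uuid) {
            return ino;
        }
        // Assumption: 0 and 1 are never allocated here.
        self.last_ino = self.last_ino.max(1) + 1;
        let ino = self.last_ino;
        self.ino_by_uuid.insert(uuid, ino);
        self.by_ino.insert(ino, LocalInodeEntry { ino, uuid, lookup_cnt: 0 });
        ino
    }

    /// lookup path: the tree has resolved a name to `uuid`; return its ino
    /// and count the lookup.
    fn lookup(&mut self, uuid: u128) -> u64 {
        let ino = self.ino_for(uuid);
        if let Some(entry) = self.by_ino.get_mut(&ino) {
            entry.lookup_cnt += 1;
        }
        ino
    }

    /// stat path: from ino back to the uuid of the inode entry in the tree.
    fn uuid_for(&self, ino: u64) -> Option<u128> {
        self.by_ino.get(&ino).map(|e| e.uuid)
    }
}
```

Looking up uuid 123 twice returns the same ino both times and leaves lookup_cnt at 2, matching the example entry above apart from the actual ino value.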

It would be nicer if we could use a single index, but I don’t immediately see a good way to do that at this late hour.