FileTree CRDT for SAFE Network

happybeing · August 1, 2020, 9:24am

This is genius @danda and I’m kicking myself for not thinking of it. I can’t see any problems with this, and it seems very clean.

I need to look into a question I have about inode reference counts and a couple of other things that I don’t understand, and will get onto the test FUSE with the aim of having it ready by Friday.

I might try using the polyfuse which David pointed to as a way of understanding whether it’s a good option.

Does that sound sensible?

EDIT: I found a good description of POSIX filesystem implementation which explains how the things I’ve summarised in the POSIX topic fit together, including how inodes are created and manipulated (here) followed by the set of inode operations (here).

It’s well worth a read, and I’m wondering if it makes sense for us to try to mirror these APIs.

I suggest we try, because this will act is documentation for how things work without us having to write it, and we might also find a filesystem written in Rust that we can sit on top of this to implement our SAFE FS API, which could be a big win. Anyway, have a read and let’s try and figure out if we can do this on top of your node tree.

happybeing · August 1, 2020, 4:04pm

Just catching up on these which I’ve not answered so far…

Fair enough, I’ll just remove that suggestion, and I’ve added a note that we may implement this using existing traits such as fuse::Filesystem, and generate language bindings that match the FUSE APIs from this.

Consider all that just a placeholder (putting something there encourages readers to think about it).

As mentioned I’m starting to think about this. I keep finding holes in my knowledge and having to go look for info, so it is proving a slow start but getting there. There are several examples I could use for a framework.

What I need to do is define how it will interact with your FileTree which I’ll share as soon as I think I have somehting, or if you get there first please tell me what will work for you.

happybeing · August 2, 2020, 10:01am

@danda from your last two replies can I take it that you have a workable plan (both re UUID as perhaps G-counter, and for hard-links using refs to a list of inodes)?

I’ll assume so for now. BTW, in my search for Rust crates related to filesystems I turned up a few interesting things. The following might be useful for testing:

assert_fs - Filesystem fixtures and assertions for testing (libs.rs)
test-generator - enumerating entries according to file-system pattern and generating a test function for each entry (libs.rs)
integrity-checker - for backups and filesystems (lib.rs)

If you can do so quickly, please can you summarise the operations we will have available on the TreeCDRT so I can look into how to implement the FS APIs using them. For example, is moving a parent node an atomic API or do with have the option to move it in steps such as: 1) remove inode-A from its parent inode-B, 2) add inode-A as child of inode-C.

Also, what would be the operations to mutate metadata?

If you don’t have time I assume I can get them from the paper.

danda · August 2, 2020, 2:52pm

minor correction: ino(u64) as g_counter.

Currently both are untested ideas, so “hopefully workable plan” might be more accurate. I intend to experiment with them this week.

There is only one operation on a crdt-tree node. The move operation, which is atomic.

The tree-crdt move op looks like: move(parent_id, metadata, child_id). where:

parent_id and child_id are uuid.
metadata is application-defined. (FileTree will define a structure)

Logical op to move op mapping:

creation of a new node is moving a child with never before seen uuid.
creation of a root node is is moving a child to parent_id === null/none.
An unlink/delete is a move to trash.
A rename is a move to the same parent, with modified metadata. (as any metadata change)

Having only a single type of operation makes the crdt easier to reason about.

danda · August 2, 2020, 3:28pm

@happybeing an interesting aspect of the paper we will eventually need to deal with is this:

One final type of conflict that we have not discussed so far is multiple child nodes with the same parent and the same metadata. For example in a filesystem, two users could concurrently create files with the same name in the same directory. Our algorithm does not prevent such a conflict, but simply retains both child nodes. In practice, the collision would be resolved by making the filenames distinct, e.g. by appending a replica identifier to the filenames.

Appending a replica identifier to each filename is a bit gross, but may be best option.

This also relates to storing the ino(u64) in the metadata. If each replica were to simply increment the last known ino(u64) when adding a node, then we would get conflicts. I further realized gcounter doesn’t really help with ino identifiers because it is just a shared counter, there is no global uniqueness guarantee. So maybe we are back to the idea of hashing the uuid (unless/until a better strategy presents itself).

happybeing · August 3, 2020, 1:37pm

I made a post about solving the inode number allocation problem with a couple of new ideas. We can just use the UUID hash for now, but these might be options for improvement:

danda · August 3, 2020, 6:20pm

danda:

- root
  - file1 (uuid<555>,ino<5>, inode_uuid<123>)
  - file2 (uuid<556>,ino<5>, inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<2>, meta...)  --> XorName ---> file content
- trash

I think/hope this can be simplified to:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)
- inodes
  - (uuid<123>, ino<5>, ref_cnt<2>, meta...)  --> XorName ---> file content
- trash

Here we have removed the redundant ino<5> field from file1 and file2 metadata.

The inode_entry can still be looked up directly via 5.

The cost is that when looking up, ie lookup(), the inode number via path, eg /root/file1, an indirection (1 index lookup) is required to /inodes/123 to find ino<5>.

The benefit is that we save disk space and bandwidth because we are not duplicating the ino field everywhere.

@happybeing thoughts?

happybeing · August 3, 2020, 7:25pm

That makes sense as far as I can tell at this stage.

Let me also answer by describing what I think this looks like from the filesystem API viewpoint.

The filesystem doesn’t need to know about what you’ve labelled nodes or trash, so we’re only concerned with this part of the TreeCDRT:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)

Which corresponds to a file structure on the mounted device of:

/file1
/file2

Where both the above are links to the same file, ino<5>.

In the filesystem, the root is also an inode (directory ino<1>) with at least two directory entries (for . and ..) plus in this example two more for file1 and file2.

We don’t need to store entries for . and .., instead we can fill them out when needed using the self and parent ino numbers. We could also fake the root inode but I think it will simplify the code in both safe-fs and TreeCDRT to have it in the tree structure.

So, in the example (and for the the root directory only) both . and .. will resolve to the root inode ino<1>, whereas all in other cases .. resolves to the ino of the parent. And file1 and file2 both resolve to ino<5>.

Note that in filesystem implementations (eg the Linux VFS) ino<0> is used to initialise an empty directory entry, then the inode object is created and metadata initialised (eg ino, creation time, group, owner, permissions etc) and the ino can be assigned to the directory entry.

So I think you need to respect the conventions for ino<0> and ino<1>, but other than that allocating them can be left to the TreeCDRT.

Feels like it’s coming together!

EDIT: I got the idea that the root directory has ino<1> from the ‘hello’ FUSE examples, but it seems not always the case and I’ve not yet found out how you discover the ino of root! I’ll leave the above for now, but note:

…most contemporary Unix and Unix-like systems seem to use inode number 1 for tracking bad blocks, pushing the root up to inode number 2 (ref to)

…until I find something definitive.

danda · August 3, 2020, 10:24pm

Just a note about this…

ok, so i’ve started working on a little faux filesystem prototype/test case that will implements apis such as lookup(), mkdir, create, link, unlink(), etc.

The first api I started with is lookup(). It (primarily) returns an ino, but according to some docs I found, it is also supposed to increment the inode’s ref_cnt. Even though it is a purely local, read-only operation.

So that brings something into somewhat sharper focus. If I implement this as a move operation, it means that ops will flow across the network for purely local reads. That’s all fine and good for consistency, but it seems a bit overkill/inefficient.

I’m a bit tempted to implement it that way for now though, as I believe it was knuth who said premature optimization is the root of all evil.

happybeing · August 3, 2020, 10:34pm

Are you sure the ref count is the same as the link count? I thought it read like that at first reading but it doesn’t make sense. So I’m thinking there’s an in use count and a link count, but at this point that’s just my supposition.

One reason I rejected the first idea is that if a running process crashes after doing lookup, you risk using up inodes because they can never be deleted.

danda · August 3, 2020, 10:51pm

yeah, I wondered that too, and I believe you may be right. Will check the docs in more detail.

But it mostly boils down to the same thing. Either all of the changes in the crdt tree nodes are stored in the op log and (eventually) go across the network, or they don’t.

I/we need to think more about ramifications if some do not. For one thing, those local mods would presumably not be stored in operations log, so presumably would not not be recreated in the state after undo/redo apply cycle. Possibly a 2nd log for local ops must be kept. At which point I start to wonder if data type would still meet crdt requirements.

Another thought: fuse has a bunch of flags for which features a filesystem supports or not. So maybe there is a way to say we don’t support ref_cnt for lookups and other read ops…

danda · August 3, 2020, 10:57pm

ok, thinking about it some more…

I suppose we could have a fully separate data structure outside the crdt-tree that stores local-only info about each inode, since time of mount, such as use_cnt. Possibly this could live in memory, especially if we only need it for use_cnt, then only nodes that are actually in use need be stored.

happybeing · August 3, 2020, 11:02pm

Please keep and share your list of useful reference docs. It seems hard to find some stuff.

I’m hopeful we can use the sync operation as the boundary, because the filesystem operations must have been designed to cope with a similar (hopefully equivalent) problem. So we can try and match the filesystem API with what we can do in a FileTree, and see how well that works.

That’s what I’ve been thinking, but is yet to be established.

danda · August 3, 2020, 11:18pm

basically these:

C

Rust

danda · August 3, 2020, 11:57pm

Seems that the lookup refcount is optional. from the fuse-rs docs:

fn forget(&mut self, _req: &Request, _ino: u64, _nlookup: u64)

Forget about an inode. The nlookup parameter indicates the number of lookups previously performed on this inode. If the filesystem implements inode lifetimes, it is recommended that inodes acquire a single reference on each lookup, and lose nlookup references on each forget. The filesystem may ignore forget calls, if the inodes don’t need to have a limited lifetime. On unmount it is not guaranteed, that all referenced inodes will receive a forget message.

more detail, from http://fuse.996288.n3.nabble.com/forget-inodes-td9599.html

The VFS-fuse will lookup() based on a parent inode and a file name, resulting in an inode with a lookup count incremented by 1. The lookup count is also set to 1 for creation of a file. It is possible that you will receive multiple lookup requests, in which case you are to increase your lookup counter.

While the inode is looked-up, the file-system should keep a record in it’s cache because it can expect operations on that inode directly.

The kernel may forget inodes under three circumstances:

The inode is deleted and it is not open.

Pressure on the cache causes the kernel to free some cached information including that for the inode.

The file-system is unmounted (similar to 2, but freeing everything about the file-system) - although perhaps the default fuse implementation here is to suppress forget operations in this case (in the multi-threaded fuse loop function).

Essentially you should be keeping track of the inode while you have a lookup count that is larger than 0 so that you can do operations on it instantly without having to lookup the inode yourself.

Unrelated, but interesting… it seems the ino can be ephemeral. So it could also be stored in the local-only data, mapped to/from the inode_entry.

The ‘generation’ field is important if your inode number generator may generate different inode numbers at a different times for the same object. This is uncommon for on-disk file systems, but it may happen for network file systems (like NFS, see 1).

It is mentioned in 1 that a server may use a different set of (fuse) inode numbers/(nfs) file handles after a restart. If that happens, it is possible that the new inode numbers map to objects in a different way then the inode numbers which were given out before the server restart. A client could use a different generation number for the set of inode before the restart and for the set of inodes after the restart to make clear which inode is meant.

If your file system has a static generation scheme for inodes (where a inode number always points to the same object), there is no need to use the generation number and it may be used to extend the inode number.

danda · August 4, 2020, 6:49am

Ok, so the generation (ephemeral) inode thing from my previous comment seems to solve our inode collision problem… because it means that ino can remain a strictly local thing, which is great, and makes a lot of sense for any kind of networked file system.

So our previous example becomes:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)
- inodes
  - (uuid<123>, meta...)  --> XorName ---> file content
- trash

plus, we have a purely local hashmap/dictionary/index of uuid => local_inode_entry:

123 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

and another that is ino => local_inode_entry:

5 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

This way, we can quickly lookup by the ino to find uuid (eg for stat) or by the uuid to find the ino (eg for lookup).

It would be nicer if we could use a single index, but I don’t immediately see a good way to do that at this late hour.

happybeing · August 4, 2020, 10:42am

Yes, I agree this should work and was at some point thinking along these lines - that ino’s could be treated like file handles locally (when I saw the ref counting). I hadn’t confirmed that though so hats off to you.

I don’t think we should worry about the two indexes. If we implement local ref counts the indexes can be purged if they grow too big. It should be easy to optimise later as you recommended earlier.

I think we know enough to think about what lives in each of safe-fuse, safe-fs, FileTree, and TreeCDRT and the boundaries between them.

I’ve been thinking safe-fuse and safe-fs could be thin layers if FileTree provides a POSIX style outer, or we could keep FileTree to managing inodes.

I keep coming round to thinking safe-fs could instead live inside FileTree. I like the idea that from the developers’ perspective they just create a FileTree, and from this can get the level of FS API they want.

No need to decide any of this yet, just sharing thoughts and interested to hear any views. safe-fs can be built separately now and rolled into FileTree later if we choose.

I’ve been occupied elsewhere but plan to work on the FUSE code today and will let you know when there’s something to look at.

happybeing · August 4, 2020, 2:17pm

@danda re:

I’m having a look at using the MemFs example from polyfuse (which is based on the rust Filesystem trait you linked to, but with async support from tokio). It has several examples, including the in memory filesystem which is all in one file.

The MemFs example looks promising because it is compact, defines types which I think we can just use now which look right, e.g. see struct INode (link), and notice it has refcount and links:

struct INode {
    attr: FileAttr,
    xattrs: HashMap<OsString, Arc<Vec<u8>>>,
    refcount: u64,
    links: u64,
    kind: INodeKind,
}

enum INodeKind {
    RegularFile(Vec<u8>),
    Directory(Directory),
    Symlink(Arc<OsString>),
}

At the bottom of the file is the the Filesystem trait and in-between are the implementation functions.

Using FileTreeFS in polyfuse-test

I’ve forked the above as polyfuse-test and added a FileTreeFS example in branch filetree-test.

I made a copy of MemFS example in ./examples/examples/filetreefs.rs (with all occurances of MemFS renamed to FileTreeFS). So it is a copy and runs exactly as the MemFS example.

To try it out:

Build

git clone https://github.com/theWebalyst/polyfuse-test
cd polyfuse-test
git checkout filetree-test

Make a mount point and mount it

cd polyfuse-test
mkdir mount
cargo run --example filetreefs mount

In another terminal

cd polyfuse-test
ls -ail mount
stat mount

I think the above gives you what you were asking for but let me know if not. It’s a working FUSE mount and template FS which can be modified to slot in a FileTree to do the actual lookup(), mknod(), mkdir() etc in the order that’s easiest for you.

I’m not sure if I can do any more code at this point without some instruction from you because I’m still unsure about defining traits etc or writing any code from scratch other than for learning. But if you can set a framework and give me a task I’ll have a go.

I suggest you fork the above and soon as you have something push it so I can have a look and maybe help out with sub-tasks. In the mean time I could pull together the proposed design into the document I started.

happybeing · August 4, 2020, 3:28pm

Thanks for these, added mine below.

Filesystems

A Five-Year Study of File-System Metadata (October 2007, PDF: online, [local](file:///home/mrh/Documents/MaidSafe/Research/filesystems/A%20Five-Year%20Study%20of%20File-System%20Metadata%20(October%202007).pdf))
inode - useful summary of Unix/POSIX implementation of inodes and hard links (Wikipedia)

Note: maybe only @POSIXMacOS allows hard links to directories? (see “Implications”)

UNIX/LINUX File System - Directories, inodes, Hard Links by Ian D. Allen (2019, course notes)
Linux Kernel > Filesystems (kernel.org)
- Linux VFS: superblock object, inode object
- ext4 Ext4 Disk Layout (kernel.org)

See “Special inodes” for special ino values (NOTE: 1 is not root, so maybe we must tell FUSE which is root?)

FUSE

The FUSE Protocol (john-millikin.com)

Gold: FUSE versions, protocol, implementation gotchas and debugging (more on @debugging running FUSE here)

libfuse wiki: Protocol Sketch by Csaba Henk (link)
FUSE protocol tutorial for Linux 2.6 by Péter Szabó (blogspot)
The FUSE Wire Protocol by Jakob Unterwurzacher (nuetzlich.net)
How can I create a userspace filesystem with FUSE without using libfuse? by Corbin Simpson (stackoverflow)
FUSE Protocol by Johan Rydberg (webarchive)
Linux Kernel > Filesystems > FUSE (kernel.org)

danda · August 5, 2020, 1:39am

@happybeing – good stuff. The polyfuse sample seems a good start. I notice its a bit different than the C api, eg I don’t see a create() call, maybe that is do_mknod() instead.

I’ve had my nose stuck deep in the code all day, so I have a bit of progress to report.

I started making a prototype filesystem class. This is a super quick/dirty implementation of a simplified fuse API. eg, it doesn’t bother with all the request/reply data structures, etc. The idea is to test/simulate/visualize the basic concepts we’ve been refining.

I was able to implement init(), lookup(), mkdir(), create(), and link(), and then to write a little test case that tries them out.

Code:

function test_fs() {
    // initial filesystem
    $fs = new filesystem();
    $fs->init(new replica());

    // display state
    $fs->print_current_state();

    // get ino for /
    $ino_root = $fs->lookup("/");

    // create /home/bob
    $ino_home = $fs->mkdir($ino_root, "home" );
    $ino_bob = $fs->mkdir($ino_home, "bob" );

    // create /home/bob/homework.txt and hard-link homework-link.txt
    $ino_homework = $fs->create($ino_bob, "homework.txt", 'c' );
    $fs->link($ino_homework, $ino_bob, "homework-link.txt");

    // display state
    $fs->print_current_state();
}

Output:

------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":1,"links":1,"is_file":false}
------- end state -------


------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1003 => {"name":"home","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
      - 1004 => {"name":"bob","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
        - 1006 => {"name":"homework.txt","inode_id":1005}
        - 1007 => {"name":"homework-link.txt","inode_id":1005}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1005 => {"size":0,"ctime":1596590417,"mtime":1596590417,"kind":"file","xorname":null}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
4 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
5 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
6 => {"ino":6,"tree_id":1005,"ref_count":1,"links":2,"is_file":true}
------- end state -------

That’s as far as I’ve gotten, but working so far!

Note that the code is NOT creating inode entries for directories (or symlinks) under /fileinodes. This should save some space and a lot of unnecessary move operations.