FileTree CRDT for SAFE Network

yeah, I wondered that too, and I believe you may be right. Will check the docs in more detail.

But it mostly boils down to the same thing. Either all of the changes in the crdt tree nodes are stored in the op log and (eventually) go across the network, or they don’t.

I/we need to think more about the ramifications if some do not. For one thing, those local mods would presumably not be stored in the operations log, so they would not be recreated in the state after an undo/redo apply cycle. Possibly a 2nd log for local ops must be kept, at which point I start to wonder if the data type would still meet CRDT requirements.

Another thought: FUSE has a bunch of flags indicating which features a filesystem supports or not. So maybe there is a way to say we don’t support ref_cnt for lookups and other read ops…


ok, thinking about it some more…

I suppose we could have a fully separate data structure outside the crdt-tree that stores local-only info about each inode since the time of mount, such as use_cnt. Possibly this could live in memory, especially if we only need it for use_cnt, since then only nodes that are actually in use need be stored.

Please keep and share your list of useful reference docs. It seems hard to find some stuff.

I’m hopeful we can use the sync operation as the boundary, because the filesystem operations must have been designed to cope with a similar (hopefully equivalent) problem. So we can try and match the filesystem API with what we can do in a FileTree, and see how well that works.

That’s what I’ve been thinking, but it is yet to be established.

basically these:

  • C
  • Rust

It seems that the lookup refcount is optional. From the fuse-rs docs:

fn forget(&mut self, _req: &Request, _ino: u64, _nlookup: u64)

Forget about an inode. The nlookup parameter indicates the number of lookups previously performed on this inode. If the filesystem implements inode lifetimes, it is recommended that inodes acquire a single reference on each lookup, and lose nlookup references on each forget. The filesystem may ignore forget calls, if the inodes don’t need to have a limited lifetime. On unmount it is not guaranteed that all referenced inodes will receive a forget message.

More detail, from http://fuse.996288.n3.nabble.com/forget-inodes-td9599.html:

The VFS-fuse will lookup() based on a parent inode and a file name, resulting in an inode with a lookup count incremented by 1. The lookup count is also set to 1 for creation of a file. It is possible that you will receive multiple lookup requests, in which case you are to increase your lookup counter.

While the inode is looked-up, the file-system should keep a record in its cache because it can expect operations on that inode directly.

The kernel may forget inodes under three circumstances:

  1. The inode is deleted and it is not open.
  2. Pressure on the cache causes the kernel to free some cached information including that for the inode.
  3. The file-system is unmounted (similar to 2, but freeing everything about the file-system) - although perhaps the default fuse implementation here is to suppress forget operations in this case (in the multi-threaded fuse loop function).

Essentially you should be keeping track of the inode while you have a lookup count that is larger than 0 so that you can do operations on it instantly without having to lookup the inode yourself.
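
To make that concrete, here is a minimal sketch of the lookup-count bookkeeping the quote describes; the LocalInode and InodeTable names are made up for illustration, not taken from fuse-rs:

use std::collections::HashMap;

// Hypothetical local-only bookkeeping for kernel lookup counts.
struct LocalInode {
    lookup_cnt: u64,
    // ... other local-only state (e.g. cached metadata)
}

#[derive(Default)]
struct InodeTable {
    entries: HashMap<u64, LocalInode>, // ino -> local state
}

impl InodeTable {
    // Called whenever we reply to a lookup (or create): +1 reference.
    fn on_lookup(&mut self, ino: u64) {
        self.entries
            .entry(ino)
            .or_insert(LocalInode { lookup_cnt: 0 })
            .lookup_cnt += 1;
    }

    // Called on forget(ino, nlookup): drop nlookup references, and
    // evict the entry once the count reaches zero.
    fn on_forget(&mut self, ino: u64, nlookup: u64) {
        if let Some(entry) = self.entries.get_mut(&ino) {
            entry.lookup_cnt = entry.lookup_cnt.saturating_sub(nlookup);
            if entry.lookup_cnt == 0 {
                self.entries.remove(&ino);
            }
        }
    }
}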

Unrelated, but interesting… it seems the ino can be ephemeral. So it could also be stored in the local-only data, mapped to/from the inode_entry.

The ‘generation’ field is important if your inode number generator may generate different inode numbers at different times for the same object. This is uncommon for on-disk file systems, but it may happen for network file systems (like NFS, see 1).

It is mentioned in 1 that a server may use a different set of (fuse) inode numbers/(nfs) file handles after a restart. If that happens, it is possible that the new inode numbers map to objects in a different way than the inode numbers which were given out before the server restart. A client could use a different generation number for the set of inodes before the restart and for the set of inodes after the restart to make clear which inode is meant.

If your file system has a static generation scheme for inodes (where an inode number always points to the same object), there is no need to use the generation number and it may be used to extend the inode number.

Ok, so the generation (ephemeral) inode thing from my previous comment seems to solve our inode collision problem… because it means that ino can remain a strictly local thing, which is great, and makes a lot of sense for any kind of networked file system.
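
To illustrate the idea (a hypothetical sketch, not the actual FUSE reply struct):

// Conceptually, a lookup reply pairs the (local, ephemeral) ino with a
// generation that changes each mount session, so (ino, generation) stays
// unique even if inos are reused across mounts.
struct EntryReply {
    ino: u64,        // locally assigned, may differ between mounts
    generation: u64, // e.g. bumped (or set to mount time) per session
}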

So our previous example becomes:

- root
  - file1 (uuid<555>,inode_uuid<123>)
  - file2 (uuid<556>,inode_uuid<123>)
- inodes
  - (uuid<123>, meta...)  --> XorName ---> file content
- trash

plus, we have a purely local hashmap/dictionary/index of uuid => local_inode_entry:

123 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

and another that is ino => local_inode_entry:

5 ---> (ino<5>, uuid<123>, lookup_cnt<2>)

This way, we can quickly look up by the ino to find the uuid (eg for stat) or by the uuid to find the ino (eg for lookup).

It would be nicer if we could use a single index, but I don’t immediately see a good way to do that at this late hour. :slight_smile:
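
For illustration, here is a minimal sketch of the two indexes (LocalInodeEntry, InodeIndex and the Uuid alias are made-up names; the real identifier type would come from the crdt-tree):

use std::collections::HashMap;

type Uuid = u128; // stand-in for the crdt-tree node identifier

struct LocalInodeEntry {
    ino: u64,
    uuid: Uuid,
    lookup_cnt: u64,
}

#[derive(Default)]
struct InodeIndex {
    by_ino: HashMap<u64, LocalInodeEntry>, // ino -> entry (owns the data)
    by_uuid: HashMap<Uuid, u64>,           // uuid -> ino (thin index)
    next_ino: u64,
}

impl InodeIndex {
    // Look up (or assign) the local ino for a tree node uuid.
    fn ino_for_uuid(&mut self, uuid: Uuid) -> u64 {
        if let Some(&ino) = self.by_uuid.get(&uuid) {
            return ino;
        }
        self.next_ino += 1;
        let ino = self.next_ino;
        self.by_uuid.insert(uuid, ino);
        self.by_ino
            .insert(ino, LocalInodeEntry { ino, uuid, lookup_cnt: 0 });
        ino
    }

    // eg for stat: ino -> uuid
    fn uuid_for_ino(&self, ino: u64) -> Option<Uuid> {
        self.by_ino.get(&ino).map(|e| e.uuid)
    }
}

Storing the full entry only in the ino map, with the uuid map as a thin uuid => ino index, gets close to the single-index wish: the entry itself is never duplicated.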


Yes, I agree this should work, and I was at some point thinking along these lines - that inos could be treated like file handles locally (when I saw the ref counting). I hadn’t confirmed that though, so hats off to you.

I don’t think we should worry about the two indexes. If we implement local ref counts the indexes can be purged if they grow too big. It should be easy to optimise later as you recommended earlier.

I think we know enough to think about what lives in each of safe-fuse, safe-fs, FileTree, and TreeCRDT and the boundaries between them.

I’ve been thinking safe-fuse and safe-fs could be thin layers if FileTree provides a POSIX-style outer API, or we could keep FileTree limited to managing inodes.

I keep coming round to thinking safe-fs could instead live inside FileTree. I like the idea that from the developers’ perspective they just create a FileTree, and from this can get the level of FS API they want.

No need to decide any of this yet, just sharing thoughts and interested to hear any views. safe-fs can be built separately now and rolled into FileTree later if we choose.

I’ve been occupied elsewhere but plan to work on the FUSE code today and will let you know when there’s something to look at.

@danda re:

I’m having a look at using the MemFS example from polyfuse (which is based on the Rust Filesystem trait you linked to, but with async support from tokio). It has several examples, including the in-memory filesystem, which is all in one file.

The MemFS example looks promising because it is compact and defines types which look right and which I think we can just use now, e.g. see struct INode (link), and notice it has refcount and links:

struct INode {
    attr: FileAttr,                          // standard file attributes (size, times, mode, ...)
    xattrs: HashMap<OsString, Arc<Vec<u8>>>, // extended attributes
    refcount: u64,                           // kernel lookup count (see the forget discussion above)
    links: u64,                              // hard link count
    kind: INodeKind,                         // file, directory, or symlink (see below)
}

enum INodeKind {
    RegularFile(Vec<u8>),
    Directory(Directory),
    Symlink(Arc<OsString>),
}

At the bottom of the file is the Filesystem trait, and in between are the implementation functions.

Using FileTreeFS in polyfuse-test

I’ve forked the above as polyfuse-test and added a FileTreeFS example in branch filetree-test.

I made a copy of the MemFS example in ./examples/examples/filetreefs.rs (with all occurrences of MemFS renamed to FileTreeFS). So it is a copy and runs exactly as the MemFS example.

To try it out:

Build

git clone https://github.com/theWebalyst/polyfuse-test
cd polyfuse-test
git checkout filetree-test

Make a mount point and mount it

cd polyfuse-test
mkdir mount
cargo run --example filetreefs mount

In another terminal

cd polyfuse-test
ls -ail mount
stat mount

I think the above gives you what you were asking for, but let me know if not. It’s a working FUSE mount and template FS which can be modified to slot in a FileTree to do the actual lookup(), mknod(), mkdir() etc, in whatever order is easiest for you.

I’m not sure I can do any more code at this point without some instruction from you, because I’m still unsure about defining traits etc, or writing any code from scratch other than for learning. But if you can set a framework and give me a task, I’ll have a go.

I suggest you fork the above and, as soon as you have something, push it so I can have a look and maybe help out with sub-tasks. In the meantime I could pull together the proposed design into the document I started.

Thanks for these, added mine below.

Filesystems

  • A Five-Year Study of File-System Metadata (October 2007, PDF: online, [local](file:///home/mrh/Documents/MaidSafe/Research/filesystems/A%20Five-Year%20Study%20of%20File-System%20Metadata%20(October%202007).pdf))
  • inode - useful summary of Unix/POSIX implementation of inodes and hard links (Wikipedia)

Note: maybe only macOS allows hard links to directories? (see “Implications”)

See “Special inodes” for special ino values (NOTE: 1 is not root, so maybe we must tell FUSE which is root?)

FUSE

Gold: FUSE versions, protocol, implementation gotchas and debugging (more on debugging running FUSE here)

  • libfuse wiki: Protocol Sketch by Csaba Henk (link)
  • FUSE protocol tutorial for Linux 2.6 by Péter Szabó (blogspot)
  • The FUSE Wire Protocol by Jakob Unterwurzacher (nuetzlich.net)
  • How can I create a userspace filesystem with FUSE without using libfuse? by Corbin Simpson (stackoverflow)
  • FUSE Protocol by Johan Rydberg (webarchive)
  • Linux Kernel > Filesystems > FUSE (kernel.org)

@happybeing – good stuff. The polyfuse sample seems a good start. I notice it’s a bit different from the C API, eg I don’t see a create() call; maybe that is do_mknod() instead.

I’ve had my nose stuck deep in the code all day, so I have a bit of progress to report.

I started making a prototype filesystem class. This is a super quick/dirty implementation of a simplified FUSE API; eg, it doesn’t bother with all the request/reply data structures, etc. The idea is to test/simulate/visualize the basic concepts we’ve been refining.

I was able to implement init(), lookup(), mkdir(), create(), and link(), and then to write a little test case that tries them out.

Code:

function test_fs() {
    // initial filesystem
    $fs = new filesystem();
    $fs->init(new replica());

    // display state
    $fs->print_current_state();

    // get ino for /
    $ino_root = $fs->lookup("/");

    // create /home/bob
    $ino_home = $fs->mkdir($ino_root, "home" );
    $ino_bob = $fs->mkdir($ino_home, "bob" );

    // create /home/bob/homework.txt and hard-link homework-link.txt
    $ino_homework = $fs->create($ino_bob, "homework.txt", 'c' );
    $fs->link($ino_homework, $ino_bob, "homework-link.txt");

    // display state
    $fs->print_current_state();
}

Output:

------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":1,"links":1,"is_file":false}
------- end state -------


------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1003 => {"name":"home","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
      - 1004 => {"name":"bob","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
        - 1006 => {"name":"homework.txt","inode_id":1005}
        - 1007 => {"name":"homework-link.txt","inode_id":1005}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1005 => {"size":0,"ctime":1596590417,"mtime":1596590417,"kind":"file","xorname":null}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
4 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
5 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
6 => {"ino":6,"tree_id":1005,"ref_count":1,"links":2,"is_file":true}
------- end state -------

That’s as far as I’ve gotten, but working so far!

Note that the code is NOT creating inode entries for directories (or symlinks) under /fileinodes. This should save some space and a lot of unnecessary move operations.

Great news @danda, keep going!

Yep, very nice. :smile:


Ok, I have more code (and tests) working.

In a nutshell:

  • link and unlink
  • writing and reading files (content stored in metadata for now)
  • readdir – listing directories
  • rename/move

Output of tests can be seen here.

The prototype filesystem code is on github here.

At this point, I feel that I’ve demonstrated that the design of the data structure works. I may implement symlinks tomorrow, and then update the design docs, before starting on Rust code next week…

Also it would be good to get some kind of timing/perf numbers…


The MemFS/FileTreeFS doesn’t implement Create(), but it is defined for the FUSE Operation trait in op.rs, so we could implement it if needed. On the inode side, the Linux VFS docs here appear to say that create() is needed for regular files, but maybe the FUSE kernel side allows it to be optional.

I’ve added an implementation document to the SAFE FS repo’s ./design directory.

@happybeing here are a few issues worth discussion design-wise.

  1. How best to implement stateful file handles?

  2. How/where can we store local filesystem data?

  3. What actions do we perform during mount and unmount?

In more detail:

How best to implement stateful file handles?

If a process has an open file handle, it can continue to use it even if the file has been deleted/unlinked by itself or another process. We need to ensure our design accommodates this. I plan to look into it more today. Let me know if you have any insights into what the usual design pattern is for this…
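
One common pattern is to keep the inode alive while either a directory entry or an open handle still references it, and only reap it when both counts hit zero. A minimal sketch (all names hypothetical, not from any particular FUSE implementation):

use std::collections::HashMap;

// Hypothetical sketch: an unlinked file survives while handles are open.
struct Inode {
    links: u64,      // directory entries pointing at this inode
    open_count: u64, // open file handles
    // ... file content, attrs, etc.
}

#[derive(Default)]
struct Fs {
    inodes: HashMap<u64, Inode>,
    handles: HashMap<u64, u64>, // fh -> ino
    next_fh: u64,
}

impl Fs {
    fn open(&mut self, ino: u64) -> u64 {
        self.inodes.get_mut(&ino).unwrap().open_count += 1;
        self.next_fh += 1;
        self.handles.insert(self.next_fh, ino);
        self.next_fh
    }

    // unlink removes the directory entry, not the inode itself.
    fn unlink(&mut self, ino: u64) {
        self.inodes.get_mut(&ino).unwrap().links -= 1;
        self.maybe_reap(ino);
    }

    fn release(&mut self, fh: u64) {
        if let Some(ino) = self.handles.remove(&fh) {
            self.inodes.get_mut(&ino).unwrap().open_count -= 1;
            self.maybe_reap(ino);
        }
    }

    // Only free the inode once it is both unlinked and closed everywhere.
    fn maybe_reap(&mut self, ino: u64) {
        let reap = match self.inodes.get(&ino) {
            Some(i) => i.links == 0 && i.open_count == 0,
            None => false,
        };
        if reap {
            self.inodes.remove(&ino);
        }
    }
}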

How/where can we store local filesystem data?

We have the crdt-tree op logs, (possibly) a cache of the current replica state, and also the local inode entries and ino to uuid lookup indexes.

Typically one thinks of a filesystem as storing such data within itself, eg in the superblock of a block device.

However, as a network fs, we have no underlying block device. I am thinking that we somehow designate a file on an existing mounted filesystem as our storage area, similar to (or exactly the same as?) mounting a loopback filesystem such as a CD-ROM ISO file.

It seems that any networked filesystem would have similar needs, so hopefully there is a standard way to go about this…?

What actions do we perform during mount and unmount?

In theory, we could download everything from the network at mount, replay logs, and arrive at current state. At unmount, we simply discard everything. Issues with this are that it is slow and isn’t great for “local-first” operation because it means that if we unmount while offline, we can’t re-mount the filesystem until we are online again.

So what seems desirable in the final production system is to locally cache logs and state, so that we can quickly re-mount without a potentially lengthy download.

For phase 1 we are building a local-only filesystem. My thinking is that we should not try to optimize/cache just yet. Instead, at unmount time, we simply write out the operation log to our local store (whatever it is). At mount, we read it in and replay the ops. This functionality will be needed for the initial mount (no local cache) from the network anyway.

Improvement: rather than writing the entire log/state at unmount, it should be written as events occur, and then unmount would perform only a final sync.
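
A minimal sketch of that phase-1 flow, with Replica::apply as a placeholder for the real crdt-tree API and ops serialized one per line for simplicity:

use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};

// Placeholder for a serialized crdt-tree Move operation.
type Op = String;

// Hypothetical replica state; `apply` stands in for the real crdt-tree API.
#[derive(Default)]
struct Replica;
impl Replica {
    fn apply(&mut self, _op: &Op) { /* replay one move op */ }
}

// At mount: read the local op log (if any) and replay it to rebuild state.
fn mount(log_path: &str) -> std::io::Result<Replica> {
    let mut replica = Replica::default();
    if let Ok(file) = File::open(log_path) {
        for line in BufReader::new(file).lines() {
            replica.apply(&line?);
        }
    }
    Ok(replica)
}

// As each op is applied locally, append it to the log (the improvement
// above), so unmount only needs a final flush rather than a full dump.
fn append_op(log_path: &str, op: &Op) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(log_path)?;
    writeln!(file, "{}", op)
}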


I have boat chores to do over the next two days so I’ll give some quick responses now and maybe revisit when I have more time…

I’m not sure, but do we need to do this for FUSE? We have ref counts to hold the inode in memory until it is no longer in use, and each file can only be open for write by one process, so I think we need a memory cache for inodes and one for file data until synced.

So I think we may just need to support cached writes (in memory) until sync. Syncer shows how to do this using multiple threads, more on this in a bit.

Syncer handles this with a local disk cache of inodes, and another of file data split into 1MB chunks (each with its own cache in memory for unsynced data).

I agree we want caching as an option. My thought about adapting Syncer was to start by using as much as possible of its existing disk-based caching (mentioned above), but then to remove this in favour of a ‘deeper’ cache using SAFE chunks - i.e. the actual raw chunks put to and got from the network.

However, I’m not sure that will work, because I was assuming everything was stored as chunks, whereas I am not sure that all data types can be cached in this way. Do you know? For example, is a Sequence chunked when put, and if so could we have a local cache of the chunks on the raw network side? From memory I think the whole Sequence is one object, but I’m not sure - I found the code difficult to decipher.

If so, we can have the benefit of caching not just for FileTree but for all SAFE apps, and the chunks will already be encrypted (although I understand we’ll need an extra layer of encryption on top later).

So in the spirit of small steps, maybe we can implement a node cache and a file chunk cache (both in memory) like Syncer does, which sync to the FileTree and directly to the network.

Later we could add a cache for raw SAFE chunks to avoid the need to download everything on every mount, possibly with a special cache for the TreeCRDT if that’s not catered for by a chunk cache.

A decent SAFE cache is probably a whole separate project.

I don’t know if the above is feasible. It’s how I was thinking when trying to adapt Syncer before beginning to work on this with you.
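
As a rough sketch of the in-memory file chunk cache idea above (the 1 MiB chunk size follows the Syncer description; all names are made up, and writes crossing a chunk boundary are omitted for brevity):

use std::collections::HashMap;

const CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB, as in Syncer

// In-memory cache of file data, split into fixed-size chunks.
// Dirty chunks are flushed to the FileTree/network on sync().
#[derive(Default)]
struct FileCache {
    chunks: HashMap<usize, Vec<u8>>, // chunk index -> data
    dirty: Vec<usize>,               // indexes awaiting upload
}

impl FileCache {
    // Write within a single chunk, marking it dirty.
    fn write(&mut self, offset: usize, data: &[u8]) {
        let idx = offset / CHUNK_SIZE;
        let within = offset % CHUNK_SIZE;
        let chunk = self
            .chunks
            .entry(idx)
            .or_insert_with(|| vec![0u8; CHUNK_SIZE]);
        chunk[within..within + data.len()].copy_from_slice(data);
        self.dirty.push(idx);
    }

    // On sync (or unmount), flush dirty chunks.
    fn sync(&mut self) {
        for idx in self.dirty.drain(..) {
            let _chunk = &self.chunks[&idx]; // upload would happen here
        }
    }
}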


This paper has a pretty good description of how FUSE works, including stateful vs stateless handling of opened files/dirs.

Also, not about FUSE or even Unix specifically, but I found this free book “Practical File System Design”, which presents the BeOS filesystem with mostly POSIX features. It goes into a lot of good detail.


@happybeing I’d like to update you and get your thoughts on this:

https://hackmd.io/1dBKrtkQTa656k070hxVvw?view

Also, I have some code for the crdt-tree in Rust here:

See examples/tree.rs and src/tree/*

It isn’t properly commented yet, needs unit tests, may move to a different crate, etc.


Hi @danda,

Great to see such progress, and that your work will be part of rust-crdt! The thesis sounds very interesting from your descriptions. I can’t access it, but as you have now gained such a good understanding of the CRDT and FS areas (beyond mine in both, I think), I’m not sure I can add much there, although I am happy to continue wherever you think I can still help out. Maybe as review, sounding board etc? I’ve been learning Rust, so maybe before long there’ll be something I can do on the coding side.

Your summary of the thesis sounds like you have a very good understanding of the issues, and some design options to evaluate. I don’t have much to add about most of the points you explored there as you seem to have covered things well.

One area had me thinking: where you talk about high-level operations such as mkdir etc versus the low-level tree-move. At present, the SequenceCRDT (and, as I understand it, the plan for the TreeCRDT) loads/saves the whole data structure to the network, which is fine as a first step and can work up to a point, but comes with limitations. There are various ways to deal with that, but I’m just reminding myself of the idea that we could transmit operations rather than load/save the entire data structure. So I’m wondering if you have any thoughts on that side, in the light of the thesis implementation, which sounds like it fits with that model?

I’ll have a play with the tree-crdt examples, hopefully today.

That might or might not happen. The author of rust-crdt would need to accept/merge the changes. Also, we are moving towards small, focused crates now, eg the xorname crate. So I may end up making a dedicated crate for crdt-tree (which rust-crdt could choose to include or not). I’m still considering the best path here, but leaning towards the dedicated crate.

Both LSeq (Sequence) and crdt-tree transmit operations between replicas, not full state (CmRDT, not CvRDT). For LSeq the ops are Insert and Delete. For crdt-tree, there is only the Move operation. Or perhaps I missed your meaning?
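
For reference, a sketch of roughly what a single Move operation carries (field names are illustrative; see examples/tree.rs for the real types):

// Sketch of a crdt-tree Move operation (field names are my guesses).
// A single op fully describes the change, so replicas exchange only ops.
struct OpMove<Id, Meta> {
    timestamp: u64, // Lamport timestamp: totally orders ops across replicas
    parent_id: Id,  // new parent of the node
    metadata: Meta, // e.g. the node's name under its parent
    child_id: Id,   // the node being moved (or created)
}

Since create, rename and delete can all be expressed as moves (delete being a move under the trash node, as in the earlier examples), one op type is enough.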