FileTree CRDT for SAFE Network

Yes, I agree this should work. I was at some point thinking along these lines - that inos could be treated like file handles locally (when I saw the ref counting) - but I hadn't confirmed it, so hats off to you.

I don’t think we should worry about the two indexes. If we implement local ref counts the indexes can be purged if they grow too big. It should be easy to optimise later as you recommended earlier.
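To make the ref-counting/purging idea concrete, here's a minimal sketch of a purgeable local index, assuming a uuid-style tree identifier; the names (`InoIndex`, `lookup`, `forget`) are mine, not from any existing code:

use std::collections::HashMap;

/// Hypothetical 128-bit tree identifier (e.g. a uuid) used by the tree CRDT.
type TreeId = u128;

/// Local, per-mount index mapping FUSE inos to tree ids, with ref counts so
/// entries can be purged once the kernel no longer references them.
#[derive(Default)]
struct InoIndex {
    next_ino: u64,
    by_ino: HashMap<u64, (TreeId, u64)>, // ino -> (tree id, refcount)
    by_tree: HashMap<TreeId, u64>,       // tree id -> ino
}

impl InoIndex {
    /// Return the existing ino for a tree id or allocate a new one,
    /// bumping the ref count either way (as lookup() would).
    fn lookup(&mut self, id: TreeId) -> u64 {
        let ino = match self.by_tree.get(&id) {
            Some(&ino) => ino,
            None => {
                self.next_ino += 1;
                self.by_tree.insert(id, self.next_ino);
                self.next_ino
            }
        };
        self.by_ino.entry(ino).or_insert((id, 0)).1 += 1;
        ino
    }

    /// Drop `n` references (as forget() would) and purge the entry when the
    /// count reaches zero, so neither index grows without bound.
    fn forget(&mut self, ino: u64, n: u64) {
        let purge = self.by_ino.get_mut(&ino).and_then(|(id, count)| {
            *count = count.saturating_sub(n);
            if *count == 0 { Some(*id) } else { None }
        });
        if let Some(id) = purge {
            self.by_ino.remove(&ino);
            self.by_tree.remove(&id);
        }
    }
}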

I think we know enough to think about what lives in each of safe-fuse, safe-fs, FileTree, and TreeCRDT, and the boundaries between them.

I’ve been thinking safe-fuse and safe-fs could be thin layers if FileTree provides a POSIX-style outer API, or we could keep FileTree to managing inodes.

I keep coming round to thinking safe-fs could instead live inside FileTree. I like the idea that from the developers’ perspective they just create a FileTree, and from this can get the level of FS API they want.

No need to decide any of this yet, just sharing thoughts and interested to hear any views. safe-fs can be built separately now and rolled into FileTree later if we choose.

I’ve been occupied elsewhere but plan to work on the FUSE code today and will let you know when there’s something to look at.

@danda re:

I’m having a look at using the MemFs example from polyfuse (which is based on the Rust Filesystem trait you linked to, but with async support from tokio). It has several examples, including the in-memory filesystem, which is all in one file.

The MemFs example looks promising because it is compact and defines types which look right and which I think we can use as-is, e.g. see struct INode (link), and notice it has refcount and links:

// From polyfuse's MemFS example (FileAttr, Directory etc. are defined
// elsewhere in the same file).
struct INode {
    attr: FileAttr,                          // file attributes (size, times, mode, ...)
    xattrs: HashMap<OsString, Arc<Vec<u8>>>, // extended attributes
    refcount: u64,                           // kernel lookup reference count
    links: u64,                              // hard link count
    kind: INodeKind,
}

enum INodeKind {
    RegularFile(Vec<u8>),
    Directory(Directory),
    Symlink(Arc<OsString>),
}

At the bottom of the file is the Filesystem trait, and in between are the implementation functions.

Using FileTreeFS in polyfuse-test

I’ve forked the above as polyfuse-test and added a FileTreeFS example in branch filetree-test.

I made a copy of the MemFS example in ./examples/examples/filetreefs.rs (with all occurrences of MemFS renamed to FileTreeFS), so it is a straight copy and runs exactly like the MemFS example.

To try it out:

Build

git clone https://github.com/theWebalyst/polyfuse-test
cd polyfuse-test
git checkout filetree-test

Make a mount point and mount it

cd polyfuse-test
mkdir mount
cargo run --example filetreefs mount

In another terminal

cd polyfuse-test
ls -ail mount
stat mount

I think the above gives you what you were asking for but let me know if not. It’s a working FUSE mount and template FS which can be modified to slot in a FileTree to do the actual lookup(), mknod(), mkdir() etc in the order that’s easiest for you.

I’m not sure I can do much more code at this point without some direction from you, because I’m still unsure about defining traits etc. or writing code from scratch other than for learning. But if you can set out a framework and give me a task I’ll have a go.

I suggest you fork the above and, as soon as you have something, push it so I can have a look and maybe help out with sub-tasks. In the meantime I could pull together the proposed design into the document I started.

Thanks for these, added mine below.

Filesystems

  • A Five-Year Study of File-System Metadata (October 2007, PDF: online, [local](file:///home/mrh/Documents/MaidSafe/Research/filesystems/A%20Five-Year%20Study%20of%20File-System%20Metadata%20(October%202007).pdf))
  • inode - useful summary of Unix/POSIX implementation of inodes and hard links (Wikipedia)

Note: maybe only macOS allows hard links to directories? (see “Implications”)

See “Special inodes” for special ino values (NOTE: 1 is not root, so maybe we must tell FUSE which is root?)

FUSE

Gold: FUSE versions, protocol, implementation gotchas and debugging (more on debugging a running FUSE filesystem here)

  • libfuse wiki: Protocol Sketch by Csaba Henk (link)
  • FUSE protocol tutorial for Linux 2.6 by Péter Szabó (blogspot)
  • The FUSE Wire Protocol by Jakob Unterwurzacher (nuetzlich.net)
  • How can I create a userspace filesystem with FUSE without using libfuse? by Corbin Simpson (stackoverflow)
  • FUSE Protocol by Johan Rydberg (webarchive)
  • Linux Kernel > Filesystems > FUSE (kernel.org)

@happybeing – good stuff. The polyfuse sample seems a good start. I notice it’s a bit different from the C API, e.g. I don’t see a create() call; maybe that is do_mknod() instead.

I’ve had my nose stuck deep in the code all day, so I have a bit of progress to report.

I started making a prototype filesystem class. This is a super quick/dirty implementation of a simplified FUSE API, e.g. it doesn’t bother with all the request/reply data structures, etc. The idea is to test/simulate/visualize the basic concepts we’ve been refining.

I was able to implement init(), lookup(), mkdir(), create(), and link(), and then to write a little test case that tries them out.

Code:

function test_fs() {
    // initial filesystem
    $fs = new filesystem();
    $fs->init(new replica());

    // display state
    $fs->print_current_state();

    // get ino for /
    $ino_root = $fs->lookup("/");

    // create /home/bob
    $ino_home = $fs->mkdir($ino_root, "home" );
    $ino_bob = $fs->mkdir($ino_home, "bob" );

    // create /home/bob/homework.txt and hard-link homework-link.txt
    $ino_homework = $fs->create($ino_bob, "homework.txt", 'c' );
    $fs->link($ino_homework, $ino_bob, "homework-link.txt");

    // display state
    $fs->print_current_state();
}

Output:

------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":1,"links":1,"is_file":false}
------- end state -------


------- current filesystem state -------
- null => forest
  - 1000 => {"name":"root","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1003 => {"name":"home","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
      - 1004 => {"name":"bob","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
        - 1006 => {"name":"homework.txt","inode_id":1005}
        - 1007 => {"name":"homework-link.txt","inode_id":1005}
  - 1001 => {"name":"fileinodes","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}
    - 1005 => {"size":0,"ctime":1596590417,"mtime":1596590417,"kind":"file","xorname":null}
  - 1002 => {"name":"trash","size":0,"ctime":1596590417,"mtime":1596590417,"kind":"dir"}

local inode entries:
3 => {"ino":3,"tree_id":1000,"ref_count":2,"links":1,"is_file":false}
4 => {"ino":4,"tree_id":1003,"ref_count":1,"links":1,"is_file":false}
5 => {"ino":5,"tree_id":1004,"ref_count":1,"links":1,"is_file":false}
6 => {"ino":6,"tree_id":1005,"ref_count":1,"links":2,"is_file":true}
------- end state -------

That’s as far as I’ve gotten, but working so far!

Note that the code is NOT creating inode entries for directories (or symlinks) under /fileinodes. This should save some space and a lot of unnecessary move operations.

Great news @danda, keep going!

Yep, very nice. :smile:


Ok, I have more code (and tests) working.

In a nutshell:

  • link and unlink
  • writing and reading files (content stored in metadata for now)
  • readdir – listing directories
  • rename/move

Output of tests can be seen here.

The prototype filesystem code is on github here.

At this point, I feel I’ve demonstrated that the design of the data structure works. I may implement symlinks tomorrow, and then update the design docs, before starting on Rust code next week…

Also it would be good to get some kind of timing/perf numbers…


The MemFS/FileTreeFS example doesn’t implement Create(), but it is defined for the FUSE Operation trait in op.rs, so we could implement it if needed. On the inode side, the Linux VFS docs here appear to say that create() is needed for regular files, but maybe the FUSE kernel code allows it to be optional.

I’ve added an implementation document to the SAFE FS repo so ./design now contains:

@happybeing here are a few issues worth discussion design-wise.

  1. How best to implement stateful file handles?

  2. How/where can we store local filesystem data?

  3. What actions do we perform during mount and unmount?

In more detail:

How best to implement stateful file handles?

If a process has an open file handle, it can continue to use it even if the file has been deleted/unlinked by itself or another process. We need to ensure our design accommodates this. I plan to look into it more today. Let me know if you have any insights into what the design pattern usually is for this…
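One pattern that might fit (a sketch with made-up names, not the actual FUSE/polyfuse API): keep the open-handle table separate from the name tree, have open() take its own reference on the inode, let unlink() remove only the directory entry, and reclaim the inode's data only when both the link count and the open-handle count reach zero:

use std::collections::HashMap;

/// Minimal sketch: file handles hold their own reference to an inode,
/// so an unlinked file stays readable/writable through existing handles.
#[derive(Default)]
struct Inode {
    data: Vec<u8>,
    nlink: u64,      // directory entries pointing at this inode
    open_count: u64, // outstanding file handles
}

#[derive(Default)]
struct HandleTable {
    inodes: HashMap<u64, Inode>, // ino -> inode
    handles: HashMap<u64, u64>,  // fh  -> ino
    next_fh: u64,
}

impl HandleTable {
    /// open() allocates a file handle that pins the inode.
    fn open(&mut self, ino: u64) -> Option<u64> {
        let inode = self.inodes.get_mut(&ino)?;
        inode.open_count += 1;
        self.next_fh += 1;
        self.handles.insert(self.next_fh, ino);
        Some(self.next_fh)
    }

    /// unlink() drops a directory entry; the inode survives while handles remain.
    fn unlink(&mut self, ino: u64) {
        if let Some(inode) = self.inodes.get_mut(&ino) {
            inode.nlink = inode.nlink.saturating_sub(1);
            if inode.nlink == 0 && inode.open_count == 0 {
                self.inodes.remove(&ino);
            }
        }
    }

    /// release() closes a handle; the inode is reclaimed only once it is
    /// both unlinked everywhere and no longer open.
    fn release(&mut self, fh: u64) {
        if let Some(ino) = self.handles.remove(&fh) {
            if let Some(inode) = self.inodes.get_mut(&ino) {
                inode.open_count -= 1;
                if inode.nlink == 0 && inode.open_count == 0 {
                    self.inodes.remove(&ino);
                }
            }
        }
    }
}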

How/where can we store local filesystem data?

We have the crdt-tree op logs, (possibly) a cache of the current replica state, and also the local inode entries and ino to uuid lookup indexes.

Typically one thinks of a filesystem as storing such data within itself, e.g. in the superblock of a block device.

However, as a network FS, we have no underlying block device. I am thinking that we somehow designate a file on an existing mounted filesystem as our storage area, similar to (or exactly the same as?) mounting a loopback filesystem such as a CD-ROM ISO file.

It seems that any networked filesystem would have similar needs, so hopefully there is a standard way to go about this…?

What actions do we perform during mount and unmount?

In theory, we could download everything from the network at mount, replay logs, and arrive at current state. At unmount, we simply discard everything. Issues with this are that it is slow and isn’t great for “local-first” operation because it means that if we unmount while offline, we can’t re-mount the filesystem until we are online again.

So what seems desirable in the final production system is to locally cache logs and state, so that we can quickly re-mount without a potentially lengthy download.

For phase 1 we are building a local-only filesystem. My thinking is that we should not try to optimize/cache just yet. Instead, at unmount time, we simply write out the operation log to our local store (whatever it is). At mount, we read it in and replay ops. This functionality will be needed for initial mount (no local cache) from the network anyway.

Improvement: rather than writing the entire log/state at unmount, it should be written as events occur, and then unmount would perform a final sync.
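A rough sketch of that improvement, with placeholder types (LogOp is a stand-in for the real crdt-tree move op, the file path and serde_json encoding are arbitrary choices, not a decided format):

use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};
use serde::{Deserialize, Serialize};

/// Placeholder for a crdt-tree move operation (the real type lives in crdt_tree).
#[derive(Serialize, Deserialize, Debug)]
struct LogOp {
    timestamp: u64,
    parent: u64,
    meta: String,
    child: u64,
}

/// Append each op to the local log as it is applied, rather than only at unmount.
fn append_op(path: &str, op: &LogOp) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{}", serde_json::to_string(op)?)?;
    Ok(())
}

/// At mount, replay the logged ops in order to rebuild the replica state.
fn replay_ops(path: &str, mut apply: impl FnMut(LogOp)) -> std::io::Result<()> {
    for line in BufReader::new(File::open(path)?).lines() {
        apply(serde_json::from_str(&line?)?);
    }
    Ok(())
}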


I have boat chores to do over the next two days so I’ll give some quick responses now and maybe revisit when I have more time…

I’m not sure, but do we need to do this for FUSE? We have ref counts to hold the inode in memory until it is no longer in use, and each file can only be open for write by one process, so I think we just need an in-memory cache for inodes and one for file data until synced.

So I think we may just need to support cached writes (in memory) until sync. Syncer shows how to do this using multiple threads, more on this in a bit.

Syncer handles this with a local disk cache of inodes, and another of file data split into 1MB chunks (each with its own cache in memory for unsynced data).

I agree we want caching as an option. My thought about adapting Syncer was to start by using as much of its existing disk-based caching as possible (mentioned above), but then to remove this in favour of a ‘deeper’ cache of SAFE chunks - i.e. the actual raw chunks put to and got from the network.

However, I’m not sure that will work because I was assuming everything is stored as chunks, whereas I’m not sure that all data types can be cached in this way. Do you know? For example, is Sequence chunked when put, and if so could we have a local cache of the chunks on the raw network side? From memory I think the whole Sequence is one object, but I’m not sure - I found the code difficult to decipher.

If so, we can have the benefit of caching not just for FileTree but for all SAFE apps, and the chunks will already be encrypted (although I understand we’ll need an extra layer of encryption on top later).

So in the spirit of small steps, maybe we can implement an inode cache and a file chunk cache (both in memory) like Syncer, which sync to the FileTree and directly to the network (see the rough sketch at the end of this post).

Later we could add a cache for raw SAFE chunks to avoid the need to download everything on every mount, possibly with a special cache for the TreeCRDT if that’s not catered for by a chunk cache.

A decent SAFE cache is probably a whole separate project.

I don’t know if the above is feasible. It’s how I was thinking when trying to adapt Syncer before beginning to work on this with you.
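To make the “both in memory like Syncer” step concrete, here’s a rough sketch of the kind of thing I have in mind (1 MiB chunks as Syncer uses; the names and the upload callback are placeholders, not Syncer’s or SAFE’s API):

use std::collections::HashMap;

const CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB chunks, as in Syncer

/// In-memory write-back cache of file data, split into fixed-size chunks.
/// Dirty chunks are held until flush() pushes them towards the network.
#[derive(Default)]
struct ChunkCache {
    chunks: HashMap<(u64, u64), Vec<u8>>, // (ino, chunk index) -> chunk bytes
}

impl ChunkCache {
    /// Buffer a write entirely in memory; no network traffic until flush().
    fn write(&mut self, ino: u64, offset: u64, data: &[u8]) {
        let mut pos = 0usize;
        while pos < data.len() {
            let abs = offset as usize + pos;
            let idx = (abs / CHUNK_SIZE) as u64;
            let within = abs % CHUNK_SIZE;
            let take = (CHUNK_SIZE - within).min(data.len() - pos);
            let chunk = self.chunks.entry((ino, idx)).or_default();
            if chunk.len() < within + take {
                chunk.resize(within + take, 0);
            }
            chunk[within..within + take].copy_from_slice(&data[pos..pos + take]);
            pos += take;
        }
    }

    /// On fsync/unmount, hand each dirty chunk to the uploader (a placeholder
    /// here for "sync to the FileTree / put to the network").
    fn flush(&mut self, mut upload: impl FnMut(u64, u64, &[u8])) {
        for ((ino, idx), chunk) in self.chunks.drain() {
            upload(ino, idx, &chunk);
        }
    }
}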


This paper has a pretty good description of how FUSE works, including stateful vs. stateless handling of opened files/dirs.

Also, not about FUSE or even Unix specifically, but I found this free book, “Practical File System Design”, which presents the BeOS filesystem with mostly POSIX features. It goes into a lot of good detail.


@happybeing I’d like to update you and get your thoughts on this:

https://hackmd.io/1dBKrtkQTa656k070hxVvw?view

Also, I have some code for the crdt-tree in rust here:

See examples/tree.rs and src/tree/*

It isn’t properly commented yet, needs unit tests, may move crates, etc.


Hi @danda,

Great to see such progress and that your work will be part of rust-crdt! The thesis sounds very interesting from your descriptions. I can’t access it, but as you have now gained such a good understanding of both the CRDT and FS areas - beyond mine in both, I think - I’m not sure I can add much there, although I am happy to continue wherever you think I can still help out. Maybe as a reviewer, sounding board etc.? I’ve been learning Rust, so maybe before long there’ll be something I can do on the coding side.

Your summary of the thesis sounds like you have a very good understanding of the issues, and some design options to evaluate. I don’t have much to add about most of the points you explored there as you seem to have covered things well.

One area had me thinking: where you talk about high-level operations such as mkdir etc. versus the low-level tree move. At present, SequenceCRDT (and, as I understand it, the plan for TreeCRDT) loads/saves the whole data structure to the network, which is fine as a first step and can work up to a point, but comes with limitations. There are various ways to deal with that, but I’m just reminding myself of the idea that we could transmit operations rather than load/save the entire data structure. So I’m wondering if you have any thoughts on that side in the light of the thesis implementation, which sounds like it fits with that model?

I’ll have a play with the tree-crdt examples, hopefully today.

That might or might not happen. The author of rust-crdt would need to accept/merge the changes. Also, we are moving towards small, focused crates now, eg the xorname crate. So I may end up making a dedicated crate for crdt-tree (which rust-crdt could choose to include or not). I’m still considering the best path here, but leaning towards the dedicated crate.

Both LSeq (Sequence) and crdt-tree transmit operations between replicas, not full state (CmRDT, not CvRDT). For LSeq the ops are Insert and Delete; for crdt-tree, there is only the Move operation. Or perhaps I missed your meaning?

You’re right, I was confusing Sequence data type with LSeq CRDT. So my question stands, but I’m asking it about the SAFE data types Sequence and FileTree rather than the underlying CRDTs.

My understanding is that the Sequence data type is loaded/saved as a single blob, and that each replica has a complete copy. I could be wrong about that - I tried to clarify How SAFE Sequence CRDT Data Type is implemented, so this is the basis of my understanding - and from there I thought you planned to implement FileTree in a similar way, at least to start? I don’t have a picture of how the TreeCRDT fits into this, so maybe I just need to understand your plans for implementation (e.g. block diagram stuff) better.

So basically any op-based CRDT can be thought of as an ordered sequence of operations, which are stored locally in a log. For any replica to become “up to date” starting from zero, it must download the log entries from a peer and replay them. Each application of an operation results in a different local state. At any point, this local state can be saved to disk or loaded from disk, which is faster than replaying all log ops from the beginning. In this sense, one could think of a Sequence as being loaded/saved as a single blob. I haven’t dug into the Sequence impl myself (beyond LSeq), so I can’t say for certain whether bootstrapping from peers is performed via log replay or by requesting the complete current state, though I would imagine the former as a first safe step, since the latter (I think) implies trusting the peer implicitly. @bochaco? Anyway, once a replica is “caught up”, new operations must be transmitted/received/processed individually as they occur.

At least, this is my understanding, but again, I haven’t dug into the Sequence/AT2 code yet.
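As a toy illustration of that bootstrap trade-off (nothing to do with the actual Sequence/AT2 code), a replica can be rebuilt either by replaying the whole op log from zero or by loading a saved snapshot and applying only the ops that came after it:

/// Toy op-based replica: state is just an append-only list of strings, and an
/// op is "append this string". Real CRDT ops also carry causal metadata.
#[derive(Default)]
struct Replica {
    state: Vec<String>,
    ops_applied: usize,
}

impl Replica {
    fn apply(&mut self, op: &str) {
        self.state.push(op.to_string());
        self.ops_applied += 1;
    }

    /// Bootstrap from zero: replay every op in the log.
    fn from_log(log: &[String]) -> Self {
        let mut r = Replica::default();
        for op in log {
            r.apply(op);
        }
        r
    }

    /// Bootstrap from a snapshot: load saved state, then apply only newer ops.
    fn from_snapshot(snapshot: Replica, log: &[String]) -> Self {
        let mut r = snapshot;
        for op in &log[r.ops_applied..] {
            r.apply(op);
        }
        r
    }
}

fn main() {
    let log: Vec<String> = ["a", "b", "c", "d"].iter().map(|s| s.to_string()).collect();
    let full = Replica::from_log(&log);
    let snap = Replica::from_log(&log[..2]); // pretend this was saved to disk earlier
    let fast = Replica::from_snapshot(snap, &log);
    assert_eq!(full.state, fast.state); // both routes converge on the same state
}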


I think this just shows that I don’t understand how the SAFE data types are implemented. I tried reviewing the code and put my conclusions in the post linked above, some of which were wrong and then corrected. But things may well be changing under my feet anyway. I shall wait!

In an operation-based CRDT like Sequence we don’t need to store the operations log, only the current state. A new replica gets the current state, and based on the information it contains (vclocks) it can send new operations which can be applied by other replicas.

Update:

  1. crdt_tree has been published on github and crates.io.

There are plenty of usage examples in examples/tree.rs.

Naming of the data structures is a bit awkward and may yet change.

  2. I made an alternative implementation of the filesystem.php prototype that uses ino values as tree identifiers, with the u64 ino segmented into 16 bits for the actor and 48 bits for the inode identifier.

Wins:

  • a simpler data representation
  • eliminates lookups in local hashmaps.
  • all data is stored in the Tree, so replica states match without requiring filesystem-specific operations.

Some Drawbacks:

  • Each replica must keep a local ino counter and somehow persist it between mounts.
  • Maximum of 2^16 writing replicas. (seems ok)
  • Problem: each replica_id must be unique. How do we map a 256-bit SAFE key to a 16-bit id and guarantee uniqueness?
  • Presently there is nowhere to store things like the lookup() count, but it’s optional anyway.

Even with the drawbacks, I tend to favor this approach, in particular because it fixes the issue of replica states not matching, without requiring inverse operations.
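For illustration, a minimal sketch of the 16/48-bit packing (helper names are mine; note that actor 1 with counter 0 produces the 281474976710656-style ids seen in the output below):

const COUNTER_BITS: u32 = 48;
const COUNTER_MASK: u64 = (1u64 << COUNTER_BITS) - 1;

/// Pack a 16-bit actor id and a 48-bit per-replica counter into one u64 ino.
fn pack_ino(actor: u16, counter: u64) -> u64 {
    assert!(counter <= COUNTER_MASK, "48-bit counter overflow");
    ((actor as u64) << COUNTER_BITS) | counter
}

/// Split an ino back into (actor, counter).
fn unpack_ino(ino: u64) -> (u16, u64) {
    ((ino >> COUNTER_BITS) as u16, ino & COUNTER_MASK)
}

fn main() {
    // Actor 1, counter 0 gives the first id in the state dump below.
    let ino = pack_ino(1, 0);
    assert_eq!(ino, 281_474_976_710_656);
    assert_eq!(unpack_ino(ino), (1, 0));
}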

Here is some output from the alternative implementation that shows the simplified data structure. Notice there is no local state and no ino <-> uuid mappings.

------- fs state after: Home dir created -------
- null => forest
  - 281474976710656 => {"name":"root","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
    - 281474976710659 => {"name":"home","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
      - 281474976710660 => {"name":"bob","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
        - 281474976710661 => {"name":"projects","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
        - 281474976710663 => {"name":"homework.txt","inode_id":281474976710662}
        - 281474976710664 => {"name":"homework-link.txt","inode_id":281474976710662}
  - 281474976710657 => {"name":"fileinodes","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
    - 281474976710662 => {"size":0,"ctime":1598629822,"mtime":1598629822,"kind":"file","links":2,"content":null}
  - 281474976710658 => {"name":"trash","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
------- end state -------

Next, I may attempt another alternative impl using inverse operations just to get a better feel for that approach. It could potentially provide more flexibility later on with regards to replicas storing local state outside the Tree.


There’s an issue regarding how to deal with conflicting names of directory children. @happybeing @here let me know your thoughts please.

https://hackmd.io/JGpiRTCMT1-OTgSSPKHKzQ?view#file-or-dir-name-conflicts
