FileTree CRDT for SAFE Network

The MemFS/FileTreeFS doesn’t implement Create() but it is defined in for the FUSE Operation trait in op.rs so we could implement it if needed. On the inode side, the Linux VFS docs here appear to say that create() is needed for regular files, but maybe the FUSE kernel stuff allows it to be optional.

I’ve added an implementation document to the SAFE FS repo so ./design now contains:

@happybeing here are a few issues worth discussion design-wise.

  1. How best to implement stateful file handles?

  2. How/where can we store local filesystem data?

  3. What actions do we perform during mount and unmount?

In more detail:

How best to implement stateful file handles?

If a process has an open file handle, it can continue to use it even if the file has been deleted/unlinked by itself or another process. We need to ensure our design accomodates this. I plan to look into it more today. Let me know if you have any insights what the design pattern usually is for this…

How/where can we store local filesystem data?

We have the crdt-tree op logs, (possibly) a cache of the current replica state, and also the local inode entries and ino to uuid lookup indexes.

Typically one thinks of a fileysystem as storing such data within itself, eg in the superblock of a block device.

However, as a network fs, we have no underlying block device. I am thinking that we somehow designate a file on an existing mounted filesystem as our storage area. Similar to (or exactly the same?) as mounting a loopback filesystem such as a cd-rom ISO file.

It seems that any networked filesystem would have similar needs, so hopefully there is a standard way to go about this…?

What actions do we perform during mount and unmount?

In theory, we could download everything from the network at mount, replay logs, and arrive at current state. At unmount, we simply discard everything. Issues with this are that it is slow and isn’t great for “local-first” operation because it means that if we unmount while offline, we can’t re-mount the filesystem until we are online again.

So what seems desirable in the final production system is to locally cache logs and state, so that we can quickly re-mount without a potentially lengthy download.

For phase 1 we are building a local-only filesystem. My thinking is that we should not try to optmize/cache just yet. Instead, at unmount time, we simply write out the operation log to our local store (whatever it is). At mount, we read it in and replay ops. This functionality will be needed for initial mount (no local cache) from the network anyway.

improvement: rather than writing entire log/state at unmount, it should be written as events occur, and then unmount would perform a final sync.

1 Like

I have boat chores to do over the next two days so I’ll give some quick responses now and maybe revisit when I have more time…

I’m not sure, but do we need to do this for FUSE? We have ref counts to hold the inode in memory until not in use, and each can only be open for write by one process so I think we need a memory cache for inodes and one for file data until synced.

So I think we may just need to support cached writes (in memory) until sync. Syncer shows how to do this using multiple threads, more on this in a bit.

Syncer handles this with a local disk cache of inodes, and another of file data split into 1MB chunks (each with its own cache in memory for unsynced data).

I agree we want caching as an option. My thought about adapting Syncer was to start using as much of its existing disk based caching (mentioned above), but then to remove this in favour of a ‘deeper’ cache using SAFE chunks - i.e. the actual raw chunks put/and got from the network.

However, I’m not sure that will work because I was assuming everything was stored as chunks, whereas am not sure that all data types can bet cached in this way. Do you know? For example, is Sequence chunked when put, and if so could we have a local cache of the chunks at the raw network side? From memory I think the whole Sequence is one object, but I’m not sure - I found the code difficult to decipher.

If so, we have can have the benefit of caching not just for FileTree but for all SAFE apps, and the chunks will already be encrypted (although I understand we’ll need an extra layer of encryption on top later.)

So in the spirit of small steps, maybe we can implement a node cache and a file chunk cache (both in memory) like syncer, which syncs to the FileTree and directly to the network.

Later adding a cache for raw SAFE chunks to avoid the need to download everything in every mount, possibly with a special cache of the TreeCDRT of that’s not catered for by a chunk cache.

A decent SAFE cache is probably a whole separate project.

I don’t know if the above is feasible. It’s how I was thinking when trying to adapt Syncer before beginning to work on this with you.

2 Likes

This paper has a pretty good description of how fuse works, including stateful vs stateless handling of opened files/dirs.

Also, not about fuse or even unix specifically, but I found this free book “Practical File System Design” that presents the BE OS filesystem, with mostly posix features. It goes into a lot of good detail.

http://www.nobius.org/dbg/practical-file-system-design.pdf

2 Likes

@happybeing I’d like to update you and get your thoughts on this:

https://hackmd.io/1dBKrtkQTa656k070hxVvw?view

Also, I have some code for the crdt-tree in rust here:

See examples/tree.rs and src/tree/*

It isn’t properly commented yet, needs unit tests, may move crates, etc.

2 Likes

Hi @danda,

Great to see such progress and that your work will be part of rust-crdt! The thesis sounds very interesting from your descriptions. I can’t access it, but as you have now gained such a good understanding of CRDT and FS areas, beyond mine in both I think, I’m not sure I can add much in those areas although I am happy to continue where you think I can still help out. Maybe review, sounding board etc? I’ve been learning Rust so maybe before long there’ll be something I can do on the coding side.

Your summary of the thesis sounds like you have a very good understanding of the issues, and some design options to evaluate. I don’t have much to add about most of the points you explored there as you seem to have covered things well.

One area had me thinking, where you talk about high level operations such as mkdir etc, versus low level tree-move. At present, SequenceCRDT and as I understand it, the plan for TreeCRDT, is that the whole data structure is loaded/saved to the network which is fine as a first step, and can work up to a point but comes with limitations. There are various ways to deal with that, but I’m just reminding myself of the idea that we could transmit operations, rather than load/save the entire data structure. So I’m wondering if you have any thoughts on that side in the light of the thesis implementation which sounds like it fits with that model?

I’ll have a play with the tree-crdt examples, hopefully today.

That might or might not happen. The author of rust-crdt would need to accept/merge the changes. Also, we are moving towards small, focused crates now, eg the xorname crate. So I may end up making a dedicated crate for crdt-tree (which rust-crdt could choose to include or not). I’m still considering the best path here, but leaning towards the dedicated crate.

Both LSeq (Sequence) and crdt-tree transmit operations between replicas, not full state. (CmRDT, not CvRDT). For LSeq the ops are Insert and Delete. For crdt-tree, there is only the Move operation. Or perhaps I missed your meaning?

You’re right, I was confusing Sequence data type with LSeq CRDT. So my question stands, but I’m asking it about the SAFE data types Sequence and FileTree rather than the underlying CRDTs.

My understanding is the the Sequence data type is loaded/saved as a single blob, and that each replica has a complete copy. I could be wrong about that, I tried to clarify How SAFE Sequence CRDT Data Type is implemented, so this is the basis of my understanding and from there, I thought you planned to implement FileTree in a similar way at least to start? I don’t have a picture of how the TreeCRDT fits into this, so maybe I just need to understand your plans for implementation (e.g. block diagram stuff) better.

So basically any Op based CRDT can be thought of as an ordered sequence of operations, which are stored locally in a log. For any replica to become “up to date” starting from zero, they must download the log entries from a peer and replay them. Each application of an operation results in a different local state. At any point, this local state can be saved to disk or loaded from disk. This is faster than replaying all log ops from the beginning. In this sense, one could think of a Sequence as being loaded/saved as a single blob. I haven’t dug into Sequence impl yet myself (beyond LSeq) so I can’t say for certain if bootstrapping from peers is performed via log replay or requesting complete current state, though I would imagine the former as a first safe step, since the latter (I think) implies trusting the peer implicitly. @bochaco? Anyway, once a replica is “caught up”, new operations must be transmitted/received/processed individually as they occur.

At least, this is my understanding but again, I haven’t dug into Sequence/AT2 code yet.

1 Like

I think this just shows that I don’t understand how the SAFE data types are implemented. I tried reviewing the code and put my conclusions in the post linked above, some of which were wrong and then corrected. But things may well be changing under my feet anyway. I shall wait!

In an operation based CRDT like Sequence we don’t need to store the operations log, only current state, a new replica gets the current state and based on the information it contains (vclocks) it can send new operations which can be applied by other replicas.

Update:

  1. crdt_tree has been published on github and crates.io.

There are plenty of usage examples in examples/tree.rs.

Naming of the data structures is a bit awkward and may yet change.

  1. I made an alternative implementation of filesystem.php prototype that uses ino for tree identifiers, but with the ino u64 segmented into 16 bits for actor and 48 bits for inode identifier.

Wins:

  • a simpler data representation
  • eliminates lookups in local hashmaps.
  • all data is stored in the Tree, so replica states match without requiring filesystem-specific operations.

Some Drawbacks:

  • Each replica must keep a local ino counter and somehow persist it between mounts.
  • Maximum of 2^16 writing replicas. (seems ok)
  • problem: Each replica_id must be unique. How to map 256-bit SAFE key to 16 bit id, and guarantee uniqueness?
  • presently nowhere to store things like lookup() count, but its optional anyway.

Even with the drawbacks, I tend to favor this approach. In particular because it fixes the issue with replica states not matching without requiring inverse operations.

Here is some output from the alternative implementation that shows the simplified data structure. Notice there is no local state, no ino <–> uuid mappings.

------- fs state after: Home dir created -------
- null => forest
  - 281474976710656 => {"name":"root","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
    - 281474976710659 => {"name":"home","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
      - 281474976710660 => {"name":"bob","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
        - 281474976710661 => {"name":"projects","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
        - 281474976710663 => {"name":"homework.txt","inode_id":281474976710662}
        - 281474976710664 => {"name":"homework-link.txt","inode_id":281474976710662}
  - 281474976710657 => {"name":"fileinodes","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
    - 281474976710662 => {"size":0,"ctime":1598629822,"mtime":1598629822,"kind":"file","links":2,"content":null}
  - 281474976710658 => {"name":"trash","size":0,"ctime":1598629822,"mtime":1598629822,"kind":"dir"}
------- end state -------

Next, I may attempt another alternative impl using inverse operations just to get a better feel for that approach. It could potentially provide more flexibility later on with regards to replicas storing local state outside the Tree.

2 Likes

There’s an issue regarding how to deal with conflicting names of directory children. @happybeing @here let me know your thoughts please.

https://hackmd.io/JGpiRTCMT1-OTgSSPKHKzQ?view#file-or-dir-name-conflicts

1 Like

Looks like great work @danda, so much to get your head around and you seem to fly through it! I’ve not delved into the code, so just some quick thoughts about the filename conflict at this point.

At first I thought having a conflicts directory was good idea, but thinking about how conflicts arise and what might happen next I think renaming to append the replica ID is the best option (also better than using the actor + clock). The reason is because I think this will be easier to understand and recover from, and may also behave better if not noticed.

Consider a simple case of two replicas A and B like your example, they both create file.txt for different purposes. When A merges B, the result will be:

Situation 1

A has:
file-A.txt
file-B.txt

B has:
file.txt

B will continue working on file.txt unawares until it merges with some version of A. What happens next on A is likely to be one of two things:

  • if the conflict is not noticed and A/file.txt is written again with newer content: the result is that A has three files: file.txt, file-A.txt and file-B.txt. From here on apps can continue using A/file.txt as before. If A merges B again, we return to Situation 1: file.txt again disappears to leave a new version of file-A.txt and the file-B.txt (which may be the same or different depending on whether it was changed on replica B). This situation continues until…
  • the missing A/file.txt causes an error (e.g. application tries to load and fails). The outcome here may be that a user intervenes to recover the situation by creating a new file.txt (or restoring from backup etc) and continuing unaware of the cause, which will recur next time a merge happens and A has a repeat of Situation 1. Alternatively the user (or a savvy app) notices the problem, and chooses a non-conflicting alternative name new-file.txt re-creates its content (e.g. from file-A.txt or a backup) and can now continue without further conflicts.

If at any point replica B merges changes from A before A cleans up the conflict, B will end up with its own Situation 1 (its file.txt disappears with file-A.txt and file-B.txt in its place). And the above two scenarios can arise for B.

How to clean up the conflict? If A cleans up before B merges as follows:

  • A does not re-create file.txt, begins using using new-file.txt and leaves file-A.txt and file-B.txt. When B merges A, B will see the conflicting copies after a merge, but would continue to use its own file.txt and so may not notice that there was a conflict.

Alternatively, A could delete file-A.txt when cleaning up the conflict, but probably not file-B.txt. Consider:

  • A does not re-create file.txt, begins using using new-file.txt and deletes both file-A.txt and file-B.txt, allowing B to continue unaware of a conflict. This seems fine, until you consider a three or more way conflict on a single file, so I think best if a replica only ever deletes its corresponding conflicting copy because deleting the others makes it harder for the other replicas to recover their file.txt. I guess they can still do this from the history, but its just easier to use the conflict-copy although its a bit moot as I think eventually things can still be sorted out (manually).

I think using the replica ID as described is relatively easy for humans to understand and recover from than moving conflicts somewhere else. And using just the replica ID in the name avoids creating unnecessary extra files with each merge, compared to using “Actor + Clock” which would be confusing, and could create large numbers of files over time.

I have a question! The replica log allows the structure of the tree to be recovered as it was at any point in the history, but does this include some, all or no metadata? I’ve been assuming this is all available, but am not sure.

All metadata.

LogMove is defined as (timestamp, parent_id, metadata, child_id, oldp), where oldp is (parent_id, metadata) and represents previous state of child_id before this operation was applied, or None if it didn’t exist.

still thinking about the rest.

2 Likes

@happybeing I did some basic experiments with filename conflicts, and made a writeup.

I wouldn’t say the matter is settled, but perhaps I’ve shed a bit more light on it.

Oh that’s tricky indeed, naturally I hadn’t thought of that. I think you have a good way to proceed and that we should make a note that this is an area it might be good to improve, or maybe not… :smile:

I think that in situations where the same app is being used in different replicas, and where it generates names automatically, then merge errors could create s lot of pain for users, in others it will rarely happen and be easier for them to handle when it does (eg where users have named things manually).

However, I don’t think any CRDT scheme can solve the former and it will instead require adaptation in the apps or in workflow by the users, so ultimately your current plan may be more than good enough.

Very nicely worked through @danda.

@happybeing so there’s something I’m unclear on that maybe you can test out, with the fuse filesystem template you created.

I’m curious what happens if the filesystem allows two dir entries with the same name to be created. Does fuse (or kernel) care?

My guess is that what would happen is:

  1. readdir() returns all names, so ls would show the dups. (unless some layer detects the dups and errors out?)

  2. lookup() can only return a single entry, so it would find first match and dup(s) ignored.

This would effectively make dups visible, but inaccessible.

I’m not suggesting this is a good approach in any way, I’m just kind of curious what would happen.

My guess would be that the filesystem implementation determines this behaviour and would detect and refuse to create a duplicate (as in directory exists). FUSE can just pass things through.