FileTree CRDT for SAFE Network

happybeing · July 18, 2020, 5:09pm

@happybeing’s notes on the following excellent talk:

CRDTs: The Hard Parts — Martin Kleppmann’s talks (link)

[Video see 35:30]
by Martin Kleppmann very clear Cambridge Academic.
The point of this talk is to show how to implement CRDTs so they behave well, which is the hard part. [10:20]
Topics:

Interleaving anomalies in text editing
Moving (re-ordering) list items
Moving sub-trees of a tree
Reducing metadata overhead of CRDTs

Intro to Operational Transformation and CRDTs

A very good intro to updating shared data by Operational Transformation (Google Docs pre-2006 tech) and Conflict-free Replicated Data Types (post 2006 tech)

[03:50] Operational Transformation (Google Docs pre-2006 tech)
[06:40] OT relies on assumption and requirement: all communication must go via a central server

[08:00] Conflict-free Replicated Data Types (post 2006 tech)
Does not require a central server.
Convergence is guaranteed if any two nodes have seen the same set of operations (even if in a different order), they will be in the same state.
But the final state may not be good "CRDTs are really easy to implement badly" -> its actually a hard problem to get CRDTs to satisfy user expectations.
The point of this talk is to show how to implement CRDTs so they behave well, which is the hard part.

Interleaving anomalies in text editing

[10:38]
(Published at PaPoC 2019)
Every character has a unique id, which remains the same forever. Ids are chosen to preserve the ordering within the document. Document can be implemented as a set of tuples (which include an element for the source node).
[16:00] Jumbles of letters can arise when nodes choose the same number for a given position.
[17:40] Table of six CRDT algorithms and how well they do this. Two avoid this issue very well: Treedoc and WOOT, but these are very inefficient…
[19:30] RGA has a lesser interleaving problem, but he has solved this and published this improvement to RGA

Moving (re-ordering) list items

[25:50] All the preceding algorithms support insert and delete, but not move. If you implement move with insert and delete, concurrent moves of the same item result in duplication.
[29:30] Can be solved with existing ‘last writer wins register’ CRDT technique. All CRDTs have an identifier that can be used to implement this.
[32:38] gets more difficult if you want to move more than one item at a time
[34:30] this is an unsolved problem!

Moving sub-trees of a tree

[35:30] Concurrent moves of the same node is a similar problem to moving elements in lists
Desirable outcome (for trees) is to use ‘last writer wins’ (LWW) behaviour to select one or other of the moves as the deterministic result.
[38:22] **Trees are harder than lists, have additional complications. ** Example, moving a-b to b (moved node is an ancestor of the destination).
Our CRDT could disallow this because it breaks the tree (making it a graph), but unfortunately we have the problem of concurrent changes.
[39:46] Best outcome is again to have an LWW choose which of the moves happens and that becomes the deterministic result (effectively choosing one or the other move, and ignoring the other).
Testing on Google drive, it hasn’t solved this!
[42:55] Solution is to give each operation a time stamp
Each node has its own series of operations. To merge two overlapping sequences, apply the changes from the other in time order and if necessary, undo your own operations so that the next operation being merged is applied to the one just preceding it. Operations can then all be applied in time order.
[47:xx] Performance good enough for local users moving stuff around their file system (e.g. 600 moves per second), but not so good for big data.

Key slide
[48:00] How it works!

struct MoveOp {
  TimeStamp		time;		* Globally unique (e.g. Lamport timestamp. See also Difference between Lamport timestamps and Vector clocks on stackexchange.)
  NodeId		parent;	// Destination of move
  Metadata		meta;		// e.g. filename within parent directory
  NodeId		child;		// subtree being moved
}

When performing the move we create a log entry which is a copy of the MoveOp plus two additional fields:
Option oldParent; // empty if previously not in the tree
Option oldMeta; // empty if previously not in the tree

Key slide
[50:14] Tree is a set of triples (parent, meta, child)
[50:54 ] Formal definition of the move operation
[52:18] We proved several theorems about this:

every tree node has a unique parent
the tree contains no cycles
it’s a CRDT

Note: using NodeId allows us to ‘steal’ the node from whereever it currently is, so we don’t care where it is when we take it.

☐ Does this conflict with ‘path based’ cf. @danda 's quandary?

Reducing metadata overhead of CRDTs

[53:13] Making CRDTs more efficient - can easily have 100 bytes of metadata for one byte of the document!
Their work on making CRDTs is being done in their automerge CRDT (github: javascript version, Rust version)
“A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.”
Results table for different degrees of pruning using their solution, and without it.
[1:01:20] How this works…

store all insertion and delete operations
each op has a lamport timestamp as its id
each op references its predecessor
store ops in document order
[1:02:25] Figure: Columnar encoding - each row is one operation. Last column records the operation which deleted this character (which might have more than one value if it was deleted by more than one user).
[1:04:13] Encode the table using some simple tricks…
key point:
[1:08:00] This demonstrates that tombestones are not that significant (enables CRDT merging with any other CRDT?) in terms of size, compared to using an efficient binary encoding format such as the one presented here.
References:
[1:09:45] Refs for each of the CRDT algorithms mentioned: Logoot, LSEQ, RGA, Treedoc, WOOT, A-strong
[1:09:55] Refs for publications Martin has contributed to:

Interleaving anomaly (2019)
Proof of no interleaving in RGA (2018)
Moving list items (2020)
Move operation in CRDT trees (Preprint)
Reducing metadata overhead (2019) https://github.com/automerge/automerge-perf/blob/master/columnar/README.md
Local-first software (2019) (online):

In this article we propose “local-first software”: a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

[end] Useful links: https://crdt.tech/, https://twitter.com/martinkl, mk428@cl.cam.ac.uk

happybeing · July 19, 2020, 12:42pm

This is my response to your post on the community forum.

@danda: Sure, if you can send me a pointer [to your info on Syncer], that would be great. I’ve taken a brief look at syncer, as well as surveying various other fuse based systems, rust and beyond. Initially though, the focus will really be on getting the CRDT data type working.

I suggest you ask when needed and I’ll give you the latest then, as I may be updating my notes over time.

@danda: Until we get the Tree CRDT type translated to rust and are able to test it out, this is all just theory. Even then, there could very well be some showstopper. So in that sense, you might want to continue your efforts against the present File API. However, if all goes as we hope it will, that API would be replaced. (Though a thin File API compat layer could possibly live on, not my call.)

As I begin to think about how CRDTs can be implemented I’m coming up with questions around what the approach is currently, and the goals in terms of functionality and acceptable performance.

I’ll put these into a separate post, and it would help to have an understanding of how CRDTs are currently being used (e.g. for SequencedData) and if the FileTree CRDT is intended to follow a similar pattern, i.e. resolving all network copies to a definitive result, rather than providing the ‘local-first’ functionality envisaged by Marting Kleppmann.

I imagine the intention is the former, but I think there could well be applications for local-first uses too such as a caching file-system like Syncer where a portion of the FileTree (maybe sub-trees) are cached to local disk but pulled from the network when you want to work on them.

@danda: That’s great. Ok, so here’s a couple things I could use some assistance with:

1 Translate Isabelle/HOL formalism to Rust, or even to pseudo-code. There is also a Scala implementation that was machine generated from Isabelle/HOL but it is (a) inefficient and (b) extremely obfuscated, so I think not a good starting point. If you or anyone here can assist with this, it would be amazing.

I agree. They do have two Scala implementations, one I think using the more efficient hashmap mentioned in Martin’s video, but it would be tricky to translate that code into Rust and the result would not be any more understandable or maintainable.

I looked for Isabelle/HOL to Rust and Scala to Rust generators without success, but for maintainability those might not be good routes anyway. They could help (for quick prototyping, evaluation and reference implementations) but might not be suitable for deployed code. Regardless, I didn’t find anything.

Looking at Isabelle/HOL I’m recoiling from learning that (the manual is over a thousand pages and I’m not good with mathematically expressed stuff), but not ruling it out. I may try pseudo coding Martin’s explanations in the video and the paper as these seemed fairly easy to follow at first reading/listen, but the Isabelle/HOL might still be a good option, and maybe not as hard as it looks (but that Scala code suggests it will be!).

2 I’m still trying to decide if its better to use the Fuse low-level (inode based) or high-level (path based) API. Keeping in mind of course that we would like this to work on Windows also. Path based seems easier, but has limited support in rust/unix… ie only fuse-mt. I’d be curious to hear your thoughts on that, and how an inode u64 might be used/mapped if going that route.

I think I understand, but can you clarify what you mean by path based in case I’m missing something. Feel free to dump docs on me at any point where that is more efficient for you, or just might be useful. I don’t mind working out what to read/skip myself.

@danda: In any event, the idea will be that we have a FileSystem API/crate that can be called directly by SAFE apps, and is also used/wrapped by FUSE daemons (for unix/mac/win). So we will need these wrapper daemons, and that may be something you can get started on…

Yep! I’m excited about this. I can see me helping with bits and pieces to support you, and also us each building from one end (e.g. me FUSE and you safe-nd) and meeting in the middle.

@danda: Hopefully the above gives you some ideas. I will share some design docs as things flesh out a bit more.

Can’t wait! Just to repeat: feel free to dump docs on me at any point where that is more efficient for you, or just might be useful. I don’t mind working out what to read/skip myself.

happybeing:

danda · July 20, 2020, 5:50pm

@happybeing great writeup of the video, thanks for that!

As I see it, some high-level goals are:

create a more flexible filesystem API than we presently have.
provide a FUSE abstraction for legacy apps to utilize SAFE filesystem.
avoid apps blocking on network requests (as much as possible)
enable working with filesystem offline
handle concurrent modifications well. (strong eventual consistency)
as performant as possible within above constraints.

I’m not sure I see the distinction. As I think of it, local first is kind of a side-effect of strong-eventual-consistency. ie, the local user/client is maintaining a sort of replica/fork of the tree (or a piece of it) and may work offline for a period, with changes being synced to/from the network when they go online again.

@bochaco could provide details on how Sequence is presently implemented/used.

That is definitely the intent. @oetyng’s Safe.NetworkDrive already does this. The problem with it, as I understand, was that merge conflict resolution was still mostly an unsolved problem. Also, storing tree data in a sequence is sub-optimal. The crdt-tree datatype provides a native tree data structure and resolves the merge conflicts via strong eventual consistency… at least for the file-tree metadata. File contents are still an open issue, design-wise. Probably it will continue to be ImmutableData. Though in theory it could exist as a metadata field in the crdt tree itself.

correct, that way lies madness.

right, me too.

I think you’re off the hook on that one. I was able to make a little prototype of the algo in PHP, that seems to work. I plan to translate it to Rust once I’ve run some concurrency tests and verify that behavior.

The C libfuse exposes two API layers that fuse filesystems can choose to utilize. The “low level” API uses u64 inode handles while the “high level” API uses paths. A lot of C based projects use the low-level API, but for example ssh-fs uses the high-level. fuse-rs only implements the low-level API.

Anyway, I’ve made a writeup/survey here:

Also, this document is rather pre-mature/rough as kind of a brainstorming effort, but it should help you get the thinking a bit. I discovered crdt-tree sort of late in the game, so it’s not exactly a cohesive design yet.

Sounds great! As you’ll see from the brainstorming doc, I have kind of a high level design in mind, but haven’t come up with finer details such as an API specification, though I think we can kind of use fuse API and C lib filesystem calls as a starting point for the needed functions. For the time being, I’m just trying to get the crdt-tree working, so we can start to get some hands on experience with that and see where it leads us. If you’d like to stub out a fuse impl, that could be a starting point on the other end. We could probably go with path based (fuse-mt) for now.

It would also be helpful to get some benchmark numbers against the current filesystem API using the mock network, so we can later show improvement (or not). This is a bit tricky though, because it serializes to/from the fake_vault_data.json file, and very very quickly the serialization dominates the CPU as that json file rapidly grows. That issue would need to be dealt with somehow, so maybe more trouble than its worth for now.

happybeing · July 21, 2020, 11:35am

Thanks @danda. The high level goals are very helpful - I’ve added them to the OP (this link goes directly to the heading: High Level Goals) so we can maintain and update them there, maybe add more detail as things clarify.

Wrt local-first, I think I now have an understanding of how Sequential CRDTs work (but maybe @bochaco can confirm). I summarised it in this post, but the part which I think differentiates from local-first is as follows:

local-first: ala Martin Kleppmann (link) entails each author/client having their own complete copy of the CRDT (content and CRDT metadata). Updates are made locally, but periodically, the changes (metadata) are shared with others who also have their own complete copy, possibly including different changes. When a client receives changes, it can merge them into its own copy and any two copies will converge on the same state when they have incorporated the same changes.
Sequence CRDT: is implemented so that only vaults hold a reference copy of the Sequence (content and CRDT metadata). When a client/author inserts/deletes it must first obtain a local copy (which is cached in memory to reduce fetches), and then applies the mutation locally. As soon as possible it also sends the mutation (CRDT operation) to all vaults which are looking after the reference copy, and as all the vaults merge the changes, each copy will reach the same ultimate state.

I think there are merits in both approaches, particularly for a filesystem. We’ll need to think about the trade-offs which would vary depending on application context.

Separately, I’ve been wondering about using a single FileTree CRDT analogous to a FilesContainer, but implemented as I’ve described for the Sequence CRDT. Also whether it is feasible to support nested FileTree CRDTs (I think it is) and whether than helps. I suspect that is a thing for later, but worth thinking about now even if the intention is to start simple (which I think would be wise!).

I think it may be too late, but at least we suffer the same madness

Excellent news. Well done man. Phew

I haven’t come across path based FUSE so will read your links. Thanks. My instinct is to stick with nodes unless we have good reason but I’ll read up and we can butt heads as necessary.

The following look sensible. Let’s return to them once I’ve read and thought a bit more.

danda:
Sounds great! As you’ll see from the brainstorming doc, I have kind of a high level design in mind, but haven’t come up with finer details such as an API specification, though I think we can kind of use fuse API and C lib filesystem calls as a starting point for the needed functions. For the time being, I’m just trying to get the crdt-tree working, so we can start to get some hands on experience with that and see where it leads us. If you’d like to stub out a fuse impl, that could be a starting point on the other end. We could probably go with path based (fuse-mt) for now.

It would also be helpful to get some benchmark numbers against the current filesystem API using the mock network, so we can later show improvement (or not). This is a bit tricky though, because it serializes to/from the fake_vault_data.json file, and very very quickly the serialization dominates the CPU as that json file rapidly grows. That issue would need to be dealt with somehow, so maybe more trouble than its worth for now.

bochaco · July 21, 2020, 12:53pm

The client also keeps a replica, each operation is made locally and right after broadcasted to the network’s replicas/Elders, this gives you the local-first/offline feature already. Depending on what type of client/app, it may be syncing the local replica with the remote replicas more or less often.

happybeing · July 21, 2020, 1:10pm

Thanks Gabriel, I missed that. This raises more questions! Presumably that has to be fetched as there is no client side storage AFAIK?

Please respond on the Sequence topic so we can gather the info there, and also check what I’ve written so far. I’ll ask my supplementaries there too.

happybeing · July 21, 2020, 5:47pm

@danda I made a diagram from your Code Layout – Thoughts

danda · July 21, 2020, 6:26pm

Looks great, thx! I’ve added it to the doc.

Some considerations:

safe-filetree probably needs to depend on safe-client-libs (at least) for network access.
There may be a need for some things in safe-api, I’m not sure yet.
safe-filetree could be considered a component of safe-api. At least safe-api may provide a higher level wrapper around it for certain operations.
I would like/prefer for safe-filetree to exist in its own crate. It remains to be seen if that would create a circular dependency with safe-api.

danda · July 21, 2020, 6:57pm

@happybeing I think your thoughts about tradeoffs with localfirst are good. In some situations, one may want to be interacting directly with the network data (effectively cache_size=0) and in others you might want to have a full local copy for working offline. Ideally we would be able to support both scenarios. More experiment/thinking/design will be needed around that.

For a first impl/prototype, I think it is fine just to say that local is a full copy.

nesting/linking seems to me something that could potentially evolve once we have building-blocks in place. But yes let’s keep it simple for now. crawl, walk, run.

Actually, I believe you have. syncer uses fuse-mt, which is one of the few (only?) rust fuse implementations that is path-based rather than inode.

Any inode based solution will have to somehow map the u64 inodes to filetree identifiers. I haven’t thought deeply about how to do that, so it’s an unsolved problem right now. If you would like to look into it, a starting point could be to review the survey/list of fuse implementations I made, and see how some of those projects go about it in their code.

danda · July 21, 2020, 8:13pm

Some initial results with my little php implementation of the tree crdt algo:

case 1

demonstrates that concurrent moves resolve as per paper. r1 moves a under b, r2 moves a under c. resolves to a under c, using last-writer-wins.

$ php tree.php 
Initial tree state on both replicas
- /
  - root
    - a
    - b
    - c

replica_1 tree after move
- /
  - root
    - b
      - a
    - c

replica_2 tree after move
- /
  - root
    - b
    - c
      - a

replica_1 state matches replica_2 state after each merges others change.  conflict resolved!
- /
  - root
    - b
    - c
      - a

Case 2

Concurrent moves that could introduce a cycle.

$ php tree.php 
Initial tree state on both replicas
- /
  - root
    - a
      - c
    - b

replica_1 tree after move
- /
  - root
    - a
      - c
      - b

replica_2 tree after move
- /
  - root
    - b
      - a
        - c

replica_1 state matches replica_2 state after each merges others change.  conflict resolved!

--replica_1 --
- /
  - root
    - a
      - c
      - b

--replica_2 --
- /
  - root
    - a
      - c
      - b

I will put the code up on github a bit later after some cleanup.

Notes

There is both a “/” node and a “root” node in the tree output. This is because the algo allows for multiple nodes without a parent, to eg support the filesystem root and trash existing as top-level nodes, and then a delete can be implemented as a move to trash. But for my print_tree() func, I needed a single node at the top, which is represented by “/”.
My first implementation is made using sets, to follow the original algo as closely as possible. I plan to implement the hash/map based version next.
The algo’s tree data structure is not truly a tree, but rather a set (unordered list) of triples. This makes some operations, such as finding a node’s parent quite inefficient. Even to print the tree as a tree, I have to first convert the triples to a real tree data structure.
Hopefully (3) will be improved somewhat by the map-based algo, as noted by the authors.
This impl cheats a bit by using a global timestamp counter. The paper discusses use of a lamport timestamp, but possibly/probably a vector clock would be better.
I’m unsure if it would be possible to use a real tree data structure and still satisfy the crdt requirements. If so, that seems better. But also I don’t really want to stray from the formally verified proofs/guarantees.

happybeing · July 21, 2020, 8:28pm

Great work @danda. You are progressing at a hell of a rate. I’m chilling watching a film, but also reading about FUSE low level to make up for my earlier ignorance. Of course it was the low level I wasn’t aware of. A few days ago I didn’t know what a vector clock was, a Lamport timestamp, and had no idea how a CDRT worked. This is fun

danda · July 22, 2020, 12:22am

As promised, I’ve put the experimental code up on github here.

I’ve commented as well as I can, including algo descriptions from the paper, so hopefully it is clear enough.

@happybeing if you have a chance to test it out and double-check that the code conforms with the algo description in the paper, that would be great. I could easily have missed something.

See the README.md for install/test instructions.

happybeing · July 22, 2020, 10:04am

Cool, I will add it to my list of tasks. I think that’s a sit down with a clear run job.

Should we / can we: Hard Links

Quick thought: it occurs to me that hard links may be relevant to whether we go for path or inode FUSE (cf. the point you found about them not being handled by FUSE caching in the high level API).

So I was thinking, if we can’t support them in a tree CDRT we may as well use the high level API. I haven’t looked hard at that question, but they look to me like a graph (ie nodes with multiple parents), in which case the tree-move op would be defeated.

So resolving early whether we can or want to support hard links seems useful.

I think it is desirable - they do have their uses, but probably not that many. I used them once for a one off data management system years ago, and I’m aware of one *nix backup script which uses them to make trees of backed up files, where each file is a hard link unless it has actually been modified. There must be lots of niche uses. Work arounds may exist for some uses but of course they will cause pain.

I’m not sure how many non *nix file systems support them, but I expect quite a few don’t, especially in the cloud.

I’d love to support them for completeness and the qdos of being one of the few systems which does, especially a cloud-like system. On the other hand it seems ok to not do them.

To add them later would likely be too big a change for us to bother, so I think it’s worth thinking it through a bit now as the decision may be final.

So, what chance we can support them with the current tree-move? I think that should be easy to answer.

Any thoughts?

danda · July 22, 2020, 5:51pm

I agree it is nice if we can support hard-links. Though I don’t know it is critical in any way. I’m not certain if fuse-mt does or does not support them. Maybe you can check into that?

Assuming it does not, an inode approach may introduce some complications.

So let’s think about child/parent identifiers in a SAFE filetree filesystem. So we start off with:

the filetree datatype (mount point), which has an XorName identifier for reaching it via XorUrl.
File content, which also has XorName identifier.
crdt-tree child_id/parent_id, which can be anything unique we define them to be. So these could be u64, or could be XorName, for example.

XorName is a 32 byte (256 bit) address space. u64 is tiny in comparison.

It could be nice if each parent/child id is itself a XorName. This makes every dir/file/symlink tree node directly addressable. But it it also uses more space storing those ids in the tree. And there isn’t a great way to map a u64 to an xorname due to the size difference. So I don’t really see how its workable in practice.

The other approach is like FileContainers presently do, which is that the FileContainer itself has an XorName but, paths are only relative to it. So one can never reach a directory by XorName, only by XorName+path. In this model, we could use a u64 for node identifier. The smaller address space is (I think) ok, because each FileTree mount point is its own namespace.

We still have the issue that FileContent is directly addressable via XorName. But I think that is fine. We just use an indirection. In the tree, we have nodes that can be of type: dir, symlink, or file. They each have a u64 id and metadata (name, etc). For file nodes, a metadata item can be “xorname”, which leads us to the actual content.

In this way then, we can support u64 inode directly in the tree structure, with no mapping required.

How does that sound?

happybeing · July 22, 2020, 6:47pm

It sounds, er sound and I’ve also been thinking about some of this so my agreement may even be significant, although I’m drinking wine again so that caveat applies.

One thing occurs atm… just a refinement on the point of using separate u64 inode/id space per FileTree.

This is fine except if we want to nest them. This is a special case and can in any case be handled easily if we add that ability, so I don’t think it affects the approach you outline. So the following point still stands, even with nested FileTrees I think.

u64 is still big enough for a large filesystem, which would I think need us to use multiple FileTree objects (or something similar) that would share the same ID space in order to allow file and directory moves (and hard links if we support them).

Been busy boating today so just keeping up and will go through recent posts properly when I get back to the keyboard.

danda · July 23, 2020, 2:08am

I’m not really contemplating nested FileTrees. Can you expand further on what you mean by that, and how it would work?

happybeing · July 23, 2020, 10:30am

It may be a way to limit the size of objects, and also a cache mechanism. I’m looking towards large filesystems.

A FilesContainer for a website is manageable for many use cases, but for a filesystem it will be desirable to have flexibility to help manage large numbers of nodes as people will quickly push them to the limit - eg a single hd backup.

There may be better ways of doing this. Maybe we can discuss options on a separate topic. My reason for thinking about it now is that having an idea of options for this in future can help us improve early design decisions. In this respect though, the hard links question is probably more important. I think scaling can be addressed later because it will have to be one way or another!

How? Nesting could be a bit like a mount, a FileTree inserted instead of a directory. The how and when to use a directory or a FT will be interesting, but I think might be handled with some sensible defaults and the option to adjust these for special use cases (eg parameters to a create function).

As noted, maybe there are better approaches, as Edward’s SafeDrive perhaps. I don’t have much experience in fs imp. or scaling techniques so just thinking on my feet.

danda · July 23, 2020, 2:58pm

I guess I’m not yet understanding the limitation you are trying to address.

As I see it, a FileTree is a mount point. It can hold a u64 amount of inodes, which is a lot, as many as any other unix filesystem.

If people need to transfer between FileTrees, they can do so with standard OS tools, eg mount two FileTree and then copy/move between.

The Safe API could (eventually) support operations across 2 FileTree as well.

what more is needed?

happybeing · July 23, 2020, 3:51pm

The limitation I’m thinking of is the size of the FileTree object.

Making crude guesses, let’s say each node holds:

at least a file/directory/symlink name (avg 10 chars)
directory entry per file entry (about 100 bytes of which, xor name 16B plus metadata about 80B)

Let’s say 100 ish bytes per node, which is about ten FS entries per KB of FileTree. I’ve not included any metadata for the CRDT in that. So for now let’s double this to 200 bytes or 5 entries per KB.

FileTree Entries	FileTree Size	-
1,000	200 KB
10,000	2 MB
100,000	20 MB	Would handle most websites
1,000,000	200 MB
10,000,000	2 GB	Minimum for a filesystem?
100,000,000	20 GB
1,000,000,000	200 GB

Note: CRDT metadata grows with every mutation, so the figures will get ‘worse’ over time. (Absolute max. capacity = 2^64 = 18,446,744,073,709,551,616 entries.)

I guess a million files is useful, but 10M or 100M sounds more reasonable targets in the short/medium term for a filesystem.

Short term we can just have a complete copy and only fetch it from the network when cloning the filesystem. But medium/long term I think we need a strategy which doesn’t store anything on the client and is performant as soon as you use a clean device. I just checked, but not storing data on the client is NOT one of the SAFE Fundamentals. It is though, something I’ve seen @dirvine be quite keen on. Maybe it needs adding?

Limiting the size of the FileTree object may or may not be the way to deal with this, but I’d feel better knowing that someone has some ideas about how to tackle this, or if not, then why its ok to not do so.

danda · July 23, 2020, 5:08pm

Nice calcs. I think they highlight that we need to be very careful/conservative with how we define inode metadata, so as to minimize the bytes/node. It will be instructive to look at how other filesystems do this.

As for the crdt metadata, I believe optimizations will be made over time. For example, in the CRDT: Hard Parts video, they present an algo for compressing the crdt metadata for the automerge project by approx 200x, if I remember correctly. I suspect something similar could be done with the tree crdt data.

First though, we need to prove this thing works, at all. As for @dirvine’s opinion on caching locally and working offline, I’d rather not speak for him.

ps, on linux one can see the number of inodes in use per mount with the df -i command, eg on my laptop:

$ df -i
Filesystem             Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1             1831424  493179 1338245   27% /
/dev/sda5            10379264 1792590 8586674   18% /home

So my root dir (28G) is using 493,179 and my home dir (156G) is using 1,792,590.