Allocating inode numbers using pre-allocated blocks

EDIT: it looks like we have a solution which avoids the need for inode numbers to be common across all replicas, and can instead use inode UUIDs allocated independently in each replica. See @danda’s post.

Context

This proposal relates to the implementation of the FileTree CRDT for SAFE Network and features in the discussions here and here [ADDLINK].

We have the problem of how to allocate inode numbers for different replicas so that they don’t clash (the same inode number being used for different filesystem objects in different replicas). Ultimately we require that all replicas agree on the same inode number for the same filesystem object (directory, file or symlink).

An inode number is a u64, so we have enough space to use a hash of a UUID for the object, and avoid collisions so long as there aren’t large numbers of inodes allocated. Some rough calculations were done to explore this here.

Below are alternative ways of allocating inode numbers that could eliminate collisions by accepting some restrictions (optionally set at the time the filesystem is created), or that may make the effects of collisions more manageable even without restrictions. Or not. It’s just an idea we can look into at some point.

Allocating Blocks of ino’s

Suppose that instead of hashing the inode UUID to choose its inode number (ino), every new replica starts by pre-allocating itself a block of ino’s. The second replica will see the allocated range and be able to allocate a block for itself when created, and so on, though the possibility of the same block being allocated by more than one replica is something we will have to handle.
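As a rough sketch, a pre-allocated block might look something like this (all names here are hypothetical, not from the actual FileTree code):

```rust
type Ino = u64;
type ReplicaId = u64;

/// Hypothetical sketch: a contiguous range of inode numbers claimed by
/// one replica, from which it hands out inos sequentially.
#[derive(Debug, Clone)]
struct InoBlock {
    replica: ReplicaId,
    start: Ino,
    len: u64,
    next_free: Ino, // next unused ino within [start, start + len)
}

impl InoBlock {
    /// Take the next ino from this block, if any remain.
    fn take(&mut self) -> Option<Ino> {
        if self.next_free < self.start + self.len {
            let ino = self.next_free;
            self.next_free += 1;
            Some(ino)
        } else {
            None
        }
    }
}
```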

Allocation strategies with limits

Different allocation strategies can be tailored for different use cases, and in many situations could completely eliminate collisions (eg by anticipating the number of replicas and dividing the ino’s between them at the start).

  • setting limits could be a very good solution, so I think we should consider implementing this at least as an option set when a FileTree is created (see the sketch below).
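For instance, if the maximum number of replicas is fixed when the FileTree is created, the ino space can be divided up front and overlapping ranges become impossible. A sketch under that assumption (hypothetical code, not the actual implementation):

```rust
/// Divide the u64 ino space evenly among a fixed number of replicas,
/// agreed when the FileTree is created. Replica `i` draws only from the
/// half-open range [i * width, (i + 1) * width), so ranges never overlap.
fn replica_range(replica_index: u64, max_replicas: u64) -> (u64, u64) {
    assert!(replica_index < max_replicas);
    let width = u64::MAX / max_replicas;
    let start = replica_index * width;
    (start, start + width)
}

fn main() {
    // e.g. a cap of 1024 replicas still leaves each one ~2^54 inos
    let (start, end) = replica_range(5, 1024);
    println!("replica 5 owns inos {}..{}", start, end);
}
```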

But what about this as a general purpose alternative?

Allocation without limits

Now return to the general case where limits are not pre-defined.

This is similar to the UUID hash problem in that collisions can’t be avoided, but maybe they can still be managed better: for example, by being more predictable, or by offering a cleaner way of resolving them with manual intervention.

Recap

To reiterate, the general proposal is that every new replica starts by pre-allocating itself a block of ino’s, and these allocations are stored in the TreeCRDT. The second replica will see the allocated range and be able to allocate a block for itself when created. And so on.

Before its block is depleted, a replica will allocate another, so that it will normally have inode numbers available for exclusive use whenever it creates a new inode.

Because these allocations live in the TreeCRDT, any replica can see which ranges are taken and which are free. (Or rather, which were free up to the point where it last merged the allocations of other replicas.)
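A sketch of how a replica might claim a fresh block against the allocations it has merged so far (the `rand` crate and all names are illustrative):

```rust
use rand::Rng; // illustrative dependency, e.g. rand = "0.8"

/// Pick a random candidate range of `len` inos and check it against the
/// block claims this replica has merged so far. "Free" only means free
/// as of the claims we have seen; a concurrent, not-yet-merged claim
/// elsewhere can still collide with it.
fn try_claim(merged: &[(u64, u64)], len: u64) -> Option<(u64, u64)> {
    let mut rng = rand::thread_rng();
    for _ in 0..64 {
        // bounded retries; saturating_sub keeps start + len from overflowing
        let start = rng.gen::<u64>().saturating_sub(len);
        let end = start + len;
        if !merged.iter().any(|&(s, e)| start < e && s < end) {
            return Some((start, end));
        }
    }
    None
}
```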

Multiple pre-allocations

It may be better that each replica allocates several blocks at random ranges over a period of time and always aims to use the oldest block first, to reduce the chances of a collision invalidating all its pre-allocated ino’s at once.

The more time between allocating ino’s and using them, the lower the chance another replica will have begun using them without seeing the allocation by a different replica. And the smaller the blocks, the lower the chance that all of a replica’s pre-allocations will be caught in a single collision.
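The oldest-first policy could look something like this (again hypothetical names, continuing the sketch above):

```rust
use std::collections::VecDeque;

/// Hypothetical pool of pre-allocated blocks, kept in allocation order.
/// Inos are drawn from the oldest (front) block first, so newer claims
/// get the longest possible time to propagate to other replicas before
/// any of their inos are actually used.
struct BlockPool {
    blocks: VecDeque<(u64, u64)>, // (next_free, end) for each block
}

impl BlockPool {
    fn next_ino(&mut self) -> Option<u64> {
        while let Some((next, end)) = self.blocks.front_mut() {
            if *next < *end {
                let ino = *next;
                *next += 1;
                return Some(ino);
            }
            // oldest block exhausted; move on to the next oldest
            self.blocks.pop_front();
        }
        None // caller should have pre-allocated a fresh block before this
    }
}
```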

Contrast with UUID hash

It is the use of both sequence (in using ino’s inside the block) and randomness (when choosing the range for a block) that distinguishes this strategy from just hashing the inode UUID, which I think is a bit like a special case of the algorithm I’m outlining.

Handling collisions

We can see that at some point there can still be a collision. We can calculate the risk of this, and perhaps come up with an algorithm that makes collisions easier to handle than just using a hash of the UUID.

CRDT log cut-off

It also gives us a way to define the cut off for merging changes (raised here).

‘True’ block ownership could be decided based on a majority of replicas which have already merged the claim. Changes from a replica which has allocated inode numbers from a block which is deemed owned and in use by another replica would not be accepted from that point, hence the cut-off would apply to that replica only.

A managed merge would then have to be performed, with the dissenting replica discarding its rejected changes, re-syncing with the others, and then manually re-applying any discarded changes that are to be retained.
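As a very rough sketch of the majority rule (entirely hypothetical, and as noted below I haven’t done any maths on it):

```rust
use std::collections::HashMap;

/// Hypothetical majority rule for a contested block: each claimant's
/// claim has been merged by some number of replicas; the claim already
/// merged by a strict majority of all replicas wins, and the cut-off
/// then applies to the losing claimants only.
fn decide_owner(
    merges: &HashMap<u64 /* claimant replica */, usize /* merge count */>,
    total_replicas: usize,
) -> Option<u64> {
    merges
        .iter()
        .find(|&(_, &count)| count > total_replicas / 2)
        .map(|(&claimant, _)| claimant)
}
```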

I haven’t done any maths, just jotting down the idea for later consideration. I realise it might be rubbish!


Overall this is a nice idea about pre-allocating blocks, and I will think about it more. But initially I don’t see how it can prevent conflicts, given the asynchronous and local-first nature of the replicas.

What prevents these blocks of inodes from conflicting? Eg, let’s say that replica A starts up, allocates its inodes, etc. Then replicas B and C start up at about the same time. Neither is aware of the other, so they both use the same ino values when pre-allocating blocks.

This problem could be lessened if each replica starts allocating at some random value; however, there doesn’t seem to be any stronger guarantee against conflicts than just hashing a UUID.

Please let me know if I’ve missed something.

I’ve suggested two variations: the first uses limits to try to avoid collisions by fixing allocations at the start, while the second aims to manage collisions and provide a way to recover from them. So I assume you’ve spotted a problem with the first, and I may well have written too hastily.

I think the limits idea hinges on being able to allocate the blocks to each replica (authority), and on each replica being able to identify itself in order to know which block belongs to it. Maybe that can’t be done - is that the issue?

Anyway, food for thought. For now I think we should go with UUID hashes, as we may find there are harder nuts to crack first, and we could keep the risk of colliding ino’s low by imposing some safety limits (an artificial ceiling on the number of inodes, or something like that).

EDIT: I wonder if we can solve the problem of allocating blocks of ino’s to a replica a bit like file locks. Thinking about that, it feels like I’m going round in circles. For example, if we used a Sequence to hold the locks, I think we end up with the same problem, because that’s also a CRDT and we can’t know whether we locked a block or another replica has grabbed it.

We may need a ‘lock’ data type on SAFE! It would need to be non-CRDT, with the lock result decided by the network and returned immediately to the client. Is there a way to do this with the existing data types, I wonder?

Yes, indeed food for thought.

Another idea (with its own problems) would be to partition by replica_id, so that each replica gets its own range/block of inode numbers to allocate from/manage.

Let’s say that we define a maximum number of replicas, for all time. Perhaps 1024 (2^10). Each replica can then have a space of size 2^64/2^10, which is 2^54, ie 18,014,398,509,481,984, still a very large number.

The trouble is that there can/will still be hash collisions even when coming up with the replica spaces. Eg replica_id % 1024 is likely to collide somewhere. And the set of replicas is not known.

I’m by no means an expert in this area. There may be a good solution out there…

Putting a hard cap on the number of replicas (present + future) is a drawback of this approach.


It could be that in practice hashing a UUID to u64 results in so few collisions that we can pretty much ignore it. Conflict resolution could be either first-writer-wins, or last-writer-wins.

I’m inclined to go with this for now.
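A minimal sketch of that approach, using the `uuid` crate and the standard library hasher purely for illustration (the real implementation may well hash differently):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use uuid::Uuid; // illustrative dependency, e.g. uuid = "1" with the "v4" feature

/// Derive a u64 ino from an inode's UUID. Collisions are possible but
/// should be rare; ties would be broken first- or last-writer-wins.
fn ino_from_uuid(id: &Uuid) -> u64 {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let id = Uuid::new_v4();
    println!("uuid {} -> ino {}", id, ino_from_uuid(&id));
}
```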


It looks like we have a solution which avoids the need for inode numbers to be common across all replicas, and can instead use inode UUIDs allocated independently in each replica. See @danda’s post.

[I’ve added this note to the top of the OP.]