EDIT: it looks like we have a solution which avoids the need for inode numbers to be common across all replicas, and can instead use inode UUIDs allocated independently in each replica. See @danda’s post
We have the problem of how to allocate inode numbers for different replicas so that they don’t clash (i.e. so the same inode number isn’t used for different filesystem objects in different replicas). Ultimately we require that all replicas agree on the same inode number for the same filesystem object (directory, file or symlink).
An inode number is a u64, so we have enough space to use a hash of a UUID for the object and avoid collisions, so long as there aren’t large numbers of objects. Some rough calculations were done to explore this here.
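To get a feel for the numbers, here is a minimal sketch of the standard birthday-bound approximation for hash collisions in a 64-bit space. It is only illustrative and is not a substitute for the rough calculations linked above; the function name is my own.

```rust
/// Birthday-bound estimate of the chance that n inodes, each assigned a
/// hash-derived u64 ino, produce at least one collision:
///   p ≈ 1 - exp(-n(n-1) / (2 * 2^64))
fn collision_probability(n: u64) -> f64 {
    let n = n as f64;
    let space = 2f64.powi(64);
    1.0 - (-n * (n - 1.0) / (2.0 * space)).exp()
}

fn main() {
    // Even very large inode counts leave the collision chance small.
    for n in [1_000_000u64, 1_000_000_000, 1_000_000_000_000] {
        println!("n = {:>13}: p ≈ {:e}", n, collision_probability(n));
    }
}
```

With a million inodes the estimated probability is on the order of 10⁻⁸, and it only becomes appreciable around a billion inodes, which matches the intuition that a u64 is roomy enough for most filesystems.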
Below are alternative ways of allocating inode numbers that could eliminate collisions by accepting some restrictions (optionally at the time the filesystem is created), or may make the effects more manageable even without restrictions. Or not. It’s just an idea we can look into at some point.
Allocating Blocks of ino’s
Suppose that instead of hashing the inode UUID to choose its inode number (ino), every new replica starts by pre-allocating a block of ino’s. The second replica will see the allocated range and be able to allocate a block for itself when created. And so on. The possibility of the same block being allocated by more than one replica is something we will have to handle.
Allocation strategies with limits
Different allocation strategies can be tailored for different use cases, and in many situations could completely eliminate collisions (eg by anticipating the number of replicas and dividing the ino’s between them at the start).
- setting limits could be a very good solution, so I think we should consider implementing this at least as an option set when a filesystem is created
But what about this as a general purpose alternative?
Allocation without limits
Now return to the general case where limits are not pre-defined.
This is similar to the UUID hash problem, in that collisions can’t be avoided, but maybe they can still be managed better, for example by being more predictable, or by offering a cleaner way of resolving them with manual intervention.
To reiterate, the general proposal is that every new replica starts by pre-allocating itself a block of ino’s, and these allocations are stored in the TreeCRDT. The second replica will see the allocated range and be able to allocate a block for itself when created. And so on.
Before its block depletes, a replica will allocate another, so that it will normally have inode numbers available for exclusive use whenever it creates a new inode.
Because these allocations live in the TreeCRDT, any replica can see which ranges are taken and which are free (or rather, which were free up to the point where it has merged the allocations of other replicas).
It may be better that each replica allocates several blocks at random ranges over a period of time and always aims to use the oldest block first, to reduce the chances of a collision invalidating all its pre-allocated ino’s at once.
The more time between allocating ino’s and using them, the lower the chance another replica will have begun using them without seeing the allocation by a different replica. And the smaller the blocks, the lower the chance that all of a replica’s pre-allocations will be caught in a single collision.
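A rough sketch of the per-replica allocator described above. The block size, the splitmix64 pseudo-random generator, and the use of a local `VecDeque` are all stand-ins for illustration; in the real scheme the claimed ranges would be published in the TreeCRDT so other replicas can see and avoid them.

```rust
use std::collections::VecDeque;

const BLOCK_SIZE: u64 = 1 << 16; // illustrative; real block sizing is open

/// A pre-allocated range of ino's; `next` tracks sequential use within it.
struct Block {
    start: u64,
    next: u64,
}

/// Per-replica allocator: claims blocks at pseudo-random ranges and hands
/// out ino's from the oldest block first.
struct InoAllocator {
    blocks: VecDeque<Block>, // oldest block at the front
    rng_state: u64,          // splitmix64 state, a stand-in for a real RNG
}

impl InoAllocator {
    fn new(seed: u64) -> Self {
        Self { blocks: VecDeque::new(), rng_state: seed }
    }

    /// splitmix64: a tiny deterministic PRNG, used here only to illustrate
    /// picking block ranges at random.
    fn next_random(&mut self) -> u64 {
        self.rng_state = self.rng_state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.rng_state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Claim a new block at a random, block-aligned range. In the real
    /// scheme this claim would also be recorded in the TreeCRDT.
    fn allocate_block(&mut self) {
        // Drop the top bit so `start + BLOCK_SIZE` cannot overflow here.
        let start = (self.next_random() >> 1) & !(BLOCK_SIZE - 1);
        self.blocks.push_back(Block { start, next: start });
    }

    /// Take the next ino from the oldest block, topping up when depleted.
    fn next_ino(&mut self) -> u64 {
        loop {
            if let Some(block) = self.blocks.front_mut() {
                if block.next < block.start + BLOCK_SIZE {
                    let ino = block.next;
                    block.next += 1;
                    return ino;
                }
                self.blocks.pop_front(); // depleted, move to the next oldest
            } else {
                self.allocate_block();
            }
        }
    }
}

fn main() {
    let mut alloc = InoAllocator::new(42);
    let (a, b) = (alloc.next_ino(), alloc.next_ino());
    // Within a block, ino's are handed out sequentially.
    println!("first two ino's: {a} {b}");
}
```

Using the oldest block first maximises the time between claiming a range and consuming it, which is exactly the window in which other replicas can merge the claim and steer clear.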
Contrast with UUID hash
It is the use of both sequence (when using ino’s inside a block) and randomness (when choosing the range for a block) that distinguishes this strategy from just hashing the inode UUID, which I think is a bit like a special case of the algorithm I’m outlining.
At some point a collision can still occur. We can calculate the risk of this and perhaps come up with an algorithm that makes collisions easier to handle than just using a hash of the UUID.
CRDT log cut-off
It also gives us a way to define the cut off for merging changes (raised here).
‘True’ block ownership could be decided by a majority of the replicas which have already merged the claim. From that point on, changes from a replica which has allocated inode numbers from a block deemed owned and in use by another replica would not be accepted, so the cut-off would apply to that replica only.
A managed merge would have to be performed, with the dissenting replica discarding its rejected changes, re-syncing with the others, and then manually re-applying any discarded changes that are to be retained.
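One possible reading of the majority rule, sketched below. The replica names, the `votes` shape, and the strict-majority threshold are all invented for illustration; how votes would actually be gathered from the CRDT is left open.

```rust
use std::collections::HashMap;

/// Decide 'true' ownership of a contested block: the claimant whose block
/// claim a strict majority of replicas have already merged wins. `votes`
/// maps each replica to the claimant it has merged for this block.
fn block_owner(votes: &HashMap<&str, &str>) -> Option<String> {
    let mut tally: HashMap<&str, usize> = HashMap::new();
    for &claimant in votes.values() {
        *tally.entry(claimant).or_insert(0) += 1;
    }
    let total = votes.len();
    tally
        .into_iter()
        .find(|(_, count)| *count * 2 > total) // strict majority
        .map(|(claimant, _)| claimant.to_string())
}

fn main() {
    // Replicas A, B and C have merged A's claim; D has merged its own.
    let votes = HashMap::from([("A", "A"), ("B", "A"), ("C", "A"), ("D", "D")]);
    println!("{:?}", block_owner(&votes)); // A holds a 3-of-4 majority
}
```

With no strict majority (e.g. a two-way split) this returns `None`, which would presumably mean deferring the decision until more replicas have merged one claim or the other.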
I haven’t done any maths, just jotting down the idea for later consideration. I realise it might be rubbish!