Proposed changes in NFS

Over a year ago there was an idea to use only symmetric keys for NFS encryption. Asymmetric public and private keys are redundant in the current implementation, as files and directories are encrypted only to be used by their owners.

In addition to that, symmetric keys allow more granular access to directories: if a user wants to share read access, they only need to give that person the unique symmetric key for that directory. To grant write access, a user just adds the other person’s public signing key to the list of directory owners.

We’ve decided to implement these changes this week, and it looks like they will affect the file system design in a bigger way than we anticipated. We want to discuss these changes with the community.

Current state of NFS

The current design of NFS allows users to have versioned directories and files. These are useful when a user wants to restore deleted files or roll back changes. However, if we allow shared access to directories and use unique symmetric keys per directory for encryption, this feature starts to get complicated.

First of all, each time we modify a file or a directory, the parent directory’s metadata has to be updated too, because NFS directories store the metadata of their children (names, sizes, timestamps, user metadata and DataMaps, as discussed further down this thread) in their own metadata structure.
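
As a rough sketch of that layout (assumed, for illustration only - these names are not the actual safe_core definitions; Time and FileMetadata are as in the structs quoted later in this thread):

// Illustrative sketch only - the names and fields here are assumptions,
// not the real safe_core types.
struct DirMetadata {
    name: String,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
}

struct DirListing {
    metadata: DirMetadata,
    sub_dirs: Vec<DirMetadata>, // metadata of child directories
    files: Vec<FileMetadata>,   // per-file metadata, including each file's DataMap
}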

Now, consider this hierarchy:

root/
|- dir-a
|   `- dir-b
`- file-0

Here are some of the problems and questions we’ve come across:

  1. If root is versioned and dir-b is later removed from dir-a, would we want this to be reflected in root as well?

  2. If we restore something to version v and start making changes there, should it branch off at that point?

For example, if we rename dir-b, its metadata in dir-a is updated, which creates a new version of dir-a (v0 -> v1). Later, files are added to dir-b and other operations are performed on it that do not affect dir-a at all. Now suppose a user chooses to restore dir-a to version v0. It will show the metadata of dir-b as it was when dir-a was at v0; however, if we fetched dir-b using that metadata, we would get the latest dir-b. How should we design things so that the dir-b corresponding to the time when dir-a was at v0 is fetched, and how would this work recursively if dir-b had children of its own?

Proposed solutions

  1. One of the options is to get rid of versioned directories, allowing only files to be versioned. This approach solves many of the raised problems while still allowing files to be restored one by one.

    The proposed structure can be found in this Gist: https://gist.github.com/ustulation/b1009a943ac9deed5331b5d5d1003f20 (you can see the context and discussion there, but we’re aiming for the topmost update-1)

    Advantages:
    This approach simplifies things a lot conceptually. For example, if versioning a directory is really meant to restore an entire tree structure, that is better handled by operations similar to restore-point creation (a snapshot of the entire tree) rather than by using versioning in directories to keep track of it. Directory versioning is also very wasteful in the sense that if a dir has 100 files, a new version of the dir is created whenever any single file is modified (even though the other 99 haven’t changed).

    Disadvantages:
    Versioned directories might still be useful for some users and developers. On the other hand, if this is not a widespread use case, we already provide tools (the low-level APIs) for those who want to build a custom versioning system - so anything more complex could fall into the app developers’ realm.

  2. Another option is to use a flat hierarchy to store files, akin to cloud object stores such as AWS S3. In this scheme we have Buckets instead of Dirs and Objects instead of Files. Buckets don’t have any pointers to subdirectories or parents, and objects don’t have a modification time in their metadata.

    Proposed design can be found here:
    https://gist.github.com/madadam/1248971e056f6b83d687f658fd84c794

    Advantages:
    It simplifies the NFS architecture a lot and makes it easier to understand.
    We can retain versioning of directories/buckets in some form.
    And the file system can still be organised as a hierarchical tree. Consider Amazon’s approach, for example, where the file tree is derived from the object names (an object named “a/b/file” can be represented as the tree “dir a => dir b => file”); a small sketch of this idea follows after this option.

    Disadvantages:
    A large number of objects in a bucket might incur a bigger performance cost than a hierarchical FS.
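
    As a small, purely illustrative sketch of the tree-derivation idea mentioned above (not part of the linked proposal), a hierarchical view can be recovered from flat object names like this:

    use std::collections::BTreeMap;

    // Sketch only: derive a hierarchical view from flat object names such as
    // "a/b/file", the way the S3-style approach described above would.
    #[derive(Default, Debug)]
    struct Node {
        children: BTreeMap<String, Node>,
    }

    fn build_tree(object_names: &[&str]) -> Node {
        let mut root = Node::default();
        for name in object_names {
            let mut node = &mut root;
            for part in name.split('/') {
                node = node.children.entry(part.to_string()).or_default();
            }
        }
        root
    }

    fn main() {
        let tree = build_tree(&["a/b/file", "a/other"]);
        // "a" contains "b" and "other"; "b" contains "file".
        println!("{:#?}", tree);
    }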

  3. The third option is to redesign the file system to make it similar to Git or some other distributed version control system. Instead of applying versioning on a per-file or per-directory basis, we can store versions for the entire file system tree, so that every change is reflected only in the root version. Shared directories can be implemented as separate trees inside the root file system (similar to git submodules); a rough sketch follows after this option.

    Advantages:
    This approach resolves the confusion about versioning of directories and their metadata. Users can snapshot and then roll back the state of the entire file system.

    Disadvantages:
    It’s too complex to implement in a reasonable time, so we’re not really considering this option right now. Another disadvantage would be the performance cost given our view of the network: depending on how it is implemented (pointers or delta storage), it might fall flat due to performance hits. Furthermore, we can stick to the principle of providing the tools (the low-level API) so that anyone who wants to build such a structure can do it themselves, instead of MaidSafe investing effort into it.
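
    For concreteness, a git-like layout could look roughly like this (purely illustrative - these types are made up for this post and nothing of the sort is being implemented):

    // Purely illustrative sketch of the git-like idea: every change produces a new
    // root snapshot; shared directories are separate trees referenced from the root.
    struct Snapshot {
        version: u64,
        root: TreeRef,
    }

    struct Tree {
        entries: Vec<(String, Entry)>, // name -> entry
    }

    enum Entry {
        File(FileRef),      // e.g. a pointer to a DataMap
        Dir(TreeRef),       // nested tree
        SharedDir(TreeRef), // separate tree with its own owners, like a git submodule
    }

    // TreeRef / FileRef stand in for whatever network pointer would be used
    // (e.g. a DataIdentifier); they are placeholders, not real safe_core types.
    struct TreeRef(Vec<u8>);
    struct FileRef(Vec<u8>);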

  4. Possibly some other approach can be used here. If you have ideas, we’d love to hear them!

We need help from the community to answer the following questions:

  1. What NFS features are important to you?

  2. In your opinion, what parts of the NFS API should we prioritise?

  3. What option would you prefer?

Shared write access

There’s another open question about shared write access: currently vaults check that there’s consensus on changes, and for that the data has to be signed with at least half of the owner keys. This means that, for example, in order to modify shared data you have to ask the other owners to sign the StructuredData you constructed, or your changes will be rejected by the vaults.

This is something that must be changed at the vaults’ end somehow - either invent a type-tag that behaves differently, or change the way these checks work by adding some kind of weight field to StructuredData that indicates how much weight each person carries in a modification. In this case the weight would be 100% for each owner, so that any single owner’s signature is enough for the change to be accepted by vaults.
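
To make the two rules concrete, here is a sketch (illustrative only, not the actual vault code) contrasting today’s “at least half of the owners must sign” check with the proposed weight-based one:

// Sketch only - not the real vault validation code.

// Current rule described above: at least half of the owner keys must have signed.
fn accepted_today(valid_owner_sigs: usize, total_owners: usize) -> bool {
    valid_owner_sigs * 2 >= total_owners
}

// Hypothetical weighted rule: each owner carries a weight, and the mutation is
// accepted once the combined weight of the signers reaches the threshold.
// Giving every owner a weight of 100 (with threshold 100) means any one owner suffices.
fn accepted_weighted(signer_weights: &[u32], threshold: u32) -> bool {
    signer_weights.iter().sum::<u32>() >= threshold
}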

6 Likes

While the flat file system does appear to be simple, the performance cost would grow with the number of objects. However, I think it is necessary to actually get a feel for the numbers to make a decision, so I will attempt to provide some perspective which you guys can then comment on.

Every file undergoes Self-Encryption (SE), and the resulting DataMap (DM) is stored in the metadata. As an approximation, a DM stores the pre- and post-encryption hashes of each chunk, and each chunk can currently be at most 1 MiB. The hashes have been brought down from sha512 to sha256, so that’s 64 B per 1 MiB of data. So for a file of 800 MiB (a movie) you will have a DM of size 64 * 800 / 1024 = 50 KiB (if we still used sha512 it would be 100 KiB).

Edit:
[ This assumes big files. If instead of a few (50) 800 MiB files you assume 10000 files of 1 KiB each, then the DataMaps actually store the files inline; in that case the metadata itself grows to approx 10 MiB. So the ~4 MiB figure below, based on examples with only big files, is sort of a best-case scenario - on average it will be worse, as there will be a mix of small, medium and large files. ]

Now, if you have a total of 40 GiB of data, you would have roughly 2.5 MiB in the various DMs. Let’s take another 1.5 MiB for metadata other than the DMs and for other overheads (serialisation, wrappers, versioning overhead if some files are versioned, etc.). So roughly 4 MiB of metadata in total.
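
A quick back-of-the-envelope check of those numbers (a throwaway sketch, not safe_core code):

fn main() {
    // sha256 pre- and post-encryption hashes: 2 * 32 B = 64 B per 1 MiB chunk.
    let bytes_per_mib_of_data = 64.0_f64;

    // An 800 MiB file -> ~50 KiB of DataMap.
    let movie_dm_kib = 800.0 * bytes_per_mib_of_data / 1024.0;

    // 40 GiB of data in total -> ~2.5 MiB spread across the various DataMaps.
    let total_dm_mib = 40.0 * 1024.0 * bytes_per_mib_of_data / (1024.0 * 1024.0);

    println!("DataMap for an 800 MiB file: ~{:.0} KiB", movie_dm_kib); // ~50 KiB
    println!("DataMaps for 40 GiB of data: ~{:.1} MiB", total_dm_mib); // ~2.5 MiB
}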

Now the consequences:

  1. Manipulating anything in that scenario will entail a fetch/modify/write of 4 MiB of data over and above the particular data itself - no matter how small the data or the modification was. We might be tempted to optimise the storage to minimise this, but that might defeat the purpose of using a flat hierarchy - dividing/segmenting it could gradually lead the backend to become a tree hierarchy again.

  2. We cannot store anything above 100 KiB in an SD - so this will be stored as a pointer in the SD to an ImmutableData. Every change would render the previous ImmutableData invalid and generate a new 4 MiB chunk, which means a lot of wastage in the network (none of these chunks are likely to benefit from de-duplication with anything else in the network).

  • This can be avoided if the deletable ImmutableData RFC gets through.
  • Alternatively, we could store it as a doubly linked list of SDs. That would entail the complexity of keeping track of the other SD fields (owners, keys, signatures), would require around 40 SDs to hold the data, and would need the ability to modify SDs in the middle of the chain and to encrypt each one separately (otherwise we would have to modify every one of them whenever any data changed), etc.
  3. A tree structure is better suited (at least in my opinion) to our mental model of organisation. We would not normally want to see 10000 files in a single directory, so the data will need to be extracted and represented as a tree even though it is stored as a flat hierarchy behind the scenes. To make this efficient we will have to code good algorithms: we should not be launching a search over 5000 files on average to find one file out of the 10000 we have, and similarly for name conflicts etc. Simple sorting may not be enough. Consider a tree dir-0/dir-1/dir-2, where we want file a to go in dir-1, file b in dir-2 and file e in dir-1. In a flat hierarchy we can approximate the file names as dir-0.dir-1.a, dir-0.dir-1.dir-2.b and dir-0.dir-1.e. Simple lexicographically ascending sorting gives the order dir-0.dir-1.a, dir-0.dir-1.dir-2.b, dir-0.dir-1.e - so the entries that live directly under dir-0.dir-1 are not grouped together, and to display everything under dir-0.dir-1 we would also need to touch dir-0.dir-1.dir-2.b. Thus we would need a slightly different sorting scheme than what first meets the eye, otherwise there would be massive fragmentation.
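
To make the sorting point concrete, here is a tiny illustrative sketch showing that a plain lexicographic sort interleaves dir-2’s entry between dir-1’s direct children, and that a delimiter-aware scan (S3-style) is one way to recover just the direct children:

fn main() {
    let mut names = vec!["dir-0.dir-1.a", "dir-0.dir-1.dir-2.b", "dir-0.dir-1.e"];
    names.sort(); // plain lexicographic order
    // ["dir-0.dir-1.a", "dir-0.dir-1.dir-2.b", "dir-0.dir-1.e"]
    // -> the direct children of dir-0.dir-1 ("a" and "e") are not contiguous.
    println!("{:?}", names);

    // A delimiter-aware scan recovers only the direct children of dir-0.dir-1:
    let prefix = "dir-0.dir-1.";
    let direct: Vec<&str> = names
        .iter()
        .filter_map(|n| n.strip_prefix(prefix))
        .filter(|rest| !rest.contains('.')) // skip anything nested deeper
        .collect();
    println!("{:?}", direct); // ["a", "e"]
}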

With a POSIX FS hierarchy, all of these problems are negated (unless of course one puts all the data in the same folder). Manipulation cost and network storage wastage go down, and performance improves substantially, for any folder holding less than roughly 1.6 GiB of data (the 100 KiB SD limit divided by 64 B of DM per MiB of data gives about 1600 MiB), since that would be contained entirely within a single SD and manipulating it would not affect the other SDs representing sibling directories or directories on other branches of the tree.

Further, this allows much better concurrency, as many apps can modify their respective folders/data without being mutually excluded. With a flat hierarchy there can only be one app doing a POST at a time, and it will invalidate the POSTs of the other apps because the SD version will have changed, even if they were addressing very different pieces of unrelated data.

7 Likes

@nbaksalyar:

We’ve decided to implement these changes this week, and it looks like they will affect the file system design in a bigger way than we anticipated. We want to discuss these changes with the community.

To what extent are the issues with versioning a consequence of this decision?

Issues with versioning already exist - the (ideal) consequence of this discussion would be to resolve those.

But I doubt I got your question right. Could you elaborate a little more, please?

Also, the current POSIX-style structure for file representation is:

enum File {
    Unversioned(FileMetadata),
    Versioned(Vec<FileMetadata>),
}

struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

I would suggest changing it to:

enum File {
    Unversioned(FileMetadata),
    Versioned {
        ptr_versions: DataIdentifier, // Points to ImmutableData(Vec<FileMetadata>) or another SD::data(Vec<FileMetadata>)
        num_of_versions: u64,
        latest_version: FileMetadata,
    },
}

That way, if only one file out of 100 is versioned, its version list does not grow to the point of kicking the whole thing out of being packed into a single SD, making the other 99 files that could have been packed into the SD suffer.

3 Likes

The quote I included suggests that to some degree the problems are a result of the change (linked to in the OP). So I’m interested in how much is related to that decision, and how much applies regardless of that change - I ask because there really don’t seem to be any stand-out good options here.

Were there any significantly better options without that change?

Ok, breaking it down:

Clarifying further: the problem exists in the current code (without the proposed change) - the dir-level versioning that is there currently gets complex with private shares, and even without them it has unanswered questions about what one would expect from a dir (as opposed to a file) being versioned. If you are looking for snapshot restoration of a sub-tree, it wouldn’t do that, and there are other problems such as syncing version restoration down the sub-tree.

So it’s not related to that decision - rather the reverse. The point about versioning only files and not directories (which happens to be my personal choice too) simplifies and corrects almost all issues related to versioning, with the added bonus of addressing private shares too. Dir sub-tree restoration can then be treated as a separate restore-point-like operation, as explained there, instead of being tied to the versioning operation.

Could you be more specific here? For example, why do you think options (1) or (2) (each separately or in conjunction) are not good enough to warrant an implementation? What would potentially deter you from implementing them? And finally, what would be your suggestion/preference (point 4 there)? If you could answer those, it would provide better insight and help us discuss further.

None that we could see - the previous structure (currently in the code) is there too, marked original in the linked gist:

We left it there so that if someone wants to have a look and suggest a different way forward from that point, we’d like to hear that too. The discussions that led to the proposal of the latest structure are also in that gist and in the comments on it.

Great, thanks for clarifying. I misunderstood the para that I quoted.

I’m saying that there are trade-offs in each case, and that the ideal would be to be able to version directory trees to make rollback simple (universal, no-brainer, no special apps etc.), were it not for the obvious performance issues which rule this out.

So far I’ve only read through once and don’t yet feel I understand the relative merits well enough to offer a preference, and as for an extra alternative… in your dreams man. Though if I get to that point I’ll be sure to let you know!

I hope you didn’t read my query as critical, I was seeking understanding from a (familiar) position of ignorance :slight_smile:

Thanks again for explaining.

1 Like

No way - I hope I didn’t give you that impression either :slight_smile: Though as developers we usually do want our stuff to be critically evaluated - that’s the whole point of putting it out in public, because if we don’t get critical evaluation now, we will down the road, and refactors become exponentially more difficult as you amass sub-optimal designs/code etc. I guess we discuss among ourselves so much that we tend to become very terse, with the intention of avoiding vagueness and being to the point, but it might come across as weird to others - we are not supposed to be sane anyway :smiley:

4 Likes

np, I wanted to check so by all means keep being clear/terse!

2 Likes

I didn’t realize folders were versioned like files are. That sounds like quite a complex feature to have; dropping it can only make the code simpler.

So if I understand correctly, these new changes will allow us to post and delete multiple files at the same time, correct? If that is the case, that’s a pretty big improvement.

2 Likes

Ok - I’ll happily count this as an upvote for Point/Approach 1 :slight_smile:

Yes - provided they don’t update the same SD. So if they are in different branches of the tree (assuming we go with point 1), or in different folders which are not in an immediate parent-child relationship, simultaneous updates should hopefully be no problem.

[Edit: If they do update the same SD, however, you can make all the changes locally and do a bulk update. Only if you were running that app from different machines you had logged into simultaneously (hopefully not an awfully common use case) would there be a chance of an SD conflict in the vaults, in which case the apps would have to re-post (or do some recovery). So it’s all left to the apps to optimise their various approaches.]

1 Like

When you say the same SD, you’re talking about the SD that contains the parent directory’s metadata, right? So I could update the content of multiple files located in the same folder without a problem, but if I add or remove a file from the directory then I can only do it one file at a time, correct (I mean using the NFS API)? And this is because the parent directory’s metadata, which contains the list of all its files, needs to be updated, correct?

In other words, I should use the low-level API instead of the NFS API in order to do bulk updates? I don’t remember who said it, but someone pointed out that this requires a level of understanding of the API’s underlying mechanisms that developers shouldn’t need to care about when working with it. It would be good if this were abstracted by the Launcher. Anyway, that’s not really relevant to the current discussion.

From a user and dev perspective, I would say bulk updates are a very important feature. I need to be able to create, update and delete multiple files in the most straightforward way, and in a timely manner, without having to know about the underlying workings of the network.

Also important: adding back the ability to update the content of a single file. Right now we need to delete it first and recreate it, which is not very efficient. I know it’s only a temporary situation though, no worries.

I’m honestly in over my head with this; I don’t have the experience to know which is best, but if it were my pet project I would go with the first one. It’s simple, proven to work, and it would probably make Safe’s file system more compatible with other tech.

I’m also not sure about the whole versioning thing. Are all files versioned by default, so that every time a change is made a new file is stored on the network and the old copy is kept? Do we have this feature because the network doesn’t know who owns a file, and since we can’t delete old files we might as well make it a feature and keep a link to them so they’re not completely wasted? I’m a big fan of turning a limitation into a feature, but if that’s the case it looks like the network will fill up with unused files. Is that what deletable ImmutableData is trying to fix? Which makes me think: is a file an ImmutableData?

Eheh sorry I’m bombarding you with questions, feel free to not answer any of it :slight_smile:

EDIT: I just realized I’m talking to two different people. Sorry if it sounds a bit confusing.

2 Likes

No, not with the current structure - because the DataMap, size, modified timestamp etc. (the things that change when contents change) are all part of the metadata stored with the parent folder. This is for performance reasons: if there are 100 files in a dir, you don’t have to do 100 GETs to retrieve their metadata. So you can only modify files in different dirs simultaneously.

However, if this is the kind of feature that would be greatly missed if absent, then what we can do is change:

struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

to

struct FileMetadata {
    name: String,
    created: Time,
    user_metadata: Vec<u8>,
    further_info: DataIdentifier, // Points to SD::data(FurtherInfo)
}

struct FurtherInfo {
    modified: Time,
    size: u64,
    data_map: DataMap,
}

That way, only the relatively unchanging parts remain in the metadata stored with the parent; the rest is offloaded elsewhere via a pointer.

Now you can edit and POST the contents of files in the same folder simultaneously too, at a slight overhead of indirection and hence performance. However, you still won’t be able to post changes to the name or user-metadata of files in the same folder simultaneously (that would have to be done serially). To allow that, you would need to offload the whole metadata to some other location, but that would have a significant performance impact: if a dir had 100 files, previously you could display all of them in one go on entering it (because the names and user-metadata {for special icons etc.} were all there), whereas now you would need to make 100 GETs merely for that, and then separate ones for the contents of each. So there is a trade-off.

What do you think about this ^^^ ?

That depends on the app - anything striving for performance should be well aware of its environment though. The Launcher cannot know what the app wants - whether it wants a bulk update, an immediate POST, or something else. But we can discuss it if you give more details on what kind of app you have in mind and what your expectations from the backend are.

2 Likes

There is no default at the moment - the API will ask explicitly.

This is a nice proposal - we can actually do this. However, the downside is that you will be keeping all the previous DataMaps with you - and (counter-intuitively) smaller files have larger DataMaps (if the file is stored inline). So 1000 versions of a file of a particular size could cost you 2 MiB of storage space (equivalent to roughly 20 StructuredDatas). Whereas if the user could opt out of versioning, they would have only one copy of the DataMap in the SD (every new copy overwriting the previous one) {== maybe only 1 KiB} and no other data lying around in the network (true for all files below a particular size, which are stored inline in the DataMap).

So it’s a tradeoff again - but interesting to discuss further.

I like this. Being able to update the content of multiple files simultaneously is a big improvement. Of course the drawback is that if someone wants to show all the files in a directory with their sizes, they need to send a request for each of them. I think it’s worth it though: when you navigate through directories the most important info is the file name, and the sizes can be retrieved gradually. I’m biased though, because I don’t need it. It doesn’t help with creating and deleting multiple files though.

[quote=“ustulation, post:15, topic:134”]
This is a nice proposal - we can actually do this. However, the downside is that you will be keeping all the previous DataMaps with you - and (counter-intuitively) smaller files have larger DataMaps (if the file is stored inline). So 1000 versions of a file of a particular size could cost you 2 MiB of storage space (equivalent to roughly 20 StructuredDatas). Whereas if the user could opt out of versioning, they would have only one copy of the DataMap in the SD (every new copy overwriting the previous one) {== maybe only 1 KiB} and no other data lying around in the network (true for all files below a particular size, which are stored inline in the DataMap).
[/quote]

Oh, I thought that’s how it was working. I’m not familiar with the inner workings, so I tend to extrapolate from what I know. So how exactly is the versioning done for files that are small enough to sit directly inside the DataMap, and for bigger files? Can you very briefly describe the process of a rollback, for example?

[quote=“ustulation, post:14, topic:134”]
That depends on the app - anything striving for performance should be well aware of its environment though. The Launcher cannot know what the app wants - whether it wants a bulk update, an immediate POST, or something else.
[/quote]

I’m talking more about a way for an app to inform the Launcher that it will make multiple similar requests, and to wait until they are all done before committing the changes to Safe - a bit like how you do transactions with a database. I’m not sure how that would translate to the Safe API. Or maybe just add new API endpoints for bulk requests.

In short, I’d like a way to create, update or delete multiple files without waiting for each request to complete before doing the next. Splitting the file metadata in two would work well for updating multiple files. Adding an API endpoint to delete multiple files could also work: this endpoint could take a directory path and a list of filenames to delete. Creating multiple files might be more complex, since I don’t think you could do that with only one API call. That’s why a transaction system might be more elegant, if possible.

Makes me wonder how the demo app does it. I guess it takes the long road and does each request one at a time…

If there are two simultaneous updates to files in a folder, must it result in a mutation error?

Presumably, the network knows that another update is occurring, which has essentially locked the SD, and that causes the mutation error. Couldn’t these changes be queued, letting the network arrive at eventual consistency (lazy write) once the queue has been processed? Is there a need for the file data and the metadata to change atomically? Could there be a read lock on the metadata when there is a queue of changes in flight, to avoid this issue?

I don’t know how the mechanics of this work under the bonnet, but I suspect answering these questions will be insightful.

1 Like

Ah right - this is already there, I think. I am not very well versed in the Launcher and app side of things, but I remember having seen a JSON payload sending an array of objects for some bulk update - so that is doable. Though on this I would rather let @Krishna or somebody else comment.

These should be easy (if not already there) - as long as it’s one request. So if you ask safe_core to delete two files, it can do that in a single POST and everything works. However, if you send two separate delete requests, one per file, simultaneously (or in close succession) without waiting for the first to succeed, then there is a problem: two POST requests, neither waiting for the other, would fly into the network and one would invalidate the other (assuming everything is under the same folder - otherwise they wouldn’t interfere anyway).

You can think of versioning as keeping a Vec<DataMap> instead of a single DataMap. That vector can grow indefinitely and cost storage, while unversioned simply means replacing the previous DataMap with the new one. Small files live inline inside the DataMap; large ones have their metadata (pre- and post-encryption hashes of the chunks) inside the DM but the chunks themselves elsewhere in the network. So a large file is sort of permanently there in the network (the chunks) as ImmutableData - and this is where I said always-on versioning could make sense. Small files, being inline in the DM, get overwritten each time the DM is overwritten in the StructuredData - and this is where I mentioned the caveat/trade-off.
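
In code terms, that mental model looks something like this (a sketch only, not the actual safe_core types):

// Sketch of the mental model described above, not the real types. DataMap is the
// self_encryption DataMap referred to throughout this thread.
enum FileContent {
    // Unversioned: each update simply overwrites the previous DataMap.
    Unversioned(DataMap),
    // Versioned: each update pushes another DataMap; the vector (and its storage
    // cost) grows without bound.
    Versioned(Vec<DataMap>),
}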

2 Likes

A vote for 1 here too BTW.

SVN does not track the history of directory changes (just file changes within a normalised path), and it is designed as a version control system. If that is sufficient for SVN, I suspect it will be sufficient for the SAFE network. If a more sophisticated versioning system is needed, I suspect it could be added at the app level.

1 Like

This will be a big and separate discussion - I would suggest a different thread for it. In short, things are signed, and vaults currently require the version field to be incremented on a POST. So if two POSTs arrive at a vault in succession with the same version, the second one will be rejected, because vaults cannot bump the version themselves to accept it: that would invalidate the signature (which the user made with their secret sign key, which the vaults of course don’t have).

[Edit: or, of course, you change the entire mechanism of StructuredData to resolve conflicts in other ways so that vaults could self-heal etc. - but those are big topics with significant performance and security implications and deserve a thread of their own.]
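
As a sketch of the acceptance rule described above (illustrative only, not the actual vault code):

// Illustrative only. A POST replacing a StructuredData is accepted when the owners'
// signatures check out and the version has been bumped by exactly one; a second POST
// still carrying the old version is therefore rejected.
fn post_accepted(stored_version: u64, incoming_version: u64, signatures_valid: bool) -> bool {
    signatures_valid && incoming_version == stored_version + 1
}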

2 Likes