Proposed changes in NFS

No way - I hope I didn’t give you that impression too :slight_smile: Though as developers we usually do want our stuff to be critically evaluated - that’s the whole point of putting it out in public, because if we don’t get a critical evaluation now, we will down the road, and refactors become exponentially more difficult as you amass sub-optimal designs/code etc. I guess we discuss among ourselves so much that we tend to become very terse, with the intention of avoiding vagueness and getting to the point, but it might come across as weird to others - we are not supposed to be sane anyway :smiley:

4 Likes

np, I wanted to check so by all means keep being clear/terse!

2 Likes

I didn’t realize folders were versioned like files are. That sounds like quite a complex feature to have; dropping it can only make the code simpler.

So if I understand correctly, these new changes will allow us to post and delete multiple files at the same time, correct? If that is the case, that’s a pretty big improvement.

2 Likes

Ok - i’ll happily count this as an upvote for Point/Approach 1 :slight_smile:

Yes - provided they don’t update the same SD. So if they are in different branches of the tree (assuming we go with point 1), or in different folders which are not immediate parent-child, simultaneous updates should hopefully be no problem.

[Edit: If they do update the same SD, however, then you can make all changes locally and do a bulk update. Only if you were running that app from different machines you have logged into simultaneously (hopefully not an awfully common use-case) will there be a chance of an SD conflict in vaults, in which case apps would have to repost (or do some recovery). So it’s all left to the apps to optimise between the various approaches.]

1 Like

When you say the same SD, you mean the SD that contains the parent-directory metadata, right? So I could update the content of multiple files located in the same folder without problem, but if I add or remove a file from the directory then I can only do it one file at a time, correct (I mean using the NFS API)? And this is because the parent-directory metadata, which contains the list of all its files, needs to be updated, correct?

In other words, I should use the low-level API instead of the NFS API in order to do bulk updates? I don’t remember who said it, but someone pointed out that this requires a level of understanding of the underlying mechanism of the API that developers shouldn’t need to care about when working with it. It would be good if this was abstracted by the Launcher. Anyway, that’s not really relevant to the current discussion.

From a user and dev perspective I would say bulk updates are a very important feature. I need to be able to create, update and delete multiple files in the most straightforward way, and also in a timely manner, without having to know about the underlying workings of the network.

Adding back the ability to update the content of a single file would also help. Right now we need to delete it first and recreate it, which is not very efficient. I know it’s only a temporary situation though, no worries.

I’m honestly in over my head with this; I don’t have the experience to know which is best, but if it was my pet project I would go with the first one. It’s simple, proven to work, and it would probably make Safe’s file system more compatible with other tech.

I’m also not sure about the whole versioning thing. Are all files versioned by default? So every time a change is made, a new file is stored on the network and the old copy is kept? Do we have this feature because the network doesn’t know who owns a file, and since we can’t delete old files we might as well make it a feature and keep a link to them so they’re not completely wasted? I’m a big fan of turning a limitation into a feature, but if that’s the case it looks like the network will be filled with unused files. Is that what deletable immutable data is trying to fix? Which makes me think, is a file an immutable data?

Eheh, sorry I’m bombarding you with questions, feel free to not answer any of them :slight_smile:

EDIT: I just realized I’m talking to two different people. Sorry if it sounds a bit confusing.

2 Likes

No, not with the current structure - because the DataMap, size, modified timestamp etc. (the things that change when contents change) are all part of the metadata stored with the parent folder. This is for performance reasons: if there are 100 files in a dir, you don’t have to do 100 GETs to retrieve the metadata. So you can only modify files in different dirs simultaneously.

However, if this is a feature that would be greatly missed if absent, then what we can do is change:

// All of this, including the frequently changing size/modified/data_map,
// lives inline with the parent directory's listing.
struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

to

// Relatively unchanging parts stay inline with the parent directory's listing.
struct FileMetadata {
    name: String,
    created: Time,
    user_metadata: Vec<u8>,
    further_info: DataIdentifier, // Points to SD::data(FurtherInfo)
}

// Frequently changing parts are offloaded to a separate SD, one per file.
struct FurtherInfo {
    modified: Time,
    size: u64,
    data_map: DataMap,
}

That way, only the relatively unchanging parts remain in the metadata stored with the parent; the rest is offloaded via a pointer to elsewhere.

Now you can edit and POST the contents of files in the same folder simultaneously too, at a slight overhead of indirection and hence performance. However you won’t be able to POST changes to the name or user-metadata of files in the same folder simultaneously (that would have to be done serially). To allow that you would need to offload the whole metadata to some other location, but this would have a significant performance impact. If a dir had 100 files, previously you could display all of them in one go on entering it (because the name and user-metadata {for e.g. special icons etc.} were all there). Now you would need to make 100 GETs merely for that, and then separate ones for the contents of each. So there is a trade-off.
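To illustrate the GET pattern this implies - a minimal sketch only, with placeholder types and a stubbed network call, not actual safe_core code:

// Placeholder identifier of the SD holding a file's FurtherInfo.
#[derive(Clone)]
struct DataIdentifier(u64);

// Only the relatively unchanging parts stay inline with the parent directory.
#[derive(Clone)]
struct FileMetadata {
    name: String,
    further_info: DataIdentifier,
}

// The frequently changing parts now live behind the pointer.
struct FurtherInfo {
    size: u64,
}

// Stub for a network GET of a FurtherInfo SD (one round trip per call).
fn get_further_info(_id: &DataIdentifier) -> FurtherInfo {
    FurtherInfo { size: 0 }
}

// Names (and user-metadata) still come from a single GET of the parent's listing.
fn list_names(listing: &[FileMetadata]) -> Vec<String> {
    listing.iter().map(|m| m.name.clone()).collect()
}

// Sizes/modified/data-maps now cost one extra GET per file.
fn list_with_sizes(listing: &[FileMetadata]) -> Vec<(String, u64)> {
    listing
        .iter()
        .map(|m| (m.name.clone(), get_further_info(&m.further_info).size))
        .collect()
}

fn main() {
    let listing = vec![
        FileMetadata { name: "a.txt".into(), further_info: DataIdentifier(1) },
        FileMetadata { name: "b.txt".into(), further_info: DataIdentifier(2) },
    ];
    println!("{:?}", list_names(&listing));      // 1 GET total (the parent dir)
    println!("{:?}", list_with_sizes(&listing)); // 1 + N GETs for N files
}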

What do you think about this ^^^ ?

That depends on the app - anything striving for performance should be well aware of its environment though. The Launcher cannot know what the app wants - whether it wants a bulk update, an immediate POST or something else. But we can discuss it if you give more details on what kind of app you have in mind and what your expectations from the backend are.

2 Likes

There is no default a.t.m. - the API will ask explicitly.

This is a nice proposal - we could actually do this. However the downside is that you will be keeping all the previous data-maps with you - and (counter-intuitively) smaller files have larger data-maps (if the file is stored inline). So 1000 versions of a file of a particular size could cost you 2 MiB of storage space (equivalent to roughly 20 StructuredDatas). Whereas if the user could opt out of versioning, (s)he would have only one copy of the DataMap in the SD (every new copy would overwrite the previous one) {== only 1 KiB maybe} and no other data lying around in the network (true for all files smaller than a particular size, which are stored inline in the DataMap).
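Back-of-envelope with the figures above (the per-data-map size is an assumption for illustration, not an actual network constant):

fn main() {
    // Assumed size of one inlined DataMap for a small file (~2 KiB).
    let data_map_kib = 2u64;
    let versions = 1000u64;

    // Versioned: every previous DataMap is kept in the SD.
    let versioned_kib = data_map_kib * versions; // ~2000 KiB, i.e. ~2 MiB
    // Unversioned: the new DataMap overwrites the previous one.
    let unversioned_kib = data_map_kib;          // only the latest copy

    println!("versioned:   ~{} KiB (~{} MiB)", versioned_kib, versioned_kib / 1024);
    println!("unversioned: ~{} KiB", unversioned_kib);
}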

So it’s a tradeoff again - but interesting to discuss further.

I like this. Being able to update the content of multiple files simultaneously is a big improvement. Of course the drawback is that if someone wants to show all the files in a directory with their sizes, they need to send a request for each of them. I think it’s worth it though. When you navigate through directories the most important info is the name of the file. The size can be retrieved gradually. I’m biased though because I don’t need it. Doesn’t help for creating and deleting multiple files though.

[quote=“ustulation, post:15, topic:134”]
This is a nice proposal - we can do this actually. However the downside is you will be keeping all the previous data-maps with you - and (counter-intuitively) smaller files have larger data-map (if the file is stored inline). So 1000 versions of a file of particular size could cost you 2 MiB storage space - (equivalent of roughly 20 StructuredData’s). Whereas if user could opt in to not version then (s)he would have only one copy of DataMap in the SD (every new copy would overwrite the previous one) {== only 1KiB maybe} and no other data lying around in the network (true for all files less than a particular size which are stored inline in DataMap).
[/quote]

Oh, I thought that’s how it was working. I’m not familiar with the inner workings so I tend to extrapolate from what I know. So how is the versioning done exactly for files that are small enough to sit directly inside the data-map, and for bigger files? Can you very briefly describe the process of a rollback, for example?

[quote=“ustulation, post:14, topic:134”]
That depends on the app - anything striving for performance should be well aware of it’s environment though. Launcher cannot know what the app wants - if it wants a bulk update or immediate post or something else etc.[/quote]

I’m talking more about a way for an app to inform the Launcher that it will do multiple similar requests and to wait until they are all done before committing the changes to Safe. A bit like how you do transactions with a database. Not sure how it would translate to the Safe API. Or maybe just add new API endpoints for bulk requests.

In short, I’d like a way to create, update or delete multiple files without waiting for each request to complete before doing the next. Splitting the file metadata into two would work well for updating multiple files. Adding an API endpoint to delete multiple files could also work; this endpoint could take a directory path and a list of filenames to delete. Creating multiple files might be more complex since I don’t think you could do that with only one API call. That’s why a transaction system might be more elegant, if possible.
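To make that concrete, something along these lines is what I have in mind - purely hypothetical, not an existing Launcher endpoint:

// Hypothetical payload for a bulk-delete endpoint: one request, one POST of the
// parent directory's SD, regardless of how many files are removed.
struct BulkDeleteRequest {
    dir_path: String,        // parent directory, e.g. "/photos/2016"
    file_names: Vec<String>, // files to drop from that directory's listing
}

fn main() {
    let req = BulkDeleteRequest {
        dir_path: "/photos/2016".to_string(),
        file_names: vec!["a.jpg".to_string(), "b.jpg".to_string()],
    };
    println!("delete {} files under {}", req.file_names.len(), req.dir_path);
}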

Makes me wonder how the demo app does it. I guess it takes the long road and does each request one at a time…

If there are two simultaneous updates to files in a folder, must it result in a mutation error?

Presumably, the network knows that another update is occurring, which has essentially locked the SD, and that causes the mutation error. Can’t these changes be queued, letting the network arrive at eventual consistency (lazy write) once the queue has been processed? Is there a need for the file data and the metadata to change atomically? Could there be a read lock on the metadata when there is a queue of changes in flight, to avoid this issue?

I don’t know how the mechanics of this work under the bonnet, but I suspect answering these questions will be insightful.

1 Like

Ah right - this is already there, I think. I am not very well-versed with the Launcher and app side of things, but I remember having seen a JSON payload sending an array of objects for some bulk update - so that is doable. Though on this I would rather let @Krishna or somebody else comment.

These should be easy (if not already there) - as long as it’s one request. So if you ask safe_core to delete two files, it can do that in a single POST and everything works. However if you send two different simultaneous (or closely spaced) delete requests, one for each file, without waiting for the first to succeed, then there is a problem: two POST requests, each unaware of the other, would fly into the network and one would invalidate the other (assuming everything’s under the same folder - otherwise they wouldn’t interfere anyway).

You can think of versioning as keeping a Vec<DataMap> instead of a single DataMap. That vector can grow indefinitely and cost storage, while unversioned would simply mean replacing the previous DataMap with the new one. Small files are stored inline inside the DataMap; large ones have metadata (pre- and post-encryption hashes of chunks) inside the DM but the chunks live elsewhere in the network. So a large file is sort of permanently there in the network (the chunks) as ImmutableData - and this is where I said always-on versioning could make sense. Small files, being inline with the DM, will get overwritten each time the DM is overwritten in the StructuredData - and this is where I mentioned the caveat/trade-off.
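Sketched out below - illustrative types only, not the actual self_encryption / safe_core definitions:

// Per-chunk info kept in the DataMap for large files; the chunks themselves are
// stored on the network as ImmutableData.
#[allow(dead_code)]
struct ChunkInfo {
    pre_hash: [u8; 32],  // hash before encryption
    post_hash: [u8; 32], // hash after encryption (names the ImmutableData chunk)
}

#[allow(dead_code)]
enum DataMap {
    Inline(Vec<u8>),        // small file: content lives inside the DataMap itself
    Chunks(Vec<ChunkInfo>), // large file: only chunk info here, chunks elsewhere
}

// Unversioned: the SD holds one DataMap, overwritten on every update.
#[allow(dead_code)]
struct UnversionedFile {
    data_map: DataMap,
}

// Versioned: the SD keeps the whole history, so it grows with every update.
struct VersionedFile {
    versions: Vec<DataMap>,
}

fn main() {
    let mut file = VersionedFile { versions: Vec::new() };
    file.versions.push(DataMap::Inline(b"v1".to_vec()));
    file.versions.push(DataMap::Inline(b"v2".to_vec())); // the old inline copy is retained
    println!("{} versions retained", file.versions.len());
}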

2 Likes

A vote for 1 here too BTW.

SVN does not track the history of directory changes (just file changes within a normalised path), and it is designed as a version control system. If that is sufficient for it, I suspect it will be sufficient for SAFE net. If a more sophisticated versioning system is needed, I suspect it could be added at the app level.

1 Like

This will be a big and separate discussion - I would suggest a different thread for it. In short, things are signed, and POSTs currently require the version field to be incremented, enforced by vaults. So if two POSTs arrive at a vault in succession with the same version, the second one will be rejected, because vaults cannot bump the version themselves to accept it - that would invalidate the signature (which the user made with his secret sign key, which vaults of course don’t have).
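A minimal sketch of that check, assuming the behaviour described above - placeholder types, not actual vault code:

#[derive(Debug)]
struct StructuredData {
    version: u64,
    data: Vec<u8>,
    signature: Vec<u8>, // made with the owner's secret sign key
}

// Placeholder: real vaults verify the signature against the owner's public key.
fn signature_is_valid(_sd: &StructuredData) -> bool {
    true
}

// Vaults accept a POST only if it bumps the stored version by exactly one; they
// cannot bump it themselves, as that would invalidate the owner's signature.
fn handle_post(stored: &mut StructuredData, incoming: StructuredData) -> Result<(), &'static str> {
    if !signature_is_valid(&incoming) {
        return Err("invalid signature");
    }
    if incoming.version != stored.version + 1 {
        return Err("version mismatch - POST rejected");
    }
    *stored = incoming;
    Ok(())
}

fn main() {
    let mut stored = StructuredData { version: 3, data: vec![], signature: vec![] };
    // Two clients both saw version 3 and both POST version 4: only the first wins.
    let first = StructuredData { version: 4, data: b"a".to_vec(), signature: vec![] };
    let second = StructuredData { version: 4, data: b"b".to_vec(), signature: vec![] };
    println!("{:?}", handle_post(&mut stored, first));  // Ok(())
    println!("{:?}", handle_post(&mut stored, second)); // Err("version mismatch - POST rejected")
}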

[Edit: or of course you change the entire mechanism of StructuredData to resolve conflicts in other ways so that vaults could self-heal etc. - but those are big topics with great performance and security implications and deserve a thread of their own]

2 Likes

My vote is for 3, or a variant of it.

I think we could implement a simplified approach where we do version directories (it would be easiest if SD were updated to accommodate these requirements) and don’t version files. Regarding the two problems/questions about this approach:

Root would be unaware of dir-b at all times wouldn’t it?

In the same way that Git disallows (easily) rewriting history, if we disallow branching of the SD, then the answer is “no”. We’d effectively be making all versions of a given dir read-only except the latest one.

I agree that it’s probably too time-consuming to implement an optimal solution right now where we store deltas for each version, but we could relatively easily take the wasteful approach of storing each version as a complete dir listing and when we have time we could optimise this. The changes at that stage should be transparent to the end users (except for an improvement in performance) - the behaviour and API should remain the same.

I also think that we could easily support branching of the history by changing how SD works, but we wouldn’t need to expose this via the NFS API. NFS could continue to not support rewriting history and not allow conflicting versions of a given file to coexist, but the low-level API exposing raw SD operations would enable apps to support these things.

Another disadvantage of only versioning files (which I don’t think is mentioned?) is that a file which gets removed from a dir cannot be recovered. I would think recovering deleted files would be one of the more desirable features of a versioned filesystem?

Versioning dirs would avoid this issue, and I think that it would also encourage batch changes which should be easier for the network to handle (e.g. if we’re adding 100 files to a dir, it would be cheaper for the user and easier for the network if that was put as a single version change to the dir rather than 100 individual changes).

2 Likes

Both are good points - should be noted.

This sort of scares me a little - especially given the decentralised nature and all - how would things affect performance?

If you have something (like google-docs?) that saves changes often (flushing to the network as things could be shareable), generating a new dir-listing along with new chunks (of ImmutableData, if the file was larger than what is contained inline in the DM) can cause a lot of data floating around the network - how would that play with churn? You could design around pointers and indirection, but that would cause performance losses, and unless pseudo-something is fleshed out it is difficult to see how much those things would affect churn, performance etc.

Consider a snapshot that looks like: root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version? What would you expect? I would for instance (and maybe I am wrong in this) expect this to be reflected in different versions of root, and pulling a version should restore the tree snapshot of that point - because in my head, a dir is not just what’s immediately under it, but the entire subtree under it.

1 Like

OTOH, thinking a little more - maybe not. Again taking google-docs (which is versioned AFAIK): a doc cannot be recovered if you deleted it (from the bin too). While in git you can go back to a commit which had the file you later deleted. IDK which is in more demand - but here’s another way to look at it: provide simpler, more measurable stuff while allowing apps to build more complex things if required? How would that pan out?

Why not leave the ability to go back to any old version up to an APP which copies the datamap of each version and records it against that version? This then allows the ability to get back any copy of the file without putting any burden on the structure or SAFE code. It only needs to be able to copy the datamap.

1 Like

Yes, I would prefer this too. To be more precise:

[quote=“ustulation, post:23, topic:134”]
but here’s another way to look at it - provide a simpler, more measurable stuffs while allowing app’s to build more complex things if required ? How would that pan out ?
[/quote]

By this logic, NFS would do just the file versioning mentioned in OP-point-1. Apps could use the low-level APIs (we will make them more powerful if need be) to design whatever system they want, however complicated they want it. Other apps can then choose whether their requirements are fulfilled by what the NFS APIs provide, or whether they need more granularity and want to use those other apps as a base (e.g. if someone builds git-like stuff on SAFE).

Is that what you meant too ?

1 Like

I would expect it to improve performance. I don’t know how Git works under the hood, but at a very basic level I’d expect the optimisation to be something like changing from storing the complete dir listing each time to only storing a diff (if the diff is smaller than the whole updated dir listing), with a full listing stored every so often as a checkpoint (so we’re not having to apply a gazillion diffs to an antique version). So the frequency of version PUTs wouldn’t change - only the size of the data being put in each version.
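Roughly what I mean, as a minimal sketch (an assumed design, not an existing implementation):

#[derive(Clone, Debug)]
enum DirVersion {
    // Checkpoint: a complete listing of the directory at this version.
    Full(Vec<String>),
    // Delta: only what changed relative to the previous version.
    Diff { added: Vec<String>, removed: Vec<String> },
}

// Rebuild the listing at `version` by replaying history; each Full entry resets
// the listing, so in effect we apply diffs on top of the latest checkpoint.
fn listing_at(history: &[DirVersion], version: usize) -> Vec<String> {
    let mut listing = Vec::new();
    for entry in &history[..=version] {
        match entry {
            DirVersion::Full(names) => listing = names.clone(),
            DirVersion::Diff { added, removed } => {
                listing.retain(|name| !removed.contains(name));
                listing.extend(added.iter().cloned());
            }
        }
    }
    listing
}

fn main() {
    let history = vec![
        DirVersion::Full(vec!["a.txt".into()]),
        DirVersion::Diff { added: vec!["b.txt".into()], removed: vec![] },
        DirVersion::Diff { added: vec![], removed: vec!["a.txt".into()] },
    ];
    println!("{:?}", listing_at(&history, 2)); // ["b.txt"]
    println!("{:?}", listing_at(&history, 1)); // ["a.txt", "b.txt"]
}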

I agree here, but I’d argue our NFS API wouldn’t be best suited for this use-case. For something like that, I’d advocate using the low-level API and interacting directly with SDs (which I think can be designed to support this use-case - within reason).

[quote=“ustulation, post:22, topic:134”]
Consider a snapsot that looks like: root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version ? What would you expect ?
[/quote]

I’d expect not (assuming we’re not doing something like updating a “last-modified” timestamp for dir-a).

That shouldn’t be affected. I’d expect the latest version of root to show one subdir - namely dir-a. That’s the same situation for both snapshots. Subsequently pulling dir-a would give the user the choice of version n (latest) with no dir-b or version n-1 with dir-b showing.

Do have a look here too. What do you think of that ?

Not sure how one is more measurable than the other, and I don’t see one as significantly simpler than the other (assuming we disallow branching of history). We’re either storing versions of a given file as an SD containing the metadata and datamap for that file, or we’re storing versions of a given dir containing the contents of that dir.