Proposed changes in NFS

Ah right - this is already there I think. I am not very well-versed with the Launcher and app side of things, but I remember having seen a JSON payload sending an array of objects for some bulk update - so that is doable. Though on this I would rather let @Krishna or somebody else comment.

These should be easy (if not already there) - as long as it’s one request. So if you ask safe_core to delete two files, it can do it and POST, and everything works. However, if you send two separate delete requests, one for each file, simultaneously (or in close succession) without waiting for the first to succeed, then there is a problem: two POST requests, each made without waiting for the other, would fly into the network and one would invalidate the other (assuming everything’s under the same folder - otherwise they wouldn’t interfere anyway).

You can think of versioning as keeping a Vector<DataMaps> instead of a single DataMap. That vector can grow indefinitely and costs storage, while unversioned would simply mean replacing the previous DataMap with the new one. Small files would be inline inside the DataMap; large ones would have metadata (pre- and post-encryption hashes of the chunks) inside the DM but the chunks elsewhere in the network. So a large file is sort of permanently there in the network (the chunks) as ImmutableData - and this is where I said always-on versioning could make sense. Small files, being inline with the DM, will get overwritten each time the DM is overwritten in StructuredData - and this is where I mentioned the caveat/trade-off.
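To make that concrete, here is a minimal sketch of the trade-off (my own illustration - the names are placeholders, not the actual safe_core types):

// Illustrative sketch only - not the real types.
// A DataMap either holds a small file's content inline, or just the chunk
// metadata (pre/post-encryption hashes) while the chunks themselves live
// on the network as ImmutableData.
enum DataMap {
    Inline(Vec<u8>),
    Chunks(Vec<ChunkMetadata>),
}

struct ChunkMetadata {
    pre_encryption_hash: Vec<u8>,
    post_encryption_hash: Vec<u8>,
}

// Versioned keeps every DataMap ever written (grows indefinitely, costs storage);
// Unversioned simply replaces the previous DataMap with the new one.
enum FileContent {
    Versioned(Vec<DataMap>),
    Unversioned(DataMap),
}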

2 Likes

A vote for 1 here too BTW.

SVN does not track the history of directory changes (just file changes within a normalised path), and it is designed as a version control system. If that is sufficient for it, I suspect it will be sufficient for SAFE net. If a more sophisticated versioning system is needed, I suspect it could be added at the app level.

1 Like

This will be a big and separate discussion - I would suggest a different thread for it. In short, things are signed, and POSTs currently require the version field to be incremented, which vaults enforce. So if two POSTs come to a vault in succession with the same version, the second one will be rejected, because vaults cannot bump the version themselves to accept it, as that would invalidate the signature (which the user made with his secret sign key, and vaults of course don’t have this).

[Edit: or of course you could change the entire mechanism of structured data to resolve conflicts in other ways so that vaults could self-heal etc. - but those are big topics with great performance and security implications and deserve a thread of their own]
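As a rough illustration of the rule described above (not the actual vault code - just the shape of the check):

// Illustrative only. A POST replacing a piece of StructuredData is accepted
// only if its version is exactly one greater than the stored version and the
// payload's signature verifies against the owner's public sign key. A vault
// cannot bump the version itself, since that would invalidate the signature.
fn validate_post(stored_version: u64, new_version: u64, signature_valid: bool)
                 -> Result<(), &'static str> {
    if !signature_valid {
        return Err("signature check failed - rejected");
    }
    if new_version != stored_version + 1 {
        return Err("version conflict - rejected");
    }
    Ok(())
}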

2 Likes

My vote is for 3, or a variant of it.

I think we could implement a simplified approach where we do version directories (would be easiest if SD were updated to accommodate these requirements) and don’t version files. Regarding the two problems/questions about this approach:

Root would be unaware of dir-b at all times wouldn’t it?

In the same way that Git disallows (easily) rewriting history, if we disallow branching of the SD, then the answer is “no”. We’d effectively be making all versions of a given dir read-only except the latest one.

I agree that it’s probably too time-consuming to implement an optimal solution right now where we store deltas for each version, but we could relatively easily take the wasteful approach of storing each version as a complete dir listing and when we have time we could optimise this. The changes at that stage should be transparent to the end users (except for an improvement in performance) - the behaviour and API should remain the same.

I also think that we could easily support branching of the history by changing how SD works, but we wouldn’t need to expose this via the NFS API. NFS could continue to not support rewriting history and not allow conflicting versions of a given file to coexist, but the low-level API exposing raw SD operations would enable apps to support these things.

Another disadvantage of only versioning files (which I don’t think is mentioned?) is that a file which gets removed from a dir cannot be recovered. I would think recovering deleted files would be one of the more desirable features of a versioned filesystem?

Versioning dirs would avoid this issue, and I think that it would also encourage batch changes which should be easier for the network to handle (e.g. if we’re adding 100 files to a dir, it would be cheaper for the user and easier for the network if that was put as a single version change to the dir rather than 100 individual changes).

2 Likes

Both are good points - should be noted.

This sort of scares me a little - especially given the decentralised nature and all - how would things affect performance?

If you have something (like google-docs?) that saves changes often (flushing to the network, as things could be shareable), generating a new dir-listing along with new chunks (of ImmutableData, if the file was larger than what can be contained inline in the DM) can cause a lot of data floating on the network - how would that play with churn? You could design around pointers and indirections, but that would cause performance losses, and unless pseudo-something is fleshed out it is difficult to see how much those things would affect churn, performance etc.

Consider a snapshot that looks like: root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version? What would you expect? I would for instance (and maybe I am wrong in this) expect this to be reflected in different versions of root, and pulling a version should restore the tree snapshot of that point - because in my head, a dir is not just what’s immediately under it, but the entire subtree under it.

1 Like

OTOH thinking a little more - maybe not. Again taking google-docs (which is versioned AFAIK) - a doc cannot be recovered if you deleted it (from the bin too), while in git you can go back to a commit which had the file you later deleted. IDK which is in more demand - but here’s another way to look at it - provide simpler, more measurable stuff while allowing apps to build more complex things if required? How would that pan out?

Why not leave the ability to go back to any old version up to an APP, which copies the datamap of each version and records it against that version? This then allows the ability to get back any copy of the file without putting any burden on the structure or SAFE code. It only needs to be able to copy the datamap.
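As a rough sketch of what such an app could do (illustrative only - `DataMap` here is just a stand-in for whatever the API hands back for a file’s contents):

// Illustrative app-level version registry: copy the file's DataMap on every
// change and record it against a version number; restoring an old copy is
// just a lookup.
use std::collections::BTreeMap;

struct DataMap; // stand-in for the real DataMap type

struct FileHistory {
    versions: BTreeMap<u64, DataMap>, // version number -> copied DataMap
    next_version: u64,
}

impl FileHistory {
    fn record(&mut self, data_map: DataMap) -> u64 {
        let v = self.next_version;
        self.versions.insert(v, data_map);
        self.next_version += 1;
        v
    }

    fn restore(&self, version: u64) -> Option<&DataMap> {
        self.versions.get(&version)
    }
}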

1 Like

Yes, I would prefer this too. To be more precise: [quote=“ustulation, post:23, topic:134”]
but here’s another way to look at it - provide simpler, more measurable stuff while allowing apps to build more complex things if required? How would that pan out?
[/quote]

By this logic, NFS would do just the file versioning mentioned in OP-point-1. Apps could use the low-level APIs (we will make them more powerful if need be) to design whatever system they want, however complicated they want it. Other apps can choose whether their requirement is fulfilled by what’s provided by the NFS APIs, or whether they require more granularity and want to use those other apps to provide the base (e.g. if someone builds git-like stuff on SAFE).

Is that what you meant too?

1 Like

I would expect it to improve performance. I don’t know how Git works under the hood, but at a very basic level I’d expect the optimisation to be something like changing from storing the complete dir listing each time to only storing a diff (if the diff is smaller than the whole updated dir listing) with a full listing stored every so often as a checkpoint (so we’re not having to apply a gazillion diffs to an antique version). So the frequency of version puts wouldn’t change - only the size of the data being put in each version.
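A minimal sketch of that diff-plus-checkpoint idea (purely illustrative - `Entry` stands in for whatever a dir listing holds):

// Illustrative only. Each stored version is either a full listing (checkpoint)
// or a delta against the previous version; reading a version means starting
// from the nearest checkpoint and applying deltas forward.
type Entry = String; // stand-in for a file or sub-dir record

enum DirVersion {
    Checkpoint(Vec<Entry>),                           // complete listing
    Delta { added: Vec<Entry>, removed: Vec<Entry> }, // change since previous version
}

fn reconstruct(history: &[DirVersion], target: usize) -> Vec<Entry> {
    // find the latest checkpoint at or before `target`
    let start = (0..=target)
        .rev()
        .find(|&i| matches!(history[i], DirVersion::Checkpoint(_)))
        .expect("version 0 is always a checkpoint");
    let mut listing = match &history[start] {
        DirVersion::Checkpoint(entries) => entries.clone(),
        _ => unreachable!(),
    };
    // then apply the deltas up to the requested version
    for v in &history[start + 1..=target] {
        if let DirVersion::Delta { added, removed } = v {
            listing.retain(|e| !removed.contains(e));
            listing.extend(added.iter().cloned());
        }
    }
    listing
}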

I agree here, but I’d argue our NFS API wouldn’t be best suited for this use-case. For something like that, I’d advocate using the low-level API and interacting directly with SDs (which I think can be designed to support this use-case - within reason).[quote=“ustulation, post:22, topic:134”]
Consider a snapshot that looks like: root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version? What would you expect?
[/quote]

I’d expect not (assuming we’re not doing something like updating a “last-modified” timestamp for dir-a).

That shouldn’t be affected. I’d expect the latest version of root to show one subdir - namely dir-a. That’s the same situation for both snapshots. Subsequently pulling dir-a would give the user the choice of version n (latest) with no dir-b or version n-1 with dir-b showing.

Do have a look here too. What do you think of that?

Not sure how one is more measurable than the other, and I don’t see one as significantly simpler than the other (assuming we disallow branching of history). We’re either storing versions of a given file as an SD containing the metadata and datamap for that file, or we’re storing versions of a given dir containing the contents of that dir.

Basically. Along the lines of: keep it simple and provide the functionality for APPs to do the extras whichever way the APP builder wants.

1 Like

You only had to give me about another 10 seconds to post my reply! Are you watching me type so you can ninja me? :smiley:

1 Like

We moved 1430 miles in those 10 sec through the universe - no joke man :smiley:

OK, I just checked google-drive (merely due to convenience - feel free to point out other examples) - it doesn’t seem to work the way you say.

  1. A version of a directory shows the entire snapshot of the sub-tree at that point in time, even if deep down something was removed etc.

  2. (This I already mentioned though) - you don’t seem to be able to restore something you deleted permanently (it goes into the bin first and then you delete it from there).

My point being - what we think of as versioning may not be what ppl expect. Just go with the simplest case and let others build on top of it if required.

For this you are OK with the additional cost of 1 SD per file (for indirection), right? That buys you the ability for simultaneous updates to different files in the same folder (as opposed to bulk updates, which will be possible without it too). @DavidMtl @Traktion (since you both liked the post - I take it as agreement with it :slight_smile: )

  1. Those are not big data, so most of the 100 KiB (and maybe more if expanded) space of an SD will be wasted on that.

  2. It will lead to slightly increased complication when you consider sharing - you’ll have to traverse all the file indirections to structured data and modify those to include owners as well. So previously, if there were root/a/b with each having 1000 files under them, you would need to add owners to only 3 SDs. With the above change you would need to add/remove/manipulate owners on 3003 SDs.

Maybe there are more I haven’t noticed.

But is everyone ok with that ^^^ ?

I guess I’m thinking of Git, Subversion, Apple’s Time Machine, MS’s System Restore for example.

So, to be clear, say with your example using root, dir-a and dir-b, currently we only have root and dir-a, but yesterday we also had dir-b. If we view the current tree, we see only root and dir-a - there’s no mention of dir-b. If we view a snapshot from yesterday, we see all three dirs. Is that what you’re saying?

If so, that’s what I’m saying would happen with dir versioning too.

Right - but for the other examples you can restore deleted items.

Agreed. (We’re probably both wrong!)

I don’t agree that file versioning is simpler than dir versioning; I think they’re comparable in complexity. However I’m somewhat convinced that dir versioning could be cheaper for the users and easier for the network to handle.

2 Likes

I’m not :slight_smile:

Obviously my main concern is what we’re debating in the concurrent discussion (where I prefer dir versioning to avoid the cost of 3003 SDs), but other than that, I think it would be worthwhile adding a Vec<u8> to FurtherInfo too, so that if the user has opaque data which needs to be changed when the file’s contents change, the user isn’t forced to update the FileMetadata.
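Roughly this (illustrative only - the real FurtherInfo fields are in the OP, and `opaque_app_data` is just my placeholder name):

struct FurtherInfo {
    // ...existing fields from the OP...
    opaque_app_data: Vec<u8>, // app-defined data that can change with the file's
                              // contents without forcing a FileMetadata update
}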

Me neither. It is more complicated and decreases performance - but I’m waiting for @DavidMtl’s and @Traktion’s opinions.

So should this not discourage us from providing dir-versioning? File versioning can be interpreted in only one way, while for dirs there seems to be divergence.

Can you give a modification of the types if we were to go the dir-versioning way?

A quick modification of the structure in the OP link would look like this:

/// `StructuredData::data(encrypted(serialised(DirListing)));`
/// where:
// =================================
enum DirListing {
    Versioned(Vec<Dir>),  // a full `Dir` listing kept per version
    Unversioned(Dir),     // latest listing only; replaced on each change
}

struct Dir {
    sub_dirs: Vec<DirMetadata>,
    files: Vec<FileMetadata>,
}

// =================================
// If shared ppl can update this independently of each other, attaching it to any existing tree they have.
struct DirMetadata {
    locator: DataIdentifier, // DataIdentifier::Structured(UNVERSIONED_SD_TYPE_TAG, XorName)
    encrypt_key: Option<secretbox::Key>,
    name: String,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
}

// =================================
struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

You can take it from there ^^^ if you want.
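For what it’s worth, a small usage sketch against those types (illustrative only) - appending a snapshot for a versioned listing versus overwriting for an unversioned one:

// Push a new snapshot onto a versioned listing, or replace the listing
// in place for an unversioned one.
fn update_listing(listing: &mut DirListing, new_dir: Dir) {
    match listing {
        DirListing::Versioned(history) => history.push(new_dir), // keep every version
        DirListing::Unversioned(current) => *current = new_dir,  // overwrite
    }
}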

(Emphasis mine)

Honestly I’m divided now.

I like splitting the metadata because it allows multiple people to modify, simultaneously, different files located in the same directory. But they won’t be able to create or delete files simultaneously, so it only fixes half of the problem.

I don’t like the extra cost in SDs. By taking this route we just doubled the amount of addressable data stored on Safe. Is it a problem? I don’t know, but I’d like the network to stay as lean as possible.

If bulk updates are possible by using the low-level API, that might work out OK. So I would be manipulating SDs directly. For example, update the content of all files first and then modify the list of files in the parent directory manually. Though this means the operation is not atomic, which might be a problem if two people are trying to do it at the same time. Maybe adding a locking system would be good, but then someone could keep it locked by accident and you would have to create a way to handle that too.
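Roughly, the sequence being described might look like this (the helper names `encrypt_to_data_map` and `post_dir_listing` are made up for illustration - this is not the actual low-level API):

// Hypothetical two-step bulk update: write each file's new content first
// (step 1), then rewrite the parent dir listing once with a single POST
// (step 2). The two steps together are not atomic - another client can
// interleave between them.
fn encrypt_to_data_map(_content: &[u8]) -> DataMap { unimplemented!() } // stand-in
fn post_dir_listing(_dir: &Dir) { unimplemented!() }                    // stand-in

fn bulk_update(mut parent: Dir, new_contents: Vec<(String, Vec<u8>)>) {
    for (file_name, content) in new_contents {
        let new_map = encrypt_to_data_map(&content); // step 1: per-file content
        if let Some(meta) = parent.files.iter_mut().find(|f| f.name == file_name) {
            meta.size = content.len() as u64;
            meta.data_map = new_map;
        }
    }
    post_dir_listing(&parent); // step 2: one POST for the whole listing
}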

Maybe what is needed is to keep in mind that Safe is a new beast and trying to do things the way we are used to might not be the best approach, like trying to fit a square peg into a round hole. Maybe we need more experience to really grasp its natural limitations and not try to fix them by adding complexity, but instead learn to work around them.

I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.

This ^^^ - there are so many things we continue to un-learn.[quote=“DavidMtl, post:36, topic:134”]
I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.
[/quote]

Thanks for your opinions (everybody else too) - that’s my vote too, and I think it is the majority right now. Continue to critically evaluate (if you have anything you want to say).

1 Like