Proposed changes in NFS

My vote is for option 3, or a variant of it.

I think we could implement a simplified approach where we version directories (easiest if SD were updated to accommodate these requirements) but don’t version files. Regarding the two problems/questions about this approach:

Root would be unaware of dir-b at all times, wouldn’t it?

In the same way that Git disallows (easily) rewriting history, if we disallow branching of the SD, then the answer is “no”. We’d effectively be making all versions of a given dir read-only except the latest one.

I agree that it’s probably too time-consuming to implement an optimal solution right now where we store deltas for each version. However, we could relatively easily take the wasteful approach of storing each version as a complete dir listing, and optimise it when we have time. The changes at that stage should be transparent to end users (except for an improvement in performance) - the behaviour and API should remain the same.

I also think that we could easily support branching of the history by changing how SD works, but we wouldn’t need to expose this via the NFS API. NFS could continue to not support rewriting history and not allow conflicting versions of a given file to coexist, but the low-level API exposing raw SD operations would enable apps to support these things.

Another disadvantage of only versioning files (which I don’t think is mentioned?) is that a file which gets removed from a dir cannot be recovered. I would think recovering deleted files would be one of the more desirable features of a versioned filesystem?

Versioning dirs would avoid this issue, and I think that it would also encourage batch changes which should be easier for the network to handle (e.g. if we’re adding 100 files to a dir, it would be cheaper for the user and easier for the network if that was put as a single version change to the dir rather than 100 individual changes).
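To make that concrete, here’s a minimal sketch (with made-up, simplified types - not the real safe_core ones) of how a batch of 100 additions collapses into a single new dir version, i.e. one SD mutation instead of 100:

// Simplified stand-ins for the real NFS types; names are illustrative only.
#[derive(Clone)]
struct FileMetadata {
    name: String,
    size: u64,
}

#[derive(Clone)]
struct Dir {
    files: Vec<FileMetadata>,
}

// A versioned dir: each entry is a complete snapshot of the listing.
struct VersionedDir {
    versions: Vec<Dir>,
}

impl VersionedDir {
    fn latest(&self) -> &Dir {
        self.versions.last().expect("at least one version")
    }

    // Add many files in one go: clone the latest snapshot, append all the new
    // entries and push the result as ONE new version (one network mutation),
    // instead of one version per file.
    fn add_files_batch(&mut self, new_files: Vec<FileMetadata>) {
        let mut next = self.latest().clone();
        next.files.extend(new_files);
        self.versions.push(next);
    }
}

fn main() {
    let mut dir = VersionedDir { versions: vec![Dir { files: Vec::new() }] };
    let batch: Vec<FileMetadata> = (0..100)
        .map(|i| FileMetadata { name: format!("file-{}.txt", i), size: 0 })
        .collect();
    dir.add_files_batch(batch);
    assert_eq!(dir.versions.len(), 2); // 100 files added, only 1 new dir version
    assert_eq!(dir.latest().files.len(), 100);
}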

2 Likes

Both are good points - should be noted.

This sort of scares me a little - especially given the decentralised nature and all - how would things affect performance?

If you have something (like google-docs?) that saves changes often (flushing to the network, as things could be shareable), generating a new dir listing along with new chunks (of ImmutData, if the file is larger than what’s held inline in the DM) can cause a lot of data floating around on the network - how would that play with churn? You could design around pointers and indirections, but that would cause performance losses, and unless pseudo-something is fleshed out it is difficult to see how much those things would affect churn, performance etc.

Consider a snapshot that looks like root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version? What would you expect? I would, for instance (and maybe I am wrong in this), expect this to be reflected in different versions of root, and pulling a version should restore the tree snapshot at that point - because in my head, a dir is not just what’s immediately under it, but the entire subtree under it.

1 Like

OTOH, thinking a little more - maybe not. Again taking google-docs (which is versioned AFAIK): a doc cannot be recovered if you delete it (from the bin too), while in Git you can go back to a commit which had the file you later deleted. IDK which is in more demand - but here’s another way to look at it: provide simpler, more measurable stuff while allowing apps to build more complex things if required? How would that pan out?

Why not leave the ability to go back to any old version up to an APP which copies the datamap of each version and records it against that version? That allows any copy of the file to be retrieved without putting any burden on the structure or SAFE code. We only need to be able to copy the datamap.
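Roughly what such an app could do - just a sketch with a made-up DataMap stand-in (the real one comes from self-encryption), showing that the version bookkeeping can live entirely app-side:

use std::collections::BTreeMap;

// Hypothetical stand-in for the real datamap type.
#[derive(Clone)]
struct DataMap(Vec<u8>);

// App-side history for a single file: the app copies the datamap after every
// write and records it against a version number. NFS itself stays simple.
#[derive(Default)]
struct FileHistory {
    versions: BTreeMap<u64, DataMap>,
    next_version: u64,
}

impl FileHistory {
    // Record the datamap produced by the latest write; returns its version.
    fn record(&mut self, data_map: DataMap) -> u64 {
        let v = self.next_version;
        self.versions.insert(v, data_map);
        self.next_version += 1;
        v
    }

    // Any old version (even of a file later removed from its dir) can be
    // fetched again as long as the app kept its datamap.
    fn datamap_for(&self, version: u64) -> Option<&DataMap> {
        self.versions.get(&version)
    }
}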

1 Like

Yes, I would prefer this too. To be more precise:
[quote=“ustulation, post:23, topic:134”]
but here’s another way to look at it - provide a simpler, more measurable stuffs while allowing app’s to build more complex things if required ? How would that pan out ?
[/quote]

By this logic, NFS would do just the file versioning mentioned in OP-point-1. Apps could use the low-level APIs (we will make them more powerful if need be) to design whatever system they want, however complicated they want it. Other apps can choose whether their requirements are fulfilled by what the NFS APIs provide, or whether they need more granularity and want to use those other apps as the base (e.g. if someone builds Git-like stuff on SAFE).

Is that what you meant too?

1 Like

I would expect it to improve performance. I don’t know how Git works under the hood, but at a very basic level I’d expect the optimisation to be something like changing from storing the complete dir listing each time to storing only a diff (if the diff is smaller than the whole updated dir listing), with a full listing stored every so often as a checkpoint (so we’re not having to apply a gazillion diffs to an antique version). So the frequency of version PUTs wouldn’t change - only the size of the data being put in each version.
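A minimal sketch of that idea (hypothetical types, not a proposal for the actual on-network format): keep a full listing every so often as a checkpoint, store diffs in between, and rebuild a requested version by replaying the few diffs after the nearest checkpoint.

// Simplified, hypothetical storage of dir versions.
#[derive(Clone)]
struct Dir {
    files: Vec<String>, // file names only, for brevity
}

enum StoredVersion {
    // Full listing, stored every so often as a checkpoint.
    Checkpoint(Dir),
    // Otherwise only the changes relative to the previous version.
    Diff { added: Vec<String>, removed: Vec<String> },
}

// Rebuild version `v`: walk back to the nearest checkpoint, then replay the
// diffs after it. PUT frequency is unchanged versus storing full listings;
// only the size of each PUT shrinks.
fn reconstruct(history: &[StoredVersion], v: usize) -> Dir {
    let (start, mut dir) = history[..=v]
        .iter()
        .enumerate()
        .rev()
        .find_map(|(i, s)| match s {
            StoredVersion::Checkpoint(d) => Some((i, d.clone())),
            StoredVersion::Diff { .. } => None,
        })
        .expect("version 0 is always a checkpoint");

    for s in &history[start + 1..=v] {
        if let StoredVersion::Diff { added, removed } = s {
            dir.files.retain(|f| !removed.contains(f));
            dir.files.extend(added.iter().cloned());
        }
    }
    dir
}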

I agree here, but I’d argue our NFS API wouldn’t be best suited for this use-case. For something like that, I’d advocate using the low-level API and interacting directly with SDs (which I think can be designed to support this use-case - within reason).
[quote=“ustulation, post:22, topic:134”]
Consider a snapshot that looks like root/dir-a/dir-b and then root/dir-a. Should root be considered to be in a new version? What would you expect?
[/quote]

I’d expect not (assuming we’re not doing something like updating a “last-modified” timestamp for dir-a).

That shouldn’t be affected. I’d expect the latest version of root to show one subdir - namely dir-a. That’s the same situation for both snapshots. Subsequently pulling dir-a would give the user the choice of version n (latest) with no dir-b or version n-1 with dir-b showing.

Do have a look here too. What do you think of that?

Not sure how one is more measurable than the other, and I don’t see one as significantly simpler than the other (assuming we disallow branching of history). We’re either storing versions of a given file as an SD containing the metadata and datamap for that file, or we’re storing versions of a given dir containing the contents of that dir.
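For whatever it’s worth, the two shapes being compared are roughly these (a deliberately simplified side-by-side, not the real definitions):

// Option 1: file versioning. One SD per file holds every version of that
// file's metadata + datamap; the dir listing itself is unversioned.
struct FileVersionsSd {
    versions: Vec<FileVersion>, // append-only if we disallow rewriting history
}
struct FileVersion {
    size: u64,
    modified: u64,     // timestamp stand-in
    data_map: Vec<u8>, // serialised-datamap stand-in
}

// Option 3 (variant): dir versioning. One SD per dir holds every version of
// the dir's listing; files are not individually versioned.
struct DirVersionsSd {
    versions: Vec<DirSnapshot>, // likewise append-only
}
struct DirSnapshot {
    sub_dirs: Vec<String>, // names/locators of child dirs
    files: Vec<FileEntry>,
}
struct FileEntry {
    name: String,
    data_map: Vec<u8>, // datamap inline in the listing
}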

Basically. Along the lines of: keep it simple, and provide the functionality for APPs to do the extras whichever way the APP builder wants.

1 Like

You only had to give me about another 10 seconds to post my reply! Are you watching me type so you can ninja me? :smiley:

1 Like

We moved 1430 miles in those 10 sec through the universe - no joke man :smiley:

OK, I just checked google-drive (merely due to convenience - feel free to point out other examples) - it doesn’t seem to work the way you say.

  1. A version of a directory shows the entire snapshot of the sub-tree at that point in time, even if deep down something was removed etc.

  2. (This I already mentioned though) - you can’t seem to restore something you deleted permanently (it goes into the bin first and you then delete it from there).

My point being: what we think of as versioning may not be what people expect. Just go with the simplest case and let others build on top of it if required.

For this, you are OK with the additional cost of 1 SD per file (for the indirection), right? It buys you the ability to update different files in the same folder simultaneously (as opposed to bulk updates, which will be possible without it too) - there’s a rough sketch of this at the end of this post. @DavidMtl @Traktion (since you both liked the post - I take it as agreement :slight_smile: )

  1. Those are not big data, so most of the 100 KiB (maybe more if expanded) of space in each SD will be wasted.

  2. It will slightly complicate sharing - you’ll have to traverse all the file indirections to structured-data and modify those to include owners as well. So previously, if there were root/a/b with 1000 files under each, you would need to add owners to only 3 SDs. With the above change you would need to add/remove/manipulate owners on 3003 SDs.

Maybe there are more I haven’t noticed.

But is everyone OK with that ^^^?
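To spell out the indirection being discussed (made-up names; the real locator would be a DataIdentifier, as in the sketch further down the thread), the difference is whether the dir listing holds the datamap itself or just a pointer to a per-file SD:

// Hypothetical stand-in for a pointer to another StructuredData.
struct SdLocator([u8; 32]);

// Without indirection: the datamap sits inline in the dir listing, so every
// file change means mutating the shared dir SD.
struct FileMetadataInline {
    name: String,
    data_map: Vec<u8>, // serialised-datamap stand-in
}

// With indirection: the listing only points at a per-file SD, so two people
// can update two different files without touching the same SD. The cost is
// one extra (mostly empty) SD per file, and sharing a tree means adding
// owners to every one of those per-file SDs too - the 3 vs 3003 example.
struct FileMetadataIndirect {
    name: String,
    file_sd: SdLocator, // that SD holds the datamap and version info
}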

I guess I’m thinking of Git, Subversion, Apple’s Time Machine, MS’s System Restore for example.

So, to be clear, say with your example using root, dir-a and dir-b, currently we only have root and dir-a, but yesterday we also had dir-b. If we view the current tree, we see only root and dir-a - there’s no mention of dir-b. If we view a snapshot from yesterday, we see all three dirs. Is that what you’re saying?

If so, that’s what I’m saying would happen with dir versioning too.

Right - but for the other examples you can restore deleted items.

Agreed. (We’re probably both wrong!)

I don’t agree that file versioning is simpler than dir versioning; I think they’re comparable in complexity. However I’m somewhat convinced that dir versioning could be cheaper for the users and easier for the network to handle.

2 Likes

I’m not :slight_smile:

Obviously my main concern is what we’re debating in the concurrent discussion (where I prefer dir versioning to avoid the cost of 3003 SDs), but other than that, I think it would be worthwhile adding a Vec<u8> to FurtherInfo too, so that if the user has opaque data which needs to change when the file’s contents change, they aren’t forced to update the FileMetadata.
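Roughly what I mean - FurtherInfo’s actual fields aren’t shown in this thread, so treat the shape below as an assumption; the only point is the extra user_metadata field:

// Assumed shape of FurtherInfo (the per-file structure holding the datamap).
struct FurtherInfo {
    data_map: Vec<u8>,      // stand-in for the real DataMap
    modified: u64,          // stand-in for the real Time
    user_metadata: Vec<u8>, // proposed addition: opaque app data that changes
                            // with the file's contents, so the app doesn't
                            // have to touch FileMetadata in the dir listing.
}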

Me neither. It is more complicated and decreases performance - but I’m waiting for @DavidMtl’s and @Traktion’s opinions.

So shouldn’t this discourage us from providing dir-versioning? File versioning can be interpreted in only one way, while for dirs there seems to be divergence.

Can you give a modification of the types if we were to go the dir-versioning way?

A quick modification of the structure in the OP link would look like this:

/// `StructuredData::data(encrypted(serialised(DirListing)));`
/// where:
// =================================
enum DirListing {
    Versioned(Vec<Dir>),
    Unversioned(Dir),
}

struct Dir {
    sub_dirs: Vec<DirMetadata>,
    files: Vec<FileMetadata>,
}

// =================================
// If shared, people can update this independently of each other, attaching it to any existing tree they have.
struct DirMetadata {
    locator: DataIdentifier, // DataIdentifier::Structured(UNVERSIONED_SD_TYPE_TAG, XorName)
    encrypt_key: Option<secretbox::Key>,
    name: String,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
}

// =================================
struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

You can take it from there ^^^ if you want.

(Emphasis mine)

Honestly I’m divided now.

I like splitting the metadata because it allows multiple people to simultaneously modify different files located in the same directory. But they still won’t be able to create or delete files concurrently (that still mutates the shared dir listing), so it only fixes half of the problem.

I don’t like the extra cost in SDs. By taking this route we just multiplied by 2 the amount of addressable data stored on Safe. Is it a problem? I don’t know, by I’d like the network to stay as lean as possible.

If bulk updates are possible using the low-level API, that might work out OK. I would be manipulating SDs directly: for example, update the content of all files first and then modify the list of files in the parent directory manually. Though this means the operation is not atomic, which might be a problem if two people are trying to do it at the same time. Maybe adding a locking system would be good, but then someone could keep it locked by accident and you would have to create a way to handle that too.
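A sketch of that manual flow, with an invented, stubbed-out client interface just to show the ordering and where atomicity is lost:

// Invented, simplified client interface - purely illustrative.
struct Client;
struct DataMap(Vec<u8>);

impl Client {
    fn put_file_content(&self, _bytes: &[u8]) -> DataMap {
        DataMap(Vec::new()) // stub: self-encrypt + store chunks, return the datamap
    }
    fn update_dir_listing(&self, _dir: &str, _entries: &[(String, DataMap)]) {
        // stub: a single mutation of the parent dir's SD
    }
}

fn bulk_update(client: &Client, dir: &str, files: Vec<(String, Vec<u8>)>) {
    // 1) Upload every file's content first; nothing is visible in the dir yet.
    let entries: Vec<(String, DataMap)> = files
        .into_iter()
        .map(|(name, bytes)| (name, client.put_file_content(&bytes)))
        .collect();

    // 2) Then update the parent dir's listing once, referencing all the new
    //    datamaps. Steps 1 and 2 together are NOT atomic: if someone else
    //    mutates the dir SD in between, this final update can fail or clobber
    //    their change - hence the talk of locking / conflict handling.
    client.update_dir_listing(dir, &entries);
}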

Maybe what is needed is to keep in mind that SAFE is a new beast, and trying to do things the way we are used to might not be the best approach - like trying to fit a square peg into a round hole. Maybe we need more experience to really grasp its natural limitations, and not try to fix them by adding complexity but instead learn to work around them.

I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.

This ^^^ - there are so many things we continue to un-learn.
[quote=“DavidMtl, post:36, topic:134”]
I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.
[/quote]

Thanks for your opinions (everybody else too) - that’s my vote too, and I think it is the majority right now. Continue to critically evaluate (if you have anything you want to say).

1 Like

Yes, in the back of my mind I believe that low-level is the way to go - it is almost impossible to design a high-level API for every use case out there; it has never worked in my experience. I think the best way is to give good tools (make the low-level API powerful) and let people make apps using them, rather than providing compound operations or specialised APIs, because the permutations are infinite and one wouldn’t know the user-base of one’s specialised API - for all you know, you’ve catered to 2 devs in the world.

With this, I would vote for keeping the so-called high-level API (in the form of NFS) as simple as possible. Would love to hear thoughts on this too.

3 Likes

Hmm… it doesn’t sound so good with the cons! :slight_smile: As @DavidMtl suggested, this will not help with additions either.

Maybe the pros don’t outweigh the cons and a better solution is needed?

Perhaps we are trying to get parallel performance in the wrong way here.

The whole point of updating multiple files at once is to increase throughput. That is, we use multiple connections to fill available bandwidth, because a single connection cannot fill it. This doesn’t result in any single file uploading more quickly, but throughput increases. However, what we really want is faster single connections, to use all available bandwidth and have faster uploads.

This makes me think that multiplexing over a single connection is actually more desirable. If we can update a single file by uploading multiple chunks simultaneously, with the metadata being updated after the final chunk has been uploaded, we achieve high performance and avoid the mutation errors that occur when multiple files are changed simultaneously.

This may not help so much if there are lots of small files being uploaded, but it would definitely help with the larger ones.

Is this an option? Is it already being done or is it something which can be done without adding complexity to the client?
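For what it’s worth, a sketch of that “upload chunks in parallel, commit the metadata once at the end” idea - threads stand in for whatever connection handling the client really uses, and all names are invented:

use std::sync::mpsc;
use std::thread;

// Invented stand-ins: one chunk of a self-encrypted file, plus stubbed network ops.
struct Chunk {
    index: usize,
    bytes: Vec<u8>,
}

fn store_chunk(chunk: &Chunk) {
    let _ = (chunk.index, chunk.bytes.len()); // stub: PUT the immutable chunk
}

fn commit_metadata(chunk_count: usize) {
    let _ = chunk_count; // stub: the single SD mutation publishing the datamap
}

// Upload all chunks of ONE file in parallel and only update the file's
// metadata once the last chunk has landed. Throughput comes from the parallel
// chunk uploads, while there is still exactly one metadata mutation, so
// concurrent-file mutation conflicts don't multiply.
fn upload_file(chunks: Vec<Chunk>) {
    let total = chunks.len();
    let (tx, rx) = mpsc::channel();

    for chunk in chunks {
        let tx = tx.clone();
        thread::spawn(move || {
            store_chunk(&chunk);
            tx.send(chunk.index).expect("receiver alive");
        });
    }
    drop(tx); // close the channel so the receiver loop below can finish

    let done = rx.iter().count(); // wait for every chunk
    assert_eq!(done, total);
    commit_metadata(total);
}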