Proposed changes in NFS

(Emphasis mine)

Honestly I’m divided now.

I like splitting the metadata because it allows multiple people to simultaneously modify different files located in the same directory. But they still won’t be able to create or delete files concurrently, so it only fixes half of the problem.

I don’t like the extra cost in SDs. By taking this route we just doubled the amount of addressable data stored on Safe. Is it a problem? I don’t know, but I’d like the network to stay as lean as possible.

If bulk updates are possible by using the low-level API, that might work out OK. I would be manipulating SDs directly: for example, update the content of all the files first, then modify the list of files in the parent directory manually. This operation wouldn’t be atomic, though, which could be a problem if two people try to do it at the same time. Maybe adding a locking system would help, but then someone could keep a directory locked by accident and you’d have to create a way to handle that too.
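For concreteness, here’s a minimal sketch of that two-phase flow (the `SdClient` trait and `put_sd` are hypothetical stand-ins, not the real safe_core API) - the point is just that the parent-directory write happens after, and separately from, the file writes, so it isn’t atomic:

```rust
// Hypothetical low-level client: `SdClient` and `put_sd` are illustrative
// stand-ins, not the real safe_core API.
trait SdClient {
    fn put_sd(&mut self, id: [u8; 32], data: Vec<u8>);
}

/// Bulk update done "by hand": rewrite each file's SD first, then rewrite the
/// parent directory listing once at the end.
///
/// NOTE: the two phases are separate network mutations, so this is NOT atomic.
/// A second client racing on the same parent dir can still clobber the listing
/// between phase 1 and phase 2 -- which is exactly the locking problem above.
fn bulk_update<C: SdClient>(
    client: &mut C,
    file_updates: Vec<([u8; 32], Vec<u8>)>, // (file SD id, new content)
    parent_dir_id: [u8; 32],
    new_dir_listing: Vec<u8>, // re-serialised listing reflecting all updates
) {
    // Phase 1: update every file's own SD.
    for (file_id, content) in file_updates {
        client.put_sd(file_id, content);
    }
    // Phase 2: update the parent directory listing exactly once.
    client.put_sd(parent_dir_id, new_dir_listing);
}
```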

Maybe what is needed is to keep in mind that Safe is a new beast, and trying to do things the way we are used to might not be the best approach - like trying to fit a square peg into a round hole. Maybe we need more experience to really grasp its natural limitations, and not try to fix them by adding complexity but instead learn to work around them.

I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.

This ^^^ - there are so many things we continue to un-learn.

[quote=“DavidMtl, post:36, topic:134”]
I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.
[/quote]

Thanks for your opinions (everybody else too) - that’s my vote too, and I think it’s the majority right now. Continue to critically evaluate (if you have anything you want to say).

1 Like

Yes, in the back of my mind I believe that low-level is the way to go - it is almost impossible to design a high-level API for every use case out there; it has never worked in my experience. I think the best way is to give good tools (make the low level powerful) and let people make apps using them, rather than providing compound operations or specialised APIs, because the permutations are infinite and one wouldn’t know the user base of one’s specialised API - for all you know, you’ve catered to 2 devs in the world.

With this I would vote for keeping the so-called high level, in the form of NFS, as simple as possible. Would love to hear thoughts on this too.

3 Likes

Hmm… it doesn’t sound so good with the cons! :slight_smile: As @DavidMtl suggested, this will not help with additions either.

Maybe the pros don’t outweigh the cons and a better solution is needed?

Perhaps we are trying to get parallel performance in the wrong way here.

The whole point of updating multiple files at once is to increase throughput. That is, we use multiple connections to fill available bandwidth, because a single connection cannot fill it. This doesn’t result in any single file uploading more quickly, but throughput increases. However, what we really want is faster single connections, to use all available bandwidth and have faster uploads.

This makes me think that multiplexing over a single connection is actually more desirable. If we can update a single file by uploading multiple chunks simultaneously, with the metadata being updated only after the final chunk has been uploaded, we achieve high performance and avoid the mutation errors that arise when multiple files are changed simultaneously.
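As a rough illustration of that idea (not how the client currently works - the types and functions here are made-up stand-ins), the chunk PUTs for a single file could be fanned out over threads, with the DataMap and parent metadata written only once every chunk has landed:

```rust
use std::thread;

// Illustrative stand-ins, not the real safe_core / self_encryption types.
struct Chunk {
    name: [u8; 32],
    content: Vec<u8>,
}

fn put_chunk(chunk: &Chunk) {
    // Pretend network PUT of one ImmutableData chunk.
    println!("stored chunk {:02x?} ({} bytes)", &chunk.name[..4], chunk.content.len());
}

fn update_datamap_and_parent_metadata() {
    // Only reached once *every* chunk PUT has returned, so a half-uploaded
    // file never becomes visible through the directory metadata.
    println!("DataMap written, parent dir metadata updated");
}

fn upload_file(chunks: &[Chunk]) {
    // Fan the chunk PUTs out over threads: multiplexing one file's upload.
    thread::scope(|s| {
        for chunk in chunks {
            s.spawn(move || put_chunk(chunk));
        }
    }); // the scope waits for all spawned threads before returning
    update_datamap_and_parent_metadata();
}

fn main() {
    let chunks = vec![
        Chunk { name: [1; 32], content: vec![0; 1024] },
        Chunk { name: [2; 32], content: vec![1; 1024] },
    ];
    upload_file(&chunks);
}
```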

This may not help so much if there are lots of small files being uploaded, but it would definitely help with the larger ones.

Is this an option? Is it already being done or is it something which can be done without adding complexity to the client?

[Ignore:
The DataMap, a part that belongs to the metadata, will be updated for every change of a file you publish to the network. If you want DataMap updates to not affect the metadata (which is stored in the parent dir), the only way I can immediately think of is via indirection: store it separately in an SD (because SDs are the only ones whose ID does not change even if the data changes). That is what I mentioned here. And that will have all the problems in the post I tagged you in.]

Edit: Ah shit - I missed the first part of the quoted sentence:

Yes, this we can do, and it is currently done that way - the DataMap is returned (and hence the metadata in the parent updated) only after the final chunk is put out. Though right now the chunks are put serially, I think they can be made to go in parallel - although it affects the precision of the progress bar the frontend wanted (I know, I know - tell them :slight_smile: ).
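On the progress-bar point, going parallel doesn’t have to cost the frontend its progress reporting - a shared counter bumped as each chunk completes gives the same “n of total” information, just not in chunk order. A minimal sketch (illustrative only, not the actual client code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn put_chunk(index: usize) {
    // Stand-in for the real network PUT of one chunk.
    println!("chunk {} stored", index);
}

fn upload_with_progress(total_chunks: usize) {
    let completed = AtomicUsize::new(0);
    let completed = &completed; // shared by reference with every worker

    thread::scope(|s| {
        for i in 0..total_chunks {
            s.spawn(move || {
                put_chunk(i);
                // Bump the shared counter as each chunk lands. The frontend can
                // poll this for an "n of total" bar even though chunks no
                // longer complete in order.
                let done = completed.fetch_add(1, Ordering::SeqCst) + 1;
                println!("progress: {}/{}", done, total_chunks);
            });
        }
    });
}

fn main() {
    upload_with_progress(4);
}
```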

[Btw, does anyone know how to do a strikethrough? Instead I wrote “Ignore” above.]

1 Like

I don’t feel knowledgeable enough to make sensible judgements about implementation issues here, so my thoughts are fairly high level and more user than builder or implementor oriented.

As a user, my preference was always 3, but until @Fraser intervened, it seemed pointless to advocate that based on what was said in the OP, because I couldn’t address the objections made against it.

As a user, I’ve never used Google Docs to a significant extent, so I think of filesystems in terms of Windows & Linux, particularly backup/restore (as in the kind of applications Fraser mentioned, and similar) and assume (perhaps incorrectly) that those cover most common user expectations and use cases well.

For anything else, of course providing the low level tools is very important.

So with the caveats about my input I’ve given, my preference is for the NFS-level functionality to include the ability to roll back to a version at the directory level, including deleted files. That makes it easy to restore whole trees, and even to find a particular version of a file at a given date, even though the file versions aren’t indexed separately (but that would then be fairly easy to add in an app).

In this I don’t have much sense of performance issues, and I do agree that thinking differently with this new, er [hesitates,] paradigm [ducks] is both where the gold lies, and bloody difficult.

Fair enough - I have asked @Fraser to elaborate further so we can weigh the merits and demerits better, and to understand what exactly he views as directory versioning and version restoration.

I’ll count this as a vote for point 3, then.

1 Like

To probe further - say you have root/a/b/c (snapshot-0), then root/a/b (snapshot-1), and so on. What do you mean by backup/restore? Should the move from snapshot-0 to snapshot-1 be reflected in root? Something like root-v0 and root-v1, so that when you pick root-v0 you are no longer prompted for which version of a once you enter it, and then not prompted for which b, etc. If you chose root-v1 instead, you would expect: I enter it, find a, enter that, find b, enter that, find nothing. <<< Would this be your expectation?

If I understand you correctly, yes.

So going to a given version of root is unambiguous - you get it as it was at that time.

I couldn’t recall the name of another backup system I used which works like Apple’s Time Machine, but you’ve reminded me: rsnapshot

rsnapshot is a great set of scripts that uses links to minimise the storage space needed to preserve the file tree at each snapshot. What’s nice is that it creates a tree in the filesystem with a copy of each snapshot in it, which means you can explore it with all the normal filesystem tools. Each time a snapshot is generated, only changed files are copied into the tree; if a file is unchanged, a hard link is created instead, so each version of a file is only stored once.
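For anyone unfamiliar with the trick, the space saving comes from hard links: every snapshot directory looks like a full tree, but unchanged files are just extra links to the same inode. A rough sketch of that idea (not rsnapshot itself - simplified to flat directories, and the paths in `main` are made-up examples):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Build `new_snap` from `source`, reusing `prev_snap` where possible:
/// unchanged files become hard links to the previous snapshot (no extra
/// space), changed or new files are copied. Simplified: flat dirs only,
/// "unchanged" judged by size + mtime, like rsync's quick check.
fn snapshot(source: &Path, prev_snap: &Path, new_snap: &Path) -> io::Result<()> {
    fs::create_dir_all(new_snap)?;
    for entry in fs::read_dir(source)? {
        let entry = entry?;
        let name = entry.file_name();
        let src = entry.path();
        let prev = prev_snap.join(&name);
        let dest = new_snap.join(&name);

        let unchanged = match (fs::metadata(&src), fs::metadata(&prev)) {
            (Ok(a), Ok(b)) => {
                a.len() == b.len() && a.modified().ok() == b.modified().ok()
            }
            _ => false, // no previous copy, or couldn't stat it
        };

        if unchanged {
            fs::hard_link(&prev, &dest)?; // same inode, ~zero extra space
        } else {
            fs::copy(&src, &dest)?; // changed or new: store a fresh copy
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Example paths only.
    snapshot(
        Path::new("/home/user/docs"),
        Path::new("/backups/daily.1/docs"),
        Path::new("/backups/daily.0/docs"),
    )
}
```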

1 Like

Ha! Right - saving the chunks sequentially to allow the UI to easily report progress sounds like the wrong solution to that particular problem, but I sense I am preaching to the choir there! :slight_smile:

So, if we can multiplex uploads and get the progress correctly reported, we are on to a winner?

2 Likes

Should be doable (can’t promise until it’s fleshed out though - I am not a cheerful person, more a pessimist-until-it’s-done kind of guy :stuck_out_tongue: ). Thanks for all your opinions so far.

4 Likes

Nothing much would change here!

I need to try and get time to put together an RFC for SD which would allow us to get rid of AppendableData and make SD as flexible as possible without complicating it. I’m convinced this is doable. That SD would keep previous versions (where the current one doesn’t), so long story short, we wouldn’t then need DirListing, and hence DirMetadata::locator would just point to an SD representing an encrypted Dir rather than a DirListing - but that change isn’t essential for using dir-versioning.

2 Likes

I’m still failing to see how dir-versioning is more complex. I think I must be misunderstanding something here… this is how I see the flow for basic file ops under each approach:

File-Versioning:
To add a versioned file:

  1. Create FileMetadata and store as new SD::data(Vec<FileMetadata>)
  2. Add File to parent Dir::files and store new version of SD::data(encrypted(serialised(Dir)))

To modify a versioned file:

  1. Update FileMetadata in parent Dir::files and store new version of SD::data(Vec<FileMetadata>)

To delete a versioned file:

  1. Delete SD::data(Vec<FileMetadata>)
  2. Remove File from parent Dir::files and store new version of SD::data(encrypted(serialised(Dir)))

Dir-Versioning:
To add a file:

  1. Add FileMetadata to parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

To modify a file:

  1. Update FileMetadata in parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

To delete a file:

  1. Remove FileMetadata from parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

Edit: Forgot again that SDs don’t handle versions, so I changed SD::data(encrypted(serialised(Dir))) to SD::data(Vec<encrypted(serialised(Dir))>) in the dir-versioning list.
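To make the comparison concrete, here’s a minimal sketch of the dir-versioning flow listed above, with the versioned SD modelled as a plain Vec of dir listings (the type names are illustrative, not the actual safe_core ones, and encryption/serialisation are elided):

```rust
// Illustrative types only -- stand-ins for the real safe_core structures.
#[derive(Clone)]
struct FileMetadata {
    name: String,
    size: u64,
}

#[derive(Clone, Default)]
struct Dir {
    files: Vec<FileMetadata>,
}

/// The parent dir's SD, modelled as SD::data(Vec<encrypted(serialised(Dir))>):
/// every mutation appends one more full copy of the listing.
struct VersionedDirSd {
    versions: Vec<Dir>, // encryption/serialisation elided for clarity
}

impl VersionedDirSd {
    fn current(&self) -> Dir {
        self.versions.last().cloned().unwrap_or_default()
    }

    fn store_new_version(&mut self, dir: Dir) {
        self.versions.push(dir);
    }

    // Add a file: one mutation of the parent dir's SD.
    fn add_file(&mut self, meta: FileMetadata) {
        let mut dir = self.current();
        dir.files.push(meta);
        self.store_new_version(dir);
    }

    // Modify a file: again a single mutation of the same SD.
    fn modify_file(&mut self, meta: FileMetadata) {
        let mut dir = self.current();
        if let Some(f) = dir.files.iter_mut().find(|f| f.name == meta.name) {
            *f = meta;
        }
        self.store_new_version(dir);
    }

    // Delete a file: still a single mutation; the older listings keep the
    // entry, which is what makes deleted files recoverable under dir-versioning.
    fn delete_file(&mut self, name: &str) {
        let mut dir = self.current();
        dir.files.retain(|f| f.name != name);
        self.store_new_version(dir);
    }
}
```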

1 Like

Consider:
root/a/b/c in which root is at version 3 == root-v3. Now the next snapshot is root/a/b. In your method (and I might be wrong here) what I believe will happen is that the version of root remains root-v3. You enter root, you get the latest version of a by default, and then the latest version of b (which does not have c).

But if you read what people expect, they seem to want root-v3 to progress to root-v4 with that operation. Now if you wanted to view root-v3, it should fetch (by default) the a corresponding to that time, and entering that, the b corresponding to that time, which contains c. So it’s like a whole backup/restore kind of operation. This is certainly more complex.

There is no root-v4 though! Root didn’t change, so I think it would be pointless to add a new version with the same details as the previous. I guess you’re implying that every dir-listing in the entire tree should change with every snapshot (i.e. every time any dir is changed, they all are)?

For the user to “see” the v4 snapshot, they have to recurse down the tree anyway, right? So, if we only update the specific dir affected rather than doing a full snapshot kinda thing, then to view this, the user would pull the root, parse it, find out about a, pull its listing, parse and so on… When they get to b, they find two versions, say b0 and b1. It’s up to the app how it wishes to display that info. If it’s only presenting the current state, it only parses b1 and shows dir b with no dir c. If it’s trying to present the state from a time point before b1 was added, it parses b0 and shows dirs b and c.

1 Like

Yes, that is also what I expected - but that’s not what people seem to think about it. I specifically asked if it was viewed as an entire-tree snapshot (like backup/restore) and the answer was yes.

@happybeing - would you be OK if it was like what Fraser is saying? (It’s not like a conventional snapshot-tree backup/restore operation - would you still prefer it?)

@Fraser: Another point - with file-only versioning, only one thing would grow: that particular file’s metadata vec.
So root/a/file-v0 then root/a/file-v1 would imply that when you fetched SD::data(a) you would have fetched:

struct Dir {
    sub_dir: Vec<DirMetadata>,
    files: Vec<File>,
}
// where roughly
enum File {
    Versioned(Vec<FileMetadata>),
    Unversioned(FileMetadata),
}

So you can see, if a had 50 KiB of data (i.e. if sizeof(Dir-for-a) == 50 KiB), then a new version of the file would only cause a slight increase in size (previous size + the size of the new file’s metadata). Say the increase is 1 KiB, so the total is now 51 KiB; assume next time it would be 52 KiB. But with dir-versioning you would replicate the entire 50 KiB (as SD::data(a) is now SD::data(Vec<a>)) => new SD size ~ 101 KiB, and probably already kicked out of the SD’s size limit. The next update takes it to 153 KiB, etc. So after a couple of updates of that file you would be at 52 KiB with file-only versioning but already at 153 KiB with dir-versioning.
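Just to make that arithmetic concrete, a tiny model of the growth (sizes in KiB, using the same 50 KiB listing and ~1 KiB-per-update assumption as above):

```rust
// Rough growth model for the 50 KiB example above (sizes in KiB).
// Assumes each file update adds ~1 KiB of new FileMetadata.
fn file_versioning_size(initial: u64, updates: u64) -> u64 {
    // Only the file's metadata vec grows: 50, 51, 52, ...
    initial + updates
}

fn dir_versioning_size(initial: u64, updates: u64) -> u64 {
    // Every update appends a full copy of the (slightly larger) dir listing:
    // 50, 50 + 51 = 101, 101 + 52 = 153, ...
    (0..=updates).map(|i| initial + i).sum()
}

fn main() {
    for n in 0..=3u64 {
        println!(
            "after {} updates: file-versioning {} KiB, dir-versioning {} KiB",
            n,
            file_versioning_size(50, n),
            dir_versioning_size(50, n)
        );
    }
}
```

Running it reproduces the gap above: 52 KiB vs 153 KiB after two updates.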

Yes I think it is still preferable, though not quite as much!

Clearly more work than my initial interpretation, but it still seems more intuitive to start looking at the directory level - because we group files together that way for a reason.

If I understand it correctly, the same information can be reconstructed from file versioning - though it does seem like it could be more work (processing) to reconstruct the state of a given directory, and I think that’s a common use case and likely to remain so.

I am though uncertain about all this and interested to hear contra arguments because I’m very much in learning mode - and I love to learn. This is to me a fascinating discussion, and is helping me understand more about the SAFEnetwork and maybe one day, where some of those gold nuggets are hidden :slight_smile:

1 Like

Thread Summary:

We evaluated both file and directory versioning, and each currently has its strengths and disadvantages. We need to factor the way the SAFE Network works into our discussions as well, otherwise the practical outcome will be skewed. One drawback of file versioning is that one wouldn’t be able to retrieve a deleted file, but this can be worked around by client apps storing a copy of it somewhere if this functionality is wanted, e.g. they can move deleted items to a trash folder before removing them from the directory listing, for recovery of accidentally deleted items. From a different perspective, however, not being able to recover deleted files might be useful - maybe the user had sensitive information and wanted permanent and secure deletion.
With dir versioning the disadvantages relate to size. If a file in the dir is changed, the entire directory listing is replicated with the new file, causing more bloat. This is explained here. Similarly, if there are many files which do not require versioning support (just one or a few require it), they all suffer due to the entire dir being versioned. This size issue with directory versioning could be countered by only keeping a diff of the changes when a dir is mutated; this will however need to be looked into in a separate topic before concluding it is feasible. The good news, however, is that a deleted file can be recovered (which again comes with a caveat - if someone does want a truly permanent deletion, it wouldn’t be possible, or at least not easily).
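For what the diff idea might look like (purely a sketch for that separate topic - the types are made up, not a concrete proposal): instead of appending a whole new listing per mutation, each version records only the entries that changed, and a listing at any version is reconstructed by replaying the diffs.

```rust
// Illustrative only -- one possible shape for "keep a diff of changes"
// instead of a full copy of the dir listing per version.
#[derive(Clone)]
struct FileMetadata {
    name: String,
    size: u64,
}

enum DirChange {
    Added(FileMetadata),
    Modified(FileMetadata),
    Removed(String), // file name
}

/// Dir version history as a base listing plus per-version diffs.
struct DirHistory {
    base: Vec<FileMetadata>,
    diffs: Vec<Vec<DirChange>>, // diffs[i] turns version i into version i + 1
}

impl DirHistory {
    /// Reconstruct the listing at a given version by replaying diffs.
    fn listing_at(&self, version: usize) -> Vec<FileMetadata> {
        let mut files = self.base.clone();
        for diff in self.diffs.iter().take(version) {
            for change in diff {
                match change {
                    DirChange::Added(m) | DirChange::Modified(m) => {
                        files.retain(|f| f.name != m.name);
                        files.push(m.clone());
                    }
                    DirChange::Removed(name) => files.retain(|f| &f.name != name),
                }
            }
        }
        files
    }
}
```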

This also ties in with self-encryption. Since ImmutableData is currently capped at 1 MiB, we want self-encryption to reflect that by giving us a DataMap, via recursive self-encryption, which is <= the current cap for ImmutableData. This code is presently in safe_core and is currently planned to be moved into the self-encryption library.

Ultimately we will also discuss whether deletable ImmutableData might help here by reducing unwanted data on the Network. The RFC is already there as a starting point.

With all this, we concluded that for now we will go with the file-versioning approach (which already covers a lot of use cases) and keep looking for any solution that could be better overall as the network matures and we learn more about what might pan out and what might not.

4 Likes