Proposed changes in NFS

Fraser · September 26, 2016, 10:44am

You only had to give me about another 10 seconds to post my reply! Are you watching me type so you can ninja me?

ustulation · September 26, 2016, 10:59am

We moved 1430 miles in those 10 sec through the universe - no joke man

OK, I just checked Google Drive (merely for convenience; feel free to point out other examples) and it doesn’t seem to work the way you say.

A version of a directory shows the entire snapshot of the sub-tree at that point in time, even if something deep down was later removed.
(I already mentioned this, though:) you can’t seem to restore something you deleted permanently (it goes into the bin first, and you then delete it from there).

My point is that what we think of as versioning may not be what people expect. Just go with the simplest case and let others build on top of it if required.

ustulation · September 26, 2016, 11:25am

For this, you are OK with the additional cost of 1 SD per file (for indirection), right? That buys you the ability to make simultaneous updates to different files in the same folder (as opposed to bulk updates, which will be possible without it too). @DavidMtl @Traktion (since you both liked the post, I take it as agreement with it)

Those are not big data, so most of the 100 KiB (and maybe more if expanded) of SD space will be wasted on them.
It will slightly complicate sharing: you’ll have to traverse all the file indirections to the structured data and modify those to include the owners as well. So previously, if there were root/a/b, each with 1000 files under it, you would need to add owners to only 3 SDs. With the above change you would need to add/remove/manipulate owners on 3003 SDs.

Maybe more that I haven’t noticed.

But is everyone ok with that ^^^ ?
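To make the 3 vs 3003 arithmetic concrete, here is a tiny sketch. The function name and the `indirection` flag are made up for illustration; it only counts the SDs whose owner list would have to change when the sub-tree is shared.

```rust
/// Number of SDs whose owners must be updated when a sub-tree is shared.
/// `dirs`: directories in the shared sub-tree (root, a, b => 3).
/// `files_per_dir`: files directly under each directory.
/// `indirection`: hypothetical flag, true if every file gets its own SD.
fn sds_to_update(dirs: u64, files_per_dir: u64, indirection: bool) -> u64 {
    if indirection {
        // One SD per directory plus one SD per file.
        dirs + dirs * files_per_dir
    } else {
        // File metadata lives inside the parent directory's SD.
        dirs
    }
}
```

With `dirs = 3` and `files_per_dir = 1000` this gives 3 without indirection and 3003 with it.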

Fraser · September 26, 2016, 11:25am

I guess I’m thinking of Git, Subversion, Apple’s Time Machine, MS’s System Restore for example.

So, to be clear, say with your example using root, dir-a and dir-b, currently we only have root and dir-a, but yesterday we also had dir-b. If we view the current tree, we see only root and dir-a - there’s no mention of dir-b. If we view a snapshot from yesterday, we see all three dirs. Is that what you’re saying?

If so, that’s what I’m saying would happen with dir versioning too.

Right - but for the other examples you can restore deleted items.

Agreed. (We’re probably both wrong!)

I don’t agree that file versioning is simpler than dir versioning; I think they’re comparable in complexity. However I’m somewhat convinced that dir versioning could be cheaper for the users and easier for the network to handle.

Fraser · September 26, 2016, 11:35am

I’m not

Obviously my main concern is what we’re debating in the concurrent discussion (where I prefer dir versioning to avoid the cost of 3003 SDs), but other than that, I think it would be worthwhile adding a Vec<u8> to FurtherInfo too so that if the user has opaque data which needs to be changed when the file’s contents change the user isn’t forced to update the FileMetadata.

ustulation · September 26, 2016, 12:31pm

Me neither. It is more complicated and decreases performance - but waiting for @DavidMtl and @Traktion 's opinions.

So should this not discourage us from providing dir-versioning? File versioning can be interpreted in only one way, whereas for directories the interpretations seem to diverge.

Can you give a modification of types if we were to go the dir-versioning way ?

A quick modification of the structure linked in the OP would look like this:

/// `StructuredData::data(encrypted(serialised(DirListing)));`
/// where:
// =================================
enum DirListing {
    Versioned(Vec<Dir>),
    Unversioned(Dir),
}

struct Dir {
    sub_dirs: Vec<DirMetadata>,
    files: Vec<FileMetadata>,
}

// =================================
// If shared, people can update this independently of each other, attaching it to any existing tree they have.
struct DirMetadata {
    locator: DataIdentifier, // DataIdentifier::Structured(UNVERSIONED_SD_TYPE_TAG, XorName)
    encrypt_key: Option<secretbox::Key>,
    name: String,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
}

// =================================
struct FileMetadata {
    name: String,
    size: u64,
    created: Time,
    modified: Time,
    user_metadata: Vec<u8>,
    data_map: DataMap,
}

You can take it from there ^^^ if you want.
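As a rough sketch of how the `DirListing` enum above might be used: reading the current state and recording a new one. The `Dir` here is a simplified stand-in (plain `String` names instead of `DirMetadata`/`FileMetadata`), and the method names `current`, `update` and `version_count` are assumptions for illustration, not part of any existing API.

```rust
#[derive(Clone, Debug, Default, PartialEq)]
struct Dir {
    sub_dirs: Vec<String>, // stand-in for Vec<DirMetadata>
    files: Vec<String>,    // stand-in for Vec<FileMetadata>
}

#[derive(Debug)]
enum DirListing {
    Versioned(Vec<Dir>),
    Unversioned(Dir),
}

impl DirListing {
    /// Latest state of the directory, if any version exists.
    fn current(&self) -> Option<&Dir> {
        match self {
            DirListing::Versioned(versions) => versions.last(),
            DirListing::Unversioned(dir) => Some(dir),
        }
    }

    /// Record a new state: appended when versioned, overwritten otherwise.
    fn update(&mut self, new_dir: Dir) {
        match self {
            DirListing::Versioned(versions) => versions.push(new_dir),
            DirListing::Unversioned(dir) => *dir = new_dir,
        }
    }

    /// How many states are retained.
    fn version_count(&self) -> usize {
        match self {
            DirListing::Versioned(versions) => versions.len(),
            DirListing::Unversioned(_) => 1,
        }
    }
}
```

In the versioned case every change to the directory, including a deletion, leaves the earlier `Dir` in the list, which is what would allow a later restore.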

DavidMtl · September 26, 2016, 1:58pm

(Emphasis mine)

Honestly I’m divided now.

I like splitting the metadata because it allows multiple people to modify, simultaneously, different files located in the same directory. But they won’t be able to create or delete multiple files, so it’s only half of the problem that it fixes.

I don’t like the extra cost in SDs. By taking this route we just multiplied by 2 the amount of addressable data stored on Safe. Is it a problem? I don’t know, but I’d like the network to stay as lean as possible.

If bulk updates are possible by using the low-level API, that might work out OK. So I would be manipulating SDs directly: for example, update the content of all files first and then modify the list of files in the parent directory manually. Though this means that the operation is not atomic, which might be a problem if two people are trying to do it at the same time. Maybe adding a locking system would be good, but then someone could keep it locked by accident and you would have to create a way to handle that too.
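One alternative to locking, sketched under a loud assumption: that a stored SD only accepts an update whose version is exactly one higher than the stored one. The `StructuredData` struct, `put` method and `MutationError` type below are hypothetical models for illustration, not the network’s actual API. The writer who loses the race gets an error, re-fetches and retries, so nothing stays locked by accident.

```rust
#[derive(Debug, PartialEq)]
enum MutationError {
    VersionMismatch { expected: u64, got: u64 },
}

/// Hypothetical, simplified model of a structured-data item.
struct StructuredData {
    version: u64,
    data: Vec<u8>,
}

impl StructuredData {
    /// Accept the update only if it carries the next version number.
    fn put(&mut self, new_version: u64, data: Vec<u8>) -> Result<(), MutationError> {
        let expected = self.version + 1;
        if new_version != expected {
            return Err(MutationError::VersionMismatch { expected, got: new_version });
        }
        self.version = new_version;
        self.data = data;
        Ok(())
    }
}
```

If two people both read version 0 and both try to write version 1, the second write fails and leaves the first writer’s data intact; the second person then has to merge against version 1 and submit version 2.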

Maybe what is needed is to keep in mind that Safe is a new beast and trying to do things the way we are used to might not be the best approach, like trying to fit a square peg inside a round hole. Maybe we need more experience to really grasp its natural limitations and not try to fix them by adding complexity but instead learn to work around them.

I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.

ustulation · September 26, 2016, 3:17pm

This ^^^ - there are so many things we continue to un-learn.

DavidMtl:

I vote for that. Keep it simple, give us tools to manipulate the lowest level of the API so we can figure out how to do what we need. After all, I could roll my own file system by handling SDs directly if I don’t like what NFS is doing.

Thanks for your opinions (everybody else too) - That’s my vote too and i think it is a majority right now. Continue to critically evaluate (if you have anything you want to say).

ustulation · September 26, 2016, 3:34pm

Yes, in the back of my mind I believe that low-level is the way to go - it is almost impossible to design high-level for every use case out there - it has never worked in my experience. I think the best way is to give good tools (make low-level powerful) and let people make apps using them, rather than compound operations or specialised APIs, because the permutations are infinite and one wouldn’t know what the user base of one’s specialised API is - for all you know, you catered to 2 devs in the world.

With this i would vote on keeping the so called high-level in form of nfs as simple as possible. Would love to hear thoughts on this too.

Traktion · September 26, 2016, 3:58pm

ustulation:

For this, you are OK with the additional cost of 1 SD per file (for indirection), right? That buys you the ability to make simultaneous updates to different files in the same folder (as opposed to bulk updates, which will be possible without it too). @DavidMtl @Traktion (since you both liked the post, I take it as agreement with it)

Those are not big data, so most of the 100 KiB (and maybe more if expanded) of SD space will be wasted on them.

It will slightly complicate sharing: you’ll have to traverse all the file indirections to the structured data and modify those to include the owners as well. So previously, if there were root/a/b, each with 1000 files under it, you would need to add owners to only 3 SDs. With the above change you would need to add/remove/manipulate owners on 3003 SDs.

Maybe more that I haven’t noticed.

But is everyone ok with that ^^^ ?

Hmm… it doesn’t sound so good with the cons! As @DavidMtl suggested, this will not help with additions either.

Maybe the pros don’t outweigh the cons and a better solution is needed?

Traktion · September 26, 2016, 4:15pm

Perhaps we are trying to get parallel performance in the wrong way here.

The whole point of updating multiple files at once is to increase throughput. That is, we use multiple connections to fill available bandwidth, because a single connection cannot fill it. This doesn’t result in any single file uploading more quickly, but throughput increases. However, what we really want is faster single connections, to use all available bandwidth and have faster uploads.

This makes me think that multiplexing over a single connection is actually more desirable. If we can update a single file by updating multiple chunks simultaneously, with the metadata being updated after the final chunk has been uploaded, we achieve high performance and avoid the mutation errors when multiple files are changed simultaneously.

This may not help so much if there are lots of small files being uploaded, but it would definitely help with the larger ones.

Is this an option? Is it already being done or is it something which can be done without adding complexity to the client?
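A minimal sketch of that ordering: chunks go out in parallel, and the metadata is only built once every chunk has finished. Everything here is a stand-in for illustration: `put_chunk` just fabricates a name instead of touching the network, the “data map” is a plain list of chunk names, and `FileMetadata` is cut down to three fields. `chunk_size` must be non-zero.

```rust
use std::thread;

/// Stand-in for putting one chunk on the network; returns a fake chunk name.
fn put_chunk(index: usize, chunk: &[u8]) -> String {
    format!("chunk-{}-{}", index, chunk.len())
}

/// Uploads all chunks in parallel and returns the ordered chunk names
/// only after every upload thread has been joined.
fn upload_file(content: &[u8], chunk_size: usize) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; `collect` starts them all.
        let handles: Vec<_> = content
            .chunks(chunk_size)
            .enumerate()
            .map(|(i, chunk)| scope.spawn(move || put_chunk(i, chunk)))
            .collect();
        // Join in spawn order so the result keeps the chunk order.
        handles
            .into_iter()
            .map(|h| h.join().expect("chunk upload panicked"))
            .collect()
    })
}

/// Simplified metadata; the real FileMetadata has more fields.
struct FileMetadata {
    name: String,
    size: u64,
    data_map: Vec<String>,
}

/// The metadata (and so the parent dir) is touched once, after the last chunk.
fn store_file(name: &str, content: &[u8], chunk_size: usize) -> FileMetadata {
    let data_map = upload_file(content, chunk_size);
    FileMetadata { name: name.to_string(), size: content.len() as u64, data_map }
}
```

The parent directory sees a single mutation per file however many chunks were in flight, which is the property that avoids the mutation errors mentioned above.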

ustulation · September 26, 2016, 4:23pm

[Ignore:
DataMap, a part that belongs to the metadata, will be updated though for every change of the file you publish to the network. If you want DataMap updates to not affect the metadata (which is stored in the parent dir), the only way I can immediately think of is via the indirection: store it separately in an SD (because SDs are the only ones whose ID does not change even if the data changes). That is what I mentioned here. And that will have all the problems in the post I tagged you in.]

Edit: Ah shit - i missed the first part of the quoted sentence:

Yes, this we can do, and it is currently done that way - the DataMap is returned (and hence the metadata in the parent updated) only after the final chunk is put out. Though right now the chunks are put serially, I think they can be made to go in parallel - although it affects the precision of the progress bar the frontend wanted (I know, I know - tell them).

[Btw does anyone know how to do a strike-through, instead of the “Ignore” I wrote above?]
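On the progress bar: one possible way to keep it meaningful with parallel puts is to count completed chunks rather than rely on their order. This is only a sketch under assumptions: `upload_with_progress` and the `on_progress` callback are hypothetical names, and the network put is replaced by a no-op.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Puts every chunk on its own thread and reports `(finished, total)`
/// after each one completes. Returns the number of chunks finished.
fn upload_with_progress(
    chunks: &[Vec<u8>],
    on_progress: &(dyn Fn(usize, usize) + Sync),
) -> usize {
    let done = AtomicUsize::new(0);
    let total = chunks.len();
    thread::scope(|scope| {
        for chunk in chunks {
            let done = &done;
            scope.spawn(move || {
                let _ = chunk.len(); // stand-in for the actual network put
                // Each thread gets a unique count, whatever the finish order.
                let finished = done.fetch_add(1, Ordering::SeqCst) + 1;
                on_progress(finished, total);
            });
        }
    });
    done.load(Ordering::SeqCst)
}
```

The callback may be invoked out of order (for instance 2 before 1), so a UI would want to display the maximum count seen so far; the granularity is still one tick per chunk.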

happybeing · September 26, 2016, 4:54pm

I don’t feel knowledgeable enough to make sensible judgements about implementation issues here, so my thoughts are fairly high level and more user than builder or implementor oriented.

As a user, my preference was always 3, but until @Fraser intervened, it seemed pointless to advocate that based on what was said in the OP, because I couldn’t address the objections made against it.

As a user, I’ve never used Google Docs to a significant extent, so I think of filesystems in terms of Windows & Linux, particularly backup/restore (as in the kind of applications Fraser mentioned, and similar) and assume (perhaps incorrectly) that those cover most common user expectations and use cases well.

For anything else, of course providing the low level tools is very important.

So with the caveats about my input I’ve given, my preference is for the NFS-level functionality to include the ability to really roll back to a version at the directory level, including deleted files. That makes it easy to restore whole trees, and even to find a particular version of a file at a given date even though the file versions aren’t indexed separately (but that would then be fairly easy to add in an App).
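A sketch of what “find a particular version of a file at a given date” could look like on top of directory versions. The types are simplified stand-ins invented for illustration (timestamps are plain `u64`s, and a snapshot carries only file names and modification times), and the history is assumed to be ordered oldest first.

```rust
#[derive(Clone, Debug, PartialEq)]
struct FileEntry {
    name: String,
    modified: u64,
}

#[derive(Clone, Debug)]
struct DirSnapshot {
    taken_at: u64,
    files: Vec<FileEntry>,
}

/// Newest snapshot taken at or before `when`. Assumes `history` is oldest first.
fn snapshot_at(history: &[DirSnapshot], when: u64) -> Option<&DirSnapshot> {
    history.iter().rev().find(|s| s.taken_at <= when)
}

/// The file as it existed in the directory at time `when`, if it existed.
fn file_at<'a>(history: &'a [DirSnapshot], name: &str, when: u64) -> Option<&'a FileEntry> {
    snapshot_at(history, when)?.files.iter().find(|f| f.name == name)
}
```

A file deleted in a later snapshot is still reachable by asking for a time before the deletion, without any per-file version index.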

In this I don’t have much sense of performance issues, and I do agree that thinking differently with this new, er [hesitates,] paradigm [ducks] is both where the gold lies, and bloody difficult.

ustulation · September 26, 2016, 5:07pm

Fair enough - I have asked @Fraser to elaborate on it further, so we can weigh the merits and demerits better and understand exactly what he views as directory versioning and version restoration.

I’ll count this as a vote then for point-3.

ustulation · September 26, 2016, 5:17pm

To probe further: say you have root/a/b/c (snapshot-0), then root/a/b (snapshot-1), and so on. What do you mean by backup/restore? Should the move from snapshot-0 to snapshot-1 be reflected in root? Something like root-v0 and root-v1, so that once you pick root-v0 you are no longer asked which version of a you want when you enter it, nor which b, etc. So if you chose root-v1, all you would expect is: I enter it and find a, enter that and find b, enter that and find nothing. Would this be your expectation?

happybeing · September 26, 2016, 5:42pm

If I understand you correctly, yes.

So going to a given version of root is unambiguous - you get it as it was at that time.

I failed to recall the name of another backup system which works like Apple’s Time Machine and which I used, but you’ve reminded me: rsnapshot

rsnapshot is a great set of scripts that uses links to minimise the file storage space needed to preserve the file tree at each snapshot. What’s nice is that it creates a tree in the filesystem with a copy of each snapshot in it, which means you can explore the tree with all the normal filesystem tools. Each time a snapshot is generated, only changed files are copied to the tree; if a file is unchanged, a link is created instead, so each version of a file is only stored once.
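The same idea in miniature, as an in-memory analogy rather than how rsnapshot itself is implemented: each snapshot maps file names to shared content, and an unchanged file just gets another reference to the existing content (playing the role of the link). `Snapshot` and `next_snapshot` are names invented for this sketch.

```rust
use std::collections::BTreeMap;
use std::rc::Rc;

/// One snapshot: file name -> shared, immutable content.
type Snapshot = BTreeMap<String, Rc<Vec<u8>>>;

/// Builds the next snapshot from the current files. Changed or new files
/// get fresh storage; unchanged files share the previous snapshot's copy.
fn next_snapshot(prev: &Snapshot, current: &BTreeMap<String, Vec<u8>>) -> Snapshot {
    current
        .iter()
        .map(|(name, content)| {
            let stored = match prev.get(name) {
                // Unchanged: reuse the existing storage (the "link").
                Some(old) if **old == *content => Rc::clone(old),
                // Changed or new: store a fresh copy.
                _ => Rc::new(content.clone()),
            };
            (name.clone(), stored)
        })
        .collect()
}
```

Files missing from `current` simply do not appear in the new snapshot, while older snapshots still hold them, so a deleted file can be recovered from an earlier snapshot.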

Traktion · September 26, 2016, 6:27pm

ustulation:

Traktion:

with the metadata being updated after the final chunk has been uploaded,

Traktion:

Is this an option? Is it already being done or is it something which can be done without adding complexity to the client?

[Ignore:
DataMap, a part that belongs to the metadata, will be updated though for every change of the file you publish to the network. If you want DataMap updates to not affect the metadata (which is stored in the parent dir), the only way I can immediately think of is via the indirection: store it separately in an SD (because SDs are the only ones whose ID does not change even if the data changes). That is what I mentioned here. And that will have all the problems in the post I tagged you in.]

Edit: Ah shit - i missed the first part of the quoted sentence:

Traktion:

If we can update a single file, by updating multiple chunks simultaneously

Yes, this we can do, and it is currently done that way - the DataMap is returned (and hence the metadata in the parent updated) only after the final chunk is put out. Though right now the chunks are put serially, I think they can be made to go in parallel - although it affects the precision of the progress bar the frontend wanted (I know, I know - tell them).

[Btw does anyone know how to do a strike-through, instead of the “Ignore” I wrote above?]

Ha! Right - saving the chunks sequentially to allow the UI to easily report progress sounds like the wrong solution to that particular problem, but I sense I am preaching to the choir there!

So, if we can multiplex uploads and get the progress correctly reported, we are on to a winner?

ustulation · September 26, 2016, 6:45pm

Should be doable (can’t promise until fleshed out though - I am not a cheerful person, more of a pessimist-until-done kind of guy). Thanks for all your opinions so far.

Fraser · September 27, 2016, 8:52am

Nothing much would change here!

I need to try and get time to put together an RFC for SD which would allow us to get rid of AppendableData and make it as flexible as possible without complicating it. I’m convinced this is do-able. That SD would keep previous versions (where the current one doesn’t) so long story short, we wouldn’t then need DirListing and hence DirMetadata::locator would just point to an SD representing an encrypted Dir rather than DirListing - but that’s not essential to change to use dir-versioning.

Fraser · September 27, 2016, 9:21am

I’m still failing to see how dir-versioning is more complex. I think I must be misunderstanding something here… this is how I see the flow for basic file ops under each approach:

File-Versioning:
To add a versioned file:

Create FileMetadata and store as new SD::data(Vec<FileMetadata>)
Add File to parent Dir::files and store new version of SD::data(encrypted(serialised(Dir)))

To modify a versioned file:

Update FileMetadata in parent Dir::files and store new version of SD::data(Vec<FileMetadata>)

To delete a versioned file:

Delete SD::data(Vec<FileMetadata>)
Remove File from parent Dir::files and store new version of SD::data(encrypted(serialised(Dir)))

Dir-Versioning:
To add a file:

Add FileMetadata to parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

To modify a file:

Update FileMetadata in parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

To delete a file:

Remove FileMetadata from parent Dir::files and store new version of SD::data(Vec<encrypted(serialised(Dir))>)

Edit: Forgot that SDs don’t handle versions again, so changed SD::data(encrypted(serialised(Dir))) to SD::data(Vec<encrypted(serialised(Dir))>) for dir-versioning list.
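Tallying the steps exactly as listed above, as a small sketch (the enum and function names are made up for illustration; the counts are just the number of SD store/delete steps in each list):

```rust
#[derive(Clone, Copy)]
enum Scheme {
    FileVersioning,
    DirVersioning,
}

#[derive(Clone, Copy)]
enum Op {
    Add,
    Modify,
    Delete,
}

/// Number of SD mutations per operation, counted from the lists above.
fn sd_mutations(scheme: Scheme, op: Op) -> u32 {
    match (scheme, op) {
        // New file SD, plus a new version of the parent Dir SD.
        (Scheme::FileVersioning, Op::Add) => 2,
        // Only the file's own SD gets a new version.
        (Scheme::FileVersioning, Op::Modify) => 1,
        // Delete the file SD, plus a new version of the parent Dir SD.
        (Scheme::FileVersioning, Op::Delete) => 2,
        // Every op is a single new version of the parent Dir SD.
        (Scheme::DirVersioning, _) => 1,
    }
}
```

By this count an add/modify/delete cycle costs 5 SD mutations under file-versioning and 3 under dir-versioning.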