Public database questions

This will be a very non-technical post, but please bear with me.

Here is the theoretical situation.
•The first user uploads a list of files to the app (and pays for the PUTs)
•The app will store these files in its database
•The app scans the second user's files against its DB before upload, to see if the files already exist, avoiding the PUT cost of matching files while still uploading novel files to the DB (and so on)
•When a new user has a file that already exists in the DB, they are given access to the file without having to upload it.

Some questions
•The original uploader should have permission to edit the file's name if they like (not the file itself, just the file name)
•Users who were given access to a matching file should also have permission to edit the file's name :grimacing:
•What kind of permission would the app need to achieve this?
•Can these files be shared between users?
•Upon initial upload, should these files be public? (Currently they are stored in a container that is encrypted by default.)
•Obviously this is a mutable data type, but by the sounds of it, shared mutable data is more for sharing one user's data between multiple apps, not between multiple users
•If one user changes the name of the file, is that reflected across all users? (Not good.) That assumes the name is stored in the same MD as the data; I need to dig a bit and find out if this is the case in my situation.

I'm not trying to make others do the hard work for me; I'm just a bit out of my depth here. I am currently reading up, but either way I think these answers could be helpful to more than just me. If this all works out, I think it has pretty big implications, not just for the use case I have in mind but for many more.

No need.

The APP can simply do the self-encryption itself, and as it does so it requests each chunk. That will tell the APP (and you) whether the file and/or the first part of it already exists.

The chunks are the database.

Self-encryption happens client side, done by the client software rather than the app, as far as I know. Immutable data is the only data type that has dedup, correct? These would be MDs. So what part of my post are you addressing?

If you have any questions to help me specify for you please do ask.

Yes, the DB would be composed of the chunks and their locations (data maps), but can they be shared between an endless number of users, etc.?

But an APP can use the library to run its own self-encryption.

Correct for obvious reasons.

For file storage. For MD it is pointless since you store data at a specific address for reasons beyond any duplication issues.


The location is a simple function of the hash of the chunk. So if you have two exactly identical chunks, they will always reside at the same address. No matter how a chunk was made, if it is the same then it has the same location.

So, for instance, if 2 files have the first 100 MB exactly the same, then the first 98 chunks will be exactly the same, and each of those first 98 chunks of each file will have the same location.

Thus the network chunk storage is the database. You simply do the self-encryption, and when you get a hit there is no need to upload the chunk; simply add it to the data map.

There will be a lot of files where the first portion is identical to that of other files, for instance a director's cut vs a viewer's cut. The viewer's cut might have the first half exactly the same as the director's cut, but differ thereafter.
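To make that concrete, here is a tiny Rust sketch of the content-addressing idea. It assumes the chunk address is simply a cryptographic hash of the encrypted chunk's bytes (SHA-256 via the `sha2` crate is used purely for illustration; the network's actual hashing and addressing details may differ):

```rust
// Sketch only: a chunk's address derived purely from its content.
// SHA-256 stands in for whatever hash the network actually uses.
use sha2::{Digest, Sha256};

/// The "location" of a chunk, computed from nothing but its bytes.
fn chunk_address(encrypted_chunk: &[u8]) -> Vec<u8> {
    Sha256::digest(encrypted_chunk).to_vec()
}

fn main() {
    let chunk_from_user_a = b"identical first chunk of some file".to_vec();
    let chunk_from_user_b = b"identical first chunk of some file".to_vec();
    let different_chunk = b"a chunk with different content".to_vec();

    // Byte-for-byte identical chunks always land at the same address,
    // no matter who produced them or when.
    assert_eq!(
        chunk_address(&chunk_from_user_a),
        chunk_address(&chunk_from_user_b)
    );

    // Any difference in content gives a different address.
    assert_ne!(
        chunk_address(&chunk_from_user_a),
        chunk_address(&different_chunk)
    );

    println!("identical chunks share one address; different chunks do not");
}
```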


@rob Effing brilliant. Okay, so as you can probably assume, this approach is meant to act as dedup for MD to reduce user costs. Using the self-encryption library is a fantastic idea, thank you. If I'm understanding this right, the chunks exist independently of the whole file or its name, so multiple users can have access to, say, all the same chunks of a file but still name it something different.

Honestly I think I overcomplicated it. So a user uploads a file.
A second user uploads the same file, and though they may mutate the file later and incur a PUT for that, the second user's initial upload is free of PUT cost as the chunks all matched. Am I on track here?

Also, is it at all possible to avoid the PUT cost? Even with deduped immutable data, I thought a user still paid the PUT cost even though the data already exists. I'm assuming this logic would apply to MD as well. So though uploading might be avoided if files were scanned and matched before upload, could we still save users money?

I believe this approach would also require local self-encryption before scanning, so files can be matched against self-encrypted files that were already uploaded. I wonder how that would pan out time-wise; it certainly could be justified if it saves bandwidth, network traffic, and money.

typo??

Yes

Only the datamap links them together

Yes

For immutable data (file storage) this means that new chunks are written to replace the changed part of the file and the datamap is updated.

Yes, it is possible to avoid “PUT” costs if the exact same (byte-for-byte) file content already exists. Do your own self-encryption and check each chunk. Only store the chunk if it does not exist. Build your own data map as you do this process.
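As a rough sketch of that flow (the self-encryption step and the network calls below are hypothetical placeholders rather than the real library or client API; only the check-before-store loop and the locally built data map are the point):

```rust
// Sketch of "self-encrypt, check each chunk, PUT only what is missing,
// and build your own data map as you go".
// `self_encrypt`, `chunk_exists` and `put_chunk` are hypothetical stand-ins;
// the real library/client calls have different names and signatures.

/// One encrypted chunk plus the content-derived address it would live at.
struct EncryptedChunk {
    address: Vec<u8>,
    bytes: Vec<u8>,
}

/// The locally built data map: the ordered chunk addresses needed to
/// reassemble the file. The user's display name for the file lives in their
/// own metadata (e.g. an MD they own), not here, so renaming never touches
/// the chunks themselves.
struct DataMap {
    chunk_addresses: Vec<Vec<u8>>,
}

/// Placeholder for the self-encryption step (provided by the library).
fn self_encrypt(_file: &[u8]) -> Vec<EncryptedChunk> {
    unimplemented!("use the self_encryption library here")
}

/// Placeholder network calls -- assumed, not a real API.
fn chunk_exists(_address: &[u8]) -> bool {
    unimplemented!("ask the network whether this address is already taken")
}
fn put_chunk(_chunk: &EncryptedChunk) {
    unimplemented!("pay the PUT and upload the chunk")
}

/// Store a file, skipping the PUT for every chunk that already exists.
fn store_with_dedup(file: &[u8]) -> DataMap {
    let mut data_map = DataMap { chunk_addresses: Vec::new() };

    for chunk in self_encrypt(file) {
        if !chunk_exists(&chunk.address) {
            // Novel chunk: this is the only case that costs a PUT.
            put_chunk(&chunk);
        }
        // Existing or novel, the chunk belongs in this file's data map.
        data_map.chunk_addresses.push(chunk.address);
    }

    data_map
}
```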

You cannot really do this for MD, as each MD has a purpose: to store data specifically for an application. So while two MDs at different addresses may contain the same information, they are there for different APPs/purposes, and trying to “dedup” them today may see it all go wrong when the original owner changes the MD. Then your data changes without you wanting it to.

Now comes the crunch

problems

  • You need to ask the network to check each chunk, increasing the time required as the file size increases
  • Even for one chunk, you increase the average time for the APP to store, since duplication would not be the normal situation for most users of the APP: wait for the check to pass/fail, then do the store.

I see the situation as being along the lines of compressing files before upload: even though people who compress their files can save upload time, it is still rare for the general public to compress their files before uploading to a file-sharing platform (Dropbox, Facebook, Google, etc.).

My opinion and guesstimate is that the average Joe is not going to waste time when uploading files of only a few chunks, which only cost a small fraction of a cent. I doubt many would, even if they are familiar with such an APP, due to the increased time it adds to using many APPs.

My opinion and guesstimate is that even for movies and other files >50 MB, the time required to do all this testing is going to exceed the desire to “just get it done” and pay the fraction of a cent to a couple of cents for large to very large files. The issue is magnified when the user has 100 or more such files to upload; the increased time could be a real pain and become an annoyance.

My opinion - the tightwads will likely do it even if it makes them late for work some days or the spouse goes to sleep while waiting. The wise will only do it for some large files, when they have the time to wait.

tl;dr

In summary, I feel the overall effect on the network will be that only a small portion (maybe even less than 1%) of large files, and very few if any small files, will utilise this method to save some “PUT” costs.

And dedup will still be effective, and the benefits to the network will still outweigh any extra load the APP may cause.

The decider will be affordability of PUT cost.

Would you do this to save a penny/cent per megabyte, for example? A tenth of a penny/cent? Etc.


Good point, gentlemen. I'm not sure the effort put in could justify the results. If storage is that cheap, then I would also argue no, though you're correct in assuming that in general what is being stored is a large library of files, so more than average. That is why I thought it could perhaps be a worthy cause. The time it takes to upload that many files is long (I've done it before), so if there were a simple and fast way to do this (nothing is ever that easy :sweat_smile:), then I thought it should be considered. Thank you for the discussion!

Search the community forum; there have been discussions in this area before. I believe tfa suggested there that the network could respond with whether a chunk exists without requiring it to be downloaded.

If this is indeed possible, then the time spent waiting to see if a chunk exists is reduced, but it is still going to be a significant factor in determining whether such an exercise is worthwhile.
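If such a query existed, the difference for the APP would look roughly like the two calls below; both method names are hypothetical and only sketch the contrast, not any confirmed client API:

```rust
// Two ways an APP could answer "does this chunk already exist?".
// Both method names are hypothetical; neither is a confirmed network API.
trait ChunkStore {
    /// Full GET: fetches the entire chunk just to learn that it exists.
    /// Proves existence, but costs the chunk's full download per check.
    fn get_chunk(&self, address: &[u8]) -> Option<Vec<u8>>;

    /// Existence probe (the kind of query suggested above): the network
    /// answers yes/no without transferring the chunk body, so each check
    /// costs roughly one round trip instead of a full download.
    fn chunk_exists(&self, address: &[u8]) -> bool;
}
```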

