File UID as sha256sum?

Does the network already hold some unique checksum on a file?..
and by this I mean not a unique compound of [file and owner or location] but just what the file is; so, the equivalent of what sha256sum achieves.

While it is not appropriate for the network to have any opinion, it is appropriate for the network to empower users, and a readily available one-off checksum could have a number of benefits for users knowing what is available on the network.

:+1: One benefit of having sha256sums would be to more easily spawn discovery services offering users real choice about content relative to their interests, without the duplicate effort of services both downloading files and then calculating sums. Whether those services are public services or just some business doing its own indexing, I wonder whether this would prove much faster - reducing GETs and the hassle of databasing the detail. The opinion on what the file is, is not for the network; but that it exists, is for the network.

:+1: Another benefit, if a file is known to exist, is that there might be no need to upload it again, at a cost - be that money or time/effort/energy… with some dependency perhaps on who owns the file, if there's a risk of its being removed; but that depends on a user's interest… if some shared domain has the file, or the file is known to be permanent and public.

:+1: Another benefit to the network, perhaps, is that it would reduce GETs from that interest of just indexing/servicing files, if GET-file-attribute is a minor action. Equally, a check on whether a file has changed since the last pass would be faster… noting this as I wonder at the volume of traditional internet traffic that is just Google's indexing… let alone all the other search engines and bots.

and for clarity, I'm not suggesting the network holds any list of these, rather that it sits as an attribute of the file, discoverable by those aware of and with access to the file.

Simple idea then, offset by the thought that perhaps it's trivial to implement the difference… :thinking:

This obviously would require an untamperable, network-confirmed stamp at creation, rather than just another r/w field.

If the network does what it can to enable users to do what is for them to do, that stands in contrast to normal traditional offerings, which try to maximize the effort and cost that users must absorb.

:beer:

I think what you're hinting at is actually one of the cornerstones of the SAFE Network: Self Encryption and Immutable Data. A file is split into chunks. The address of each chunk is determined by the hash of its contents. These addresses are put into a map. So, the second time the same file is uploaded it will not result in duplicate data.
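To illustrate the content-addressing part of that idea - a rough sketch only, not the actual self-encryption code (which also encrypts each chunk using its neighbours' hashes), and assuming the sha2 crate for hashing:

```rust
// Sketch of content addressing; chunk size and map format are illustrative.
use sha2::{Digest, Sha256};

const CHUNK_SIZE: usize = 1024 * 1024;

fn hash_hex(bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(bytes);
    hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect()
}

/// Toy "data map": the ordered chunk hashes, plus the hash of that list,
/// which serves as a deterministic address for the whole file.
fn data_map(content: &[u8]) -> (Vec<String>, String) {
    let chunk_hashes: Vec<String> = content.chunks(CHUNK_SIZE).map(hash_hex).collect();
    let map_hash = hash_hex(chunk_hashes.join("").as_bytes());
    (chunk_hashes, map_hash)
}

fn main() {
    // Same bytes always produce the same chunk hashes and the same map hash,
    // which is why a second upload of the same file adds no new data.
    let content = std::fs::read("example.bin").expect("read file");
    let (chunks, address) = data_map(&content);
    println!("{} chunks, address {}", chunks.len(), address);
}
```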

2 Likes

Yes, but is it visible as a simple hash… as a file attribute?.. it's part of the process… but accessible… and as some standard like sha256sum??

It might be an obvious question… I don't know.

1 Like

@davidpbrown I think what you're describing is the XorUrl system. Each URL is unique to a piece of data, calculated as the xor address of the data map that is the product of self-encryption (along with some other data).

Thus, the same data re-uploaded will have the same self-encryption result and map, stored in the same place on the network (ie. de-duped). So this XorUrl can be used to the same effect as a calculated hash.

2 Likes

I think there's some subtlety here. @davidpbrown seems to be talking about the hash of the whole file.

I'm not sure, but I don't think that is available for files on SAFE; the data itself is hashed, though, and the file is guaranteed to be correct based on those hashes.

The datamap is hashed, and is the 'key' to assembling a file from the chunks - I think the map contains the hashes of the chunks. So the data map is verified against its hash, and each chunk is verified against its chunk hash.

Is that correct?

Does that help, David? If you really wanted hashes for the whole file you could calculate them yourself and store them explicitly, for example as metadata alongside the xor address of the file, or in a separate location, etc.
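A minimal sketch of that suggestion - the FileRecord shape and field names here are hypothetical, purely an app-side convention, and it assumes the sha2 crate:

```rust
use sha2::{Digest, Sha256};

/// Hypothetical app-side record pairing the address returned by an upload
/// with a conventional whole-file checksum calculated locally.
struct FileRecord {
    xorurl: String, // returned by the network when the file is uploaded
    sha256: String, // whole-file hash, computed by the uploader
    size_bytes: u64,
}

fn sha256_hex(bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(bytes);
    hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect()
}

fn make_record(path: &str, xorurl_from_upload: &str) -> std::io::Result<FileRecord> {
    let bytes = std::fs::read(path)?;
    Ok(FileRecord {
        xorurl: xorurl_from_upload.to_string(),
        sha256: sha256_hex(&bytes),
        size_bytes: bytes.len() as u64,
    })
}

fn main() -> std::io::Result<()> {
    // "safe://example" is a placeholder for whatever URL the upload returned.
    let record = make_record("photo.jpg", "safe://example")?;
    println!("{} -> sha256 {} ({} bytes)", record.xorurl, record.sha256, record.size_bytes);
    Ok(())
}
```

Such a record could be published wherever suits the app - alongside the file or in a separate index - so conventional sha256 tooling still has something to match against.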

2 Likes

I think for the benefits @davidpbrown names, the Data Map serves the same purpose as a file hash. They are both deterministic algorithms that result in a unique identifier per file. So, as @joshuef suggests, the XOR URL seems like what is being described here. Or at least it fulfills the benefits named.

2 Likes

Yes…

I'm not looking for a 1:1 hash of a file on the network; the talk of Data Maps and XorURLs, I wonder, is about the compound of [owner.file].

I'm looking for a 1:1 hash of the file… and one that the network confirms.

but that's a bit traditional…
=> Someone downloads the entire content of the network, at the cost of the time that takes; hashes the files; perhaps indexes them and provides a service. They become Google and dominate everyone's perception.

What I'm wondering about is the wisdom of the crowd - a fluid perception that sums to something significant.

With something simple like the standard sha256sum, all sorts of services could arise quickly and easily to cater for filtering and discovery. The benefit is that anyone could contribute an opinion about files - detail or category or whatever attribute. Those opinions, made available, could fuel services that put a GUI or filter controller in place to enable the user.

If many people I subscribe to and trust have an opinion about a couple of files
4f41a82c20807a233431f7bea14e01eb99d63c2d713790393ccca46518d5ca58
a2a96f7b1483d24e22e0efa0e3b374c1ec618a8a2bec15a6bbaaf92c1b0e02c2
then the sum of those opinions is likely valid.

Without downloading, I could resolve that the first is a cat and the second is a tiger, and for a fear of tigers, I avoid that one.

If those are suggested as answers to a query, they could be tiny files or they could be huge files… but the data point is the same size.
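A rough sketch of what that "sum of opinions" could look like on the client side, with hypothetical labels and truncated hashes purely for illustration:

```rust
use std::collections::HashMap;

/// (file checksum, label) pairs published by sources the user subscribes to.
type Opinion = (String, String);

/// Count how many trusted sources applied each label to each file hash.
fn tally(opinions: &[Opinion]) -> HashMap<String, HashMap<String, u32>> {
    let mut counts: HashMap<String, HashMap<String, u32>> = HashMap::new();
    for (hash, label) in opinions {
        *counts
            .entry(hash.clone())
            .or_default()
            .entry(label.clone())
            .or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Hypothetical opinions gathered from subscribed individuals or groups.
    let opinions = vec![
        ("4f41a8...".to_string(), "cat".to_string()),
        ("4f41a8...".to_string(), "cat".to_string()),
        ("a2a96f...".to_string(), "tiger".to_string()),
    ];
    // A client can filter or rank results on these counts without ever
    // downloading the files themselves.
    for (hash, labels) in tally(&opinions) {
        println!("{hash}: {labels:?}");
    }
}
```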

51.8 percent of all Internet traffic comes from bots, while only 48.2 percent of internet traffic comes from humans

The premise is that people need to know what is available, and they need a route to discovery. They want the option for that to be relevant to their interests, and potentially to exclude what is known not to be of interest.

With SAFE, the network is trusted; so, the network can confirm a sha256sum.
The traditional internet has no option for the network to confirm anything.

My reaction to this half-baked idea was that it's simple; it's fast; it is potentially an inherent part of what is done on upload anyway - so noting it as a file attribute is trivial; and it could be better than traditional solutions, where we now expect one monster to do all the legwork… and necessarily one or two big solutions come to dominate and control access and perception and ranking and rating…

Also, if you were trying to provide more traditional indexing of content, that becomes a lot faster if you are not downloading what is already known but just noting where it is available.

XorURL listing would be about proof of existence of single instances, in a more exacting way.
sha256sum as a file attribute would be about linking into other knowledge, with that hash as a UID reference point.

As is to be expected, perceptions would evolve, but that's an aside and easily manageable, perhaps via group subscriptions that equally are fluid and trusted or not.

Celebrity opinions better than the local Joe I know?.. :thinking:

1 Like

I'm not sure I understand what you are thinking, so here's a question. If the xor address of a file effectively guarantees the way to access the file and that the content has not changed, can you use that instead of a hash - and if not, why not?

1 Like

Because, I'm expecting, there will be many xor addresses for one actual file that has been uploaded many times.

Drawing on data about what is known for one file needs a UID for the file… otherwise you are limited to that one xor address.

Unless I'm confused about the xor address… but I'm thinking that it is unique relative to an owner.

Many people might upload
4f41a82c20807a233431f7bea14e01eb99d63c2d713790393ccca46518d5ca58
but everyone should be able to quickly see that every instance is the same actual file.

In reality there might be little difference… but I'm just thinking that indexing images and video becomes a lot simpler than invoking some AI recognition, where instead many people can sum opinions on what the image is… a back-to-front captcha. Otherwise you have to interpret every file, rather than recognising a file that you know.

1 Like

There won't be.

The XorUrl for a published file (immutable data) is completely deterministic on its content. If you upload file x, and I do too, we both get the same xorurl. We'll both pay for the upload, but the content will be deduplicated.

Everything you're looking for above can be achieved via the XorUrl, with the added bonus that it's instantly navigable.
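A toy sketch of that deduplication behaviour - the address function below is just a stand-in, not how the network actually derives an XOR address:

```rust
use std::collections::HashMap;

/// Toy content-addressed store: the key depends only on the bytes, so a
/// second upload of identical content stores nothing new.
struct Store {
    chunks: HashMap<String, Vec<u8>>, // address -> stored content
    payments: u32,                    // each upload attempt is still paid for
}

impl Store {
    fn new() -> Self {
        Store { chunks: HashMap::new(), payments: 0 }
    }

    /// Stand-in address derivation; the real network hashes the self-encrypted
    /// data map, but any deterministic content-only function shows the idea.
    fn address(content: &[u8]) -> String {
        let h = content
            .iter()
            .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64));
        format!("{:016x}", h)
    }

    fn upload(&mut self, content: &[u8]) -> String {
        self.payments += 1; // both uploaders pay...
        let addr = Self::address(content);
        self.chunks
            .entry(addr.clone())
            .or_insert_with(|| content.to_vec()); // ...but the data is stored once
        addr
    }
}

fn main() {
    let mut store = Store::new();
    let a = store.upload(b"same file contents");
    let b = store.upload(b"same file contents");
    assert_eq!(a, b); // same content, same address
    println!("copies stored: {}, payments: {}", store.chunks.len(), store.payments);
}
```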

2 Likes

Great.

So, do XorUrls exist only for files that have been uploaded?.. or is there a tool that could guess what they would be?

1 Like

No, the xor address is unique to the file because it is effectively the hash of the datamap, which is unique to the file. I think this is the source of confusion!

If I upload a file and you upload the same file, your upload doesn't actually create anything new - the datamap and the file are immutable and so can be shared. If you upload a new version of the file, that has a different xor address (and again will be the same for anyone uploading the same version of the file).

3 Likes

At the moment, no. Though I think in the end the CLI/API with a --dry-run option would serve for this.

2 Likes

Yeah, this is a good idea. We cannot do it right now as the SCL doesn't allow for a dry run, but I think it's not too difficult to have SCL support a dry run which can return just the datamap hash without uploading the chunks. The CLI --dry-run currently shows an empty URL, so this would make perfect sense as the way to get a xorurl from a file without uploading it, which could help apps in the future to save costs when uploading files.
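Conceptually a dry run only needs local work - something like this sketch, where a whole-file hash stands in for the real datamap hash, the chunk size is illustrative, and the sha2 crate is assumed:

```rust
use sha2::{Digest, Sha256};

/// Hypothetical dry run: derive the address a file would get, and how many
/// chunks an upload would involve, without touching the network.
fn dry_run(path: &str) -> std::io::Result<(String, usize)> {
    const CHUNK_SIZE: usize = 1024 * 1024; // illustrative
    let bytes = std::fs::read(path)?;
    let chunk_count = (bytes.len() + CHUNK_SIZE - 1) / CHUNK_SIZE;
    let mut hasher = Sha256::new();
    hasher.update(&bytes);
    let address: String = hasher
        .finalize()
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect();
    Ok((address, chunk_count))
}

fn main() -> std::io::Result<()> {
    let (address, chunks) = dry_run("video.mp4")?;
    // An app could GET this address first and skip the upload (and its cost)
    // if the content already exists on the network.
    println!("would upload {chunks} chunks to address {address}");
    Ok(())
}
```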

3 Likes

Yes, it would serve, but it doesn't really express the intention. Wouldn't it be better and more easily remembered / communicated / documented with something like

safe xorurl /path/to/file

This also helps with discovery when calling safe --help, rather than having to dig all the way to safe files put --help, where you need to spend the mental effort learning that the --dry-run flag is going to have the desired side effect of printing the xorurl. Nobody will go to that effort. They will use Google. Then forget it tomorrow and use Google again.

Maybe this functionality doesn't deserve a 'top level' command of its own, but imo it does, since every client should be checking whether the file exists before doing safe files put. And how to check? You need the xorurl of the local file first. So this will be needed a lot, I think.

Another option may be a safe ls or safe tree command (same as the unix commands) which takes a location as either a local path (i.e. not starting with safe://) or a network xorurl path (i.e. starting with safe://, e.g. the root directory would use safe ls safe:///). This could output the xorurls for either local or remote files. But this has similar issues with discovery and having to 'read the flags documentation' to discover the desired side effect.

4 Likes

+1 for "safe ls" and "safe tree", or for consistency perhaps they should be "safe files ls" and "safe files tree".

I was surprised the first time I did a 'safe files put' and did not see a corresponding ls. Then I rationalized that 'safe dog' on a container is more or less an 'ls'. Except that, how does one list all the containers?

But with safe ls, does dog still serve a purpose?

Or said another way, it seems like safe ls is really just safe dog with no arguments. Personally I'd rather see it renamed to ls for consistency with unix. Or it could be "safe files list" to be most obvious for those not used to unix commands.

A version of the du command would be handy also, to find how much disk space a tree uses (maybe safe files diskusage).

Just thinking out loud here… :cowboy_hat_face:

4 Likes

Just to note, my thinking on this is from the direction of what gifts power to the users… which is a question that needs a lot more thought for a network that will allow anything.

The network must have no opinion, but users might want tools that acknowledge their own; and we need to acknowledge the half of the bell curve that thinks differently; cater for all users; and not fear that they will not learn the difference over time - opinions change.

One route to empowering choice, then, is noting others' opinions - the sum of other opinions is a powerful tool… be they individuals or groups such as organizations.

I wonder whether the argument for safe xorurl /path/to/file above is strong enough on its own, but it becomes powerful as a tool against those who challenge, at any point in the future, that SAFE holds all sorts of undesirable content. The response to that challenge can be: "Define: undesirable - offer up your opinion, and users can take note of that". That response, I wonder, could be fair to individuals or even nation states… so, if you have a problem as the UK state or as China - in part or whole - then offer an opinion that people can subscribe to.

So, with an offline tool there would be no contention that resolving the hash requires uploading content that is distasteful. An authority - taking the extreme example of a State - could cast an opinion without making what it considered an error by uploading what it disagrees with; it would just note the xorurls that are its opinion.

So, a user might opt in to base opinion on a defined group or individuals they trust, and the sum of those defends the perimeter of the visibility of content. The normal internet cannot lend a hand and requires 3rd-party services; SAFE could potentially have browsers that more easily acknowledge opinion and whitelist content. Something analogous to a virus checker.

There is a bit of extra work if xorurl is needed rather than a common standard… but why not have xorurl displace sha256sum. The counter-argument perhaps is that new tech should not presume the future, but equally it's fair that SAFE is not adding extra complexity, where a fair alternative works well for the actual way the network works.

The only downside to this thought, I wonder, was that a certain browser might be required for all users, but the fallback is that time beats all enemies.

tl;dr: safe xorurl /path/to/file solves all sorts of problems :+1:

1 Like

The dog is to give more information around metadata and resolution of URLs and links rather than about the content itself. The cat currently gives you something close to an ls when your URL targets a FilesContainer location. So perhaps it's time to start making a clearer and more marked distinction among these commands, as you all seem to be suggesting; how about we go down this path:

  • safe cat: simply shows the content found after the URL has been fully resolved, i.e. if it's an ImmD then the data is printed out on stdout; if it's a FilesContainer or Wallet it shows the content in a table, as it does now. So just like it works now.
  • safe dog: shows details and information about the resolution process applied to the URL provided, showing which NRS was resolved (if any), any path resolved if the URL contained a path and it went through some FilesContainer, and finally showing metadata about the targeted content, like data type, xorname, etc. (this is basically what the dog was intended to be)
  • safe files ls: creates an abstraction over the FilesContainers, exposing the same options/flags as a traditional ls command would on any file system, allowing you to see which are files and which are subfolders, and also resolving paths and showing the content of only that path and not the entire FilesContainer (note this is already implemented in the cat command, so we can replicate that behaviour or move it and only make it possible with files ls).
  • safe files tree: a nicer/different pretty-print of safe files ls, or simply a flag of the ls: safe files ls --tree <path>
  • safe xorurl: an alias to do a dry-run on files
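For illustration, a stubbed sketch of how that command grouping could hang together at the CLI surface - plain argument matching rather than the parser the safe CLI actually uses, with the behaviour only described, not implemented:

```rust
use std::env;

// Stubbed dispatch for the command grouping proposed above; behaviour is only
// described in the println! messages, not implemented.
fn main() {
    let raw: Vec<String> = env::args().skip(1).collect();
    let args: Vec<&str> = raw.iter().map(String::as_str).collect();
    match args.as_slice() {
        ["cat", url] => println!("print the fully resolved content of {url}"),
        ["dog", url] => println!("show URL resolution steps and metadata for {url}"),
        ["files", "ls", path] => println!("list files and subfolders under {path}"),
        ["files", "tree", path] => println!("pretty-print the hierarchy under {path}"),
        ["xorurl", path] => println!("dry-run: compute the XorUrl of local file {path}"),
        _ => eprintln!("usage: safe <cat|dog|files ls|files tree|xorurl> <target>"),
    }
}
```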
1 Like

safe files ls
safe files tree
just seems like an OCD trigger! :grimacing:

Surely those should be
safe ls files
safe tree files

@davidpbrown to me it makes sense to group all the file-related commands together under safe files. Presently cat and dog are exceptions to this… I'm not sure what the rationale for that is; maybe @bochacho can tell us.

Since file commands are so commonly used, it could be that at install time, safe files gets aliased to something like sf. So then the actual command typed would be sf ls. Just a thought.