XOR-URLs checksum

bochaco · May 24, 2021, 3:36pm

I’ve created a very basic checksum implementation for our XOR-URLs, roughly a simplified version of the checksum mechanism used for BTC addresses. I’m highlighting it here to hopefully get some more eyes on it, and I’d like to hear from anyone who has ideas or suggestions about this:

Since we’d also be using XOR-URLs for SafeKeys, Wallets, etc., having a checksum seems very much needed.

dirvine · May 24, 2021, 3:45pm

Looks fine, but the hash of the message itself is the ultimate checksum. Is this to reduce the size by just taking the first 4 bytes? If so, we may have collisions, but here that may not be so important. A collision, by the way, would be a bad message passing the checksum because the first 4 bytes of its hash match.

happybeing · May 24, 2021, 3:50pm

I haven’t looked into it, but a four-byte (32-bit) checksum seems overkill for a 256-bit address, or whatever number of bits an XOR-URL decodes to (I’m not sure how much metadata it includes, if any).

bochaco · May 24, 2021, 3:56pm

Yes, these 4 bytes are appended to the end of the xorurl. BTC addresses do this: they first hash the public key twice (sha256 + ripemd160), which gives what’s called the public key hash. On top of that they hash it twice more and take the first 4 bytes as the checksum, which is appended to the pk hash string, and that becomes the BTC address.
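
To make the mechanism concrete, here is a minimal Python sketch of a Bitcoin-style truncated double-hash checksum over an arbitrary byte payload. It is not the sn_url implementation: the function names, the choice of double SHA-256, and the 36-byte example payload are all assumptions for illustration.

```python
import hashlib

CHECKSUM_LEN = 4  # bytes; the length being discussed in this thread


def checksum(payload: bytes) -> bytes:
    """First CHECKSUM_LEN bytes of SHA-256(SHA-256(payload))."""
    inner = hashlib.sha256(payload).digest()
    return hashlib.sha256(inner).digest()[:CHECKSUM_LEN]


def append_checksum(payload: bytes) -> bytes:
    """Return the payload with its checksum appended at the end."""
    return payload + checksum(payload)


def verify_checksum(data: bytes) -> bool:
    """Split off the trailing checksum and compare it with a fresh one."""
    if len(data) <= CHECKSUM_LEN:
        return False
    payload, tail = data[:-CHECKSUM_LEN], data[-CHECKSUM_LEN:]
    return checksum(payload) == tail


# Hypothetical 36-byte payload standing in for the decoded XOR-URL bytes.
payload = bytes(range(36))
encoded = append_checksum(payload)
print(len(encoded), verify_checksum(encoded))

# Flip a single bit to simulate a typo or transmission error.
corrupted = bytearray(encoded)
corrupted[5] ^= 0x01
print(verify_checksum(bytes(corrupted)))
```

The checksum is computed over the bytes before they are text-encoded, so a single mistyped character in the encoded string changes the decoded bytes and the comparison fails except in the rare collision case.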

This is exactly what I wasn’t sure of: what would be a good number of bytes for it? The xorurl would be up to 44 bytes at this point: sn_url/src/lib.rs at master · maidsafe/sn_url · GitHub

happybeing · May 24, 2021, 4:08pm

I’m not sure, as it seems like something requiring expertise! My thought process, which might not be the way to work this out, would be to ask what the intent is here. For example:

what are the likely sources of error: type (e.g. flipped bit, mistyped character, repeated character, missed character) and their frequency?
how important is it to prevent the situation of a corrupt URL that passes checksum, or perhaps to decide what probability of this happening is acceptable?

You could then try to work out how many bytes are needed for any particular calculation to achieve this based on a range of error type/frequency scenarios.
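
As a rough starting point for that kind of calculation, here is a small Python sketch. It rests on one simplifying assumption: a corrupted URL produces an effectively random checksum, so an n-bit checksum is passed by accident with probability 2^-n. That is reasonable for a truncated hash but ignores error type and frequency, and it does not model codes that guarantee detection of a bounded number of errors. The function names are hypothetical.

```python
def false_accept_probability(checksum_bits: int) -> float:
    """Chance that a random corruption still passes an n-bit checksum."""
    return 2.0 ** -checksum_bits


def min_checksum_bits(target: float) -> int:
    """Smallest checksum size (in bits) whose false-accept chance is <= target."""
    bits = 0
    while false_accept_probability(bits) > target:
        bits += 1
    return bits


for n_bytes in (1, 2, 3, 4):
    p = false_accept_probability(8 * n_bytes)
    print(f"{n_bytes} byte(s): about 1 in {round(1 / p):,}")

# Bits needed to keep accidental acceptance below one in a billion.
print(min_checksum_bits(1e-9))
```

Under this model the answer depends only on the acceptable failure probability, not on the length of the URL being protected.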

I’m not sure how easy it is to work that out or how reliable such calculations are. It feels a bit like rolling your own cryptography :crazy:, so depending on how critical the goal is it may be better to just look for techniques and guidance on how effective they are, and find one which achieves the aim.

I guess you know all this though

dirvine · May 24, 2021, 11:52pm

4 bytes seems fine. I wonder though:

    // XOR-URL encoding format (var length from 36 to 44 bytes):
    // 1 byte for encoding version
    // 2 bytes for content type (enough to start including some MIME types also)
    // 1 byte for SAFE native data type
    // 32 bytes for XoR Name
    // and up to 8 bytes for type_tag

What if the type tag were not there and the content type covered it all instead? Just thinking of optimising this.
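
To show what dropping the type tag would save, here is a hedged Python sketch of the byte layout listed in the comment above. The field order and sizes come from that comment. Everything else is an assumption for illustration and not taken from sn_url: the `XorUrlFields` name, big-endian byte order, and encoding the tag by trimming leading zero bytes.

```python
from dataclasses import dataclass

XORNAME_LEN = 32
MIN_LEN = 1 + 2 + 1 + XORNAME_LEN  # 36 bytes when no type_tag bytes are present
MAX_LEN = MIN_LEN + 8              # 44 bytes with a full 8-byte type_tag


@dataclass
class XorUrlFields:
    version: int       # 1 byte, encoding version
    content_type: int  # 2 bytes
    data_type: int     # 1 byte, SAFE native data type
    xorname: bytes     # 32 bytes
    type_tag: int = 0  # 0 to 8 bytes once encoded


def encode(f: XorUrlFields) -> bytes:
    if len(f.xorname) != XORNAME_LEN:
        raise ValueError("XoR Name must be 32 bytes")
    # Assumption: leading zero bytes of the tag are dropped to save space.
    tag = f.type_tag.to_bytes(8, "big").lstrip(b"\x00")
    return (
        f.version.to_bytes(1, "big")
        + f.content_type.to_bytes(2, "big")
        + f.data_type.to_bytes(1, "big")
        + f.xorname
        + tag
    )


def decode(raw: bytes) -> XorUrlFields:
    if not MIN_LEN <= len(raw) <= MAX_LEN:
        raise ValueError("expected between 36 and 44 bytes")
    return XorUrlFields(
        version=raw[0],
        content_type=int.from_bytes(raw[1:3], "big"),
        data_type=raw[3],
        xorname=raw[4:36],
        type_tag=int.from_bytes(raw[36:], "big"),
    )


no_tag = XorUrlFields(1, 0, 2, bytes(range(32)))
with_tag = XorUrlFields(1, 0, 2, bytes(range(32)), type_tag=1100)
print(len(encode(no_tag)), len(encode(with_tag)))
```

With this layout, removing the tag would fix the length at 36 bytes, and a 4-byte checksum would bring it to 40.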

The other discussion is type tags or usable setting tags. I wonder if somehow this is gonna get corrupted and perhaps we should set them in code? The issue is we can not check the tag type so perhaps it needs more discussion there?

I have a feeling content type may do it, or perhaps some way to register a tag with a specific format that cannot be changed or collide. That would require network agreement though, and that’s an interesting concept.

Perhaps a global Register Tag, in a merkle register (at address 000000000…)? So all tags are linked with a DSL/struct that must validate? Users pay X SNT to register such types (tiny amount though)? Some ideas (I am arguing with myself here and going for type tag)

All sections must store this register and all registrations are against a Dev key
The types can be used for PtD and perhaps also PtP? (easy to calculate these in this way)
Registering a new type will require the dev to write this to every section (this is something we need a pattern for)

Anyway rough thoughts

mav · May 25, 2021, 3:56am

The checksum is to ensure the url does not have typos or errors. The checksum is a hash of the url not of the content at that url.

If there are typos in the url then the checksum has a very low chance of matching and an error can be shown, eg ‘do not send funds to this url since the checksum has failed and there’s probably a typo or transmission error’.

BIP39 has 1 bit of checksum for every 32 bits of data. So a 12-word mnemonic is 132 bits, which is 128 of data + 4 of checksum.
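
The BIP39 ratio can be checked with a short Python sketch. Per the BIP39 specification, the checksum is the first ENT/32 bits of the SHA-256 of the entropy, and each mnemonic word encodes 11 bits. The function names here are my own.

```python
import hashlib


def bip39_checksum_bits(entropy: bytes) -> str:
    """Return the BIP39 checksum as a bit string: first ENT/32 bits of SHA-256."""
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("BIP39 entropy must be 128 to 256 bits, in steps of 32")
    cs_len = ent_bits // 32
    digest = hashlib.sha256(entropy).digest()
    digest_bits = bin(int.from_bytes(digest, "big"))[2:].zfill(256)
    return digest_bits[:cs_len]


def mnemonic_word_count(entropy: bytes) -> int:
    """Each word carries 11 bits of (entropy + checksum)."""
    ent_bits = len(entropy) * 8
    return (ent_bits + ent_bits // 32) // 11


entropy = bytes(16)  # 128 bits of zeros, used purely as an example input
print(len(bip39_checksum_bits(entropy)), mnemonic_word_count(entropy))
```

For 128 bits of entropy this gives a 4-bit checksum and 12 words; for 256 bits it gives 8 bits and 24 words, far less checksum than the 32 bits proposed for XOR-URLs.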

It seems to me several fields of the url are acting as an equivalent of http headers Content-Type and Content-Encoding… not sure if it’s exactly true but that’s what comes to mind with this. Is the url an appropriate place for that metadata? I think it’s necessary to be there but it does seem a little strange to me for some reason.

The current state-of-the-art for checksum encoding (in bitcoin anyhow) is bech32m, well worth a look to understand why they moved to this encoding:

http://bitcoin.sipa.be/bech32/demo/demo.html

Not saying we should use bech32m but we should at least be aware of it and what problems it aims to solve.

Another interesting checksum is Ethereum’s, which uses capitalization as specified in EIP-55. But this won’t work with base64-encoded URLs, since letter case already carries data there.

Broadly, I feel that urls are so long anyway that another 4 bytes isn’t problematic. But 4 bytes does seem too much when we look at other standards.

bech32 addresses have a 30-bit checksum (6 checksum characters, each from an alphabet of 32 characters, which is 5 bits per character). From BIP173:

“We pick 6 checksum characters as a trade-off between length of the addresses and the error-detection capabilities, as 6 characters is the lowest number sufficient for a random failure chance below 1 per billion. For the length of data we’re interested in protecting (up to 71 bytes for a potential future 40-byte witness program), BCH codes can be constructed that guarantee detecting up to 4 errors.”
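
For anyone who wants to see what that 6-character checksum looks like in code, below is a Python sketch that follows the reference algorithm from BIP173, with the bech32m constant from BIP350 as an option. The `safe` prefix and the example data symbols are hypothetical and only there to exercise the functions. This is not a proposal for how XOR-URLs should be encoded.

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # 32 symbols, 5 bits each
GENERATOR = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
BECH32_CONST = 1
BECH32M_CONST = 0x2BC830A3


def polymod(values):
    """BCH checksum state over a sequence of 5-bit symbols."""
    chk = 1
    for v in values:
        top = chk >> 25
        chk = ((chk & 0x1FFFFFF) << 5) ^ v
        for i in range(5):
            if (top >> i) & 1:
                chk ^= GENERATOR[i]
    return chk


def hrp_expand(hrp):
    """Mix the human-readable prefix into the checksum input."""
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]


def create_checksum(hrp, data, const=BECH32_CONST):
    """Return the 6 checksum symbols (30 bits) for hrp + data."""
    pm = polymod(hrp_expand(hrp) + data + [0] * 6) ^ const
    return [(pm >> 5 * (5 - i)) & 31 for i in range(6)]


def verify_checksum(hrp, data_with_checksum, const=BECH32_CONST):
    return polymod(hrp_expand(hrp) + data_with_checksum) == const


hrp = "safe"             # hypothetical prefix
data = list(range(20))   # hypothetical payload of twenty 5-bit symbols
full = data + create_checksum(hrp, data, BECH32M_CONST)
print("".join(CHARSET[s] for s in full))
print(verify_checksum(hrp, full, BECH32M_CONST))

# Substitute one symbol: a single error like this is always detected.
typo = list(full)
typo[3] ^= 1
print(verify_checksum(hrp, typo, BECH32M_CONST))
```

Unlike a truncated hash, this checksum gives guarantees for small numbers of substitutions rather than only a probabilistic bound, which is part of why it gets by with 30 bits.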