[RFC Discussion]: XOR-URLs

Okay - last post on this matter here - you suggested that I should open a pull request for the Python implementation of base32z because it’s different from the one in JS. (Which I didn’t do, because I really don’t like this random encoding and would prefer that nobody in this world used it and nobody got motivated to use it…)

If I opened this pull request and it was accepted, all Python programs using base32z which have stored data would immediately lose that data… Because even the self-describing CID would in both cases say base32z, but you wouldn’t know whether it’s the old or the new generation…

So by offering base32z resolution in the libs you can say you offer ‘the same link representation independently of the programming language’s base32z implementation’ - but how do you react if the JS/Rust implementation of base32z changes? (It’s no standard, this obviously happened already, and it could happen again.) Then suddenly either you implement your ‘not official’ version of base32z and are using an unnamed custom encoding (working with ‘invalid CIDs’), or all links on SAFE sites stop working…

PS: but maybe there was a very good reason not only to leave out L, V and 2, but also to do exactly this re-ordering/assignment of characters in base32z

base32:  a b c d e f g h i j k l m n o p q r s t u v w x y z 2 3 4 5 6 7
base32z: y b n d r f g 8 e j k m c p q x o t 1 u w i s z a 3 4 5 h 7 6 9
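The mapping above is a pure character substitution between two 32-symbol alphabets, so in Python a converter is a one-liner with `str.translate` (alphabets per RFC 4648 and the z-base-32 spec; a sketch, not any official implementation):

```python
# Translate between standard base32 (RFC 4648, lowercased) and z-base-32
# by mapping characters at the same position in the two alphabets.
B32 = "abcdefghijklmnopqrstuvwxyz234567"  # RFC 4648, lowercased
Z32 = "ybndrfg8ejkmcpqxot1uwisza345h769"  # z-base-32 (Zimmermann spec)

to_z = str.maketrans(B32, Z32)
to_b32 = str.maketrans(Z32, B32)

print("abcdefgh".translate(to_z))   # -> ybndrfg8
print("ybndrfg8".translate(to_b32)) # -> abcdefgh
```

Note this only remaps characters of an already base32-encoded string; it doesn’t touch the underlying bit-group encoding, which is identical in both schemes.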

and nobody will feel the need to change it again ever :slightly_smiling_face:

I think it’s been checked against more than the JS one, but I’m not sure to be honest. (@bochaco do you know more there?). You may be right, perhaps the JS version is off and we should patch it. Worth checking for sure. Either way, whichever implementation is broken should be fixed, otherwise it could lead to problems for folk :+1: .

This is the only spec I can find: http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt (I’m not sure what the Python impl follows?). What would make it ‘standard’? Surely that depends on your favourite standards body, no? (Much as there seem to be a few different URL standards.) Interoperability is key.

If an implementation fails to meet the spec, then there’s an issue there… Same as if an implementation fails with base64 too. Similar consequences I’d imagine. But I’m sorry, I don’t see the possibility of bad implementations as an argument for or against any given spec.

All of which is beside the point, since with CID, you can use your favourite encoding.

If there’s a pressing need for many different encodings to be used via SCL, then perhaps the client lib implementation could allow for passing in your own functionality for the hashing function…?
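One way such a pluggable hook could look - purely a sketch, `encode_xor_url` and its signature are invented here for illustration and are not part of any actual SCL API:

```python
import base64
from typing import Callable

# Hypothetical client-lib hook: the caller supplies the encoding
# function; the lib only owns the byte layout of the address.
def encode_xor_url(xor_name: bytes, encoder: Callable[[bytes], str]) -> str:
    return "safe://" + encoder(xor_name)

# Usage: pick hex, base32, or anything else per call.
hex_url = encode_xor_url(bytes(32), lambda b: b.hex())
b32_url = encode_xor_url(
    bytes(32), lambda b: base64.b32encode(b).decode().lower().rstrip("=")
)
```

The point being: the lib wouldn’t need to bless any single encoding if the encoder is injectable.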

(Or in general: if the client lib API implementation isn’t something you like, then you’re free to not use it and implement a new API for CID creation. As long as it’s using CID, the URLs should be decodable, I believe.)

So far though, I personally still don’t see a compelling reason not to use z-base32 right now.

Do I understand that the CID encoding scheme leads to multiple possible URIs for the same resource?

If so, that seems undesirable and I think might go against the ‘standard web’. I think I posted a link to this address ages ago in a related discussion. I think we should follow web standards and conventions unless there’s a good reason not to, so need to consider this and examine any implications - if the answer to my question is yes!


No, I haven’t dug into the implementation of it at all; for sure we’ll need to double-check this with some automated tests whenever this becomes the official impl in SCL (I mean official in the sense of non-PoC), e.g. compare some hard-coded XORnames with the output encoded XOR-URL and make sure it’s the expected one. As you say, we can send PRs to whichever impl we find bugs in; this is where open source shows its powers, I believe.

That’s correct, but I don’t see a problem with that (my humble point of view); you also have more than one URL referencing the same resource if you create public name URLs. I can admit it also sounded bad to me at the beginning, but if you are using a URL, perhaps as part of a contract which references an ImmD, then that will be OK even if there are aliases or other URLs to the same resource.

On a side note, I’m trying to understand a bit more about several other aspects of the network, like the new AppendableData, to see if this is impacted somehow, or perhaps safecoin and how public keys for transferring safecoins could also be used and/or impact this RFC.

That’s the one and only reason for CID’s existence - to be able to represent the same sequence of bytes in the encoding of your liking (hex, base32, base32z, base2, base64, …). It adds information about the data that follows in front of it and then presents the string - no error detection, no other benefits… (That’s why I said it doesn’t make sense to exclude any data - like the type tag - from the CID, because that defeats the only reason one would use them…)
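To make that layout concrete: a CIDv1 is just a few self-describing bytes in front of the hash, and the final string is whatever multibase you pick. A rough sketch (the 0x55 “raw” codec and 0x12 sha2-256 codes are from the public multicodec/multihash tables; this is illustrative, not the SCL implementation):

```python
import hashlib

# CIDv1-style byte layout: <version><content-codec><multihash>, where
# multihash = <hash-fn-code><digest-length><digest>.
# Values below 0x80 encode as a single varint byte, so plain bytes suffice here.
def make_cid_bytes(payload: bytes) -> bytes:
    digest = hashlib.sha256(payload).digest()
    multihash = bytes([0x12, len(digest)]) + digest  # 0x12 = sha2-256
    return bytes([0x01, 0x55]) + multihash           # version 1, 0x55 = raw

cid = make_cid_bytes(b"hello")
# These same bytes can then be hex-, base32- or z-base-32 encoded,
# each rendered string carrying its own multibase prefix character.
```

The prefix bytes are exactly the “information about the data that follows” mentioned above; everything after them is just the digest.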


Which you wouldn’t even need to think of if you used base16 (hex) or base32…

Well - being widely used and clearly and reasonably specified (and therefore unlikely to change) - for example, hex is widely used and everyone knows which number is represented by which character (and it’s easy to read, easy to type with numpad + left hand on abcdef)… So, a standard… base32 takes a-z + 2-7 in the sequence you’d expect (alphabetic, then small to large)… base32z takes the alphabet (without L and V) + 1 and 3-9 and shuffles them around…

Okay - yes - unambiguously defined - that meets my requirement…

While this is still super subjective and random imho (and we are talking about more or less random bytes… There is no such thing as ‘more commonly occurring characters’…)

Not entirely true.

It allows us to string together a variety of information in a standardised fashion, which can be useful. (At least IPFS have found it to be useful.) It provides a means to change encodings/hash functions (giving us a way to improve URLs in the future), as you note, and provides more info to boot (mimetypes, e.g.).

You’re right though, maybe it doesn’t go far enough? Typetag certainly won’t work in the port number, eg.

I think I’ve not really followed your error detection point before. But aye, that could be an interesting property to have.

Unless there was a problem in an implementation…

True - while you could argue that using those types as an encoding function + adding the info about how you encoded the data has pretty much the same result as just adding this encoding function as an identifier, with the mimetype as bytes of the payload… So while you are of course correct, I don’t think the message of my statement is wrong… (And if you leave out the encoding option and just go with one - like with Mutable Data, or probably safecoin - it’s just encoded bytes, as I said.)

Yeah - right - sorry but thinking that even a small child would implement hex wrong is kind of ridiculous - while implementing base32z is obviously a very different beast

On the one hand, you argue for standardisation of one thing (base32), on another, you argue to create our own version of the well tested CID system.

I understand that bugs are frustrating. But one bad implementation doesn’t negate the benefits of z-base32, IMO. Especially considering all that’s outlined above (re: API to avoid the need for going manual, human readability, etc.).

Equally, I don’t see a fleshed-out benefit to doing away with CIDs, which give us the flexibility to update the XOR scheme as we go in a standardised, tested, and community-supported fashion. If we have issues with it, or additions we want to make, we can put them forward and improve this standardised system for everyone. From @bochaco’s PR they seem very open to this.

Where did I do that? I suggested just including all relevant bytes in a base32-encoded CID

(or just use the bytes and encode them hex - my first suggestion and still my favourite one…)

Hey - but just go with what you decided on - people will use it, or complain/use a different way to share addresses - no reason to waste many hours just because we disagree on the importance of some aspects - we’ll just see and react to what happens

:+1: (Thanks. I couldn’t actually find where I’d read that.)

But aye, I think there’s certainly merit to getting typetags in there. And error correction could be useful too :+1: (did you have an example of a URL structure with that?).

That’s me for the night now. Gotta think about :taco: some :bowing_man:.


Hmm - actually I was just thinking about a simple 16-bit checksum for error detection - no error correction (because that would be at least 32 bits for single-bit error correction of one XOR name without any additional data… from memory… details on block codes can be found here: https://en.m.wikipedia.org/wiki/Block_code - but all that comes with additional complexity - not sure if it’s worth the hassle - just giving immediate feedback on error occurrence without network calls might be good enough, I would guess… And a wrong character is most probably more than one wrong bit, but a couple of them…)
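For concreteness, a detection-only 16-bit checksum could be as small as this (Fletcher-16 is picked here arbitrarily as an example; the post names no specific algorithm):

```python
# Detection-only 16-bit checksum (Fletcher-16) appended to a 32-byte
# XOR name, so a client can reject typos locally before any network call.
def fletcher16(data: bytes) -> int:
    s1 = s2 = 0
    for byte in data:
        s1 = (s1 + byte) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

xor_name = bytes(range(32))  # stand-in for a real XOR name
tagged = xor_name + fletcher16(xor_name).to_bytes(2, "big")

# On decode: recompute and compare; mismatch means a transcription error.
body, check = tagged[:-2], int.from_bytes(tagged[-2:], "big")
assert fletcher16(body) == check
```

This only detects errors (and not all of them); correcting them would indeed need a proper block code with correspondingly more redundancy.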

Part of the confusion or disagreement, I think, has to do with your suggestion @riddim to encode several other things in the CID, like checksum and typetag, while claiming that it would still be a CID. That’s what I believe is incorrect; the CID spec is very specific about what goes in each part of it:

So if you, let’s say, put concat(<xor addr>, <type tag>, <checksum>) as the string for the <multihash-content-address> part, then what you have, strictly speaking, is not a CID anymore - as simple as that. It’s another type of content id we define using the multiformats and multicodecs; this is what I think @joshuef means by “create our own version of the well tested CID system”.

Am I against creating our own CID so we can incorporate the type tag and possibly a checksum (which makes sense to me only if we can extend XOR-URLs to use them for safecoin wallets)? No, not necessarily. Yes, I was trying to avoid it if possible, so as not to come up with my own (non-standard) encoding spec. But if we have to, and we do it, then:

  • We shouldn’t claim we have a CID anymore but our own-CID spec, and therefore some of the implementations already available would need to be forked and adapted to our own-CID spec (which is not a big deal, since we will do it anyway by embedding it in the SCL API)
  • We may want to actually propose that to the multiformats project as an enhancement to the CID, and work with them on having it become part of the CID spec maintained in that project.
  • We would be contradicting (no doubt) any argument where we say we don’t like baseX because it’s not “standard”, so we would have to leave that type of argument out of the equation
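To make the first point concrete, a hypothetical “own-CID” payload might look like this (field sizes and order are invented purely for illustration - this is exactly the kind of layout that would no longer be a standard CID multihash field):

```python
# Hypothetical own-CID payload: <xor addr (32)><type tag (8, big-endian)><checksum (2)>.
def pack_own_cid(xor_addr: bytes, type_tag: int, checksum: int) -> bytes:
    assert len(xor_addr) == 32
    return xor_addr + type_tag.to_bytes(8, "big") + checksum.to_bytes(2, "big")

def unpack_own_cid(blob: bytes):
    return (
        blob[:32],
        int.from_bytes(blob[32:40], "big"),
        int.from_bytes(blob[40:42], "big"),
    )
```

Any decoder that expects a plain multihash in that position would choke on these extra fields, which is why it would be a fork of the spec rather than an instance of it.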

I never tried to study this or understand how studies around this were made (I guess you just measure the average occurrence of letters in the words of a dictionary), but just by looking at the worn-off keys on my own keyboard I have to disagree, or at least be skeptical about this statement. In any case, he is saying that since those are the characters most used by humans, they’re easier to read or write than less-used ones - who knows - but I do like those replacements, like removing the 0, L, etc. to avoid confusion

We are talking about xor names (+ stuff)… if they are not roughly equally distributed, the network is not in balance…


That’s correct @riddim, but I guess you see that the proposal that guy is making is that, since those letters are used more in the human vocabulary (assuming English), they are potentially easier to read and write when used in any string you encode with them, regardless of whether you are encoding xornames.

Since we touched on standards a bit, perhaps it’s a nice chance for me to share some thoughts I’ve been having about them over the last few months. Not that important for this discussion maybe, but why not share it here; this is all my own personal speculation and how I perceive it.

I think standards, in many cases in the past, have been designed and worked out by big organisations which were able not only to invest/spend the money to have people in many long meetings where those standards were defined and documented, but which were also in some way monopolising many fields with their products (well, not exactly of course, as it would be a group of organisations); so if you were a small company or an individual you simply didn’t have the chance to participate, and you didn’t have much choice but to follow those standards with no vote if you wanted to sell anything you produced.

Nowadays, and I think more so as we move forward with decentralising several things, I believe small companies or just individuals have more chances to compete with these companies and organisations, since they can reach the end users directly; and when that happens they are in a good position to start pushing for any new “standard” that perhaps wasn’t available or defined before. Other projects, companies and individuals may follow that new spec almost immediately, to be able to participate in a potentially new wave of a successful type of application/service/product, and they won’t wait for any committee or organisation to gather and agree on the new “standard” they can use - they will just move forward. So I guess I see a decentralisation in this regard as well: who defines a standard? …just some random thoughts :slight_smile:


[ignoring how questionable this thesis is by itself] while they for sure won’t appear more often when encoding xor names (or any compressed or binary data)

So the one and only advantage of base32z is leaving out L (using 1 instead), V and 2 (0 and 1 aren’t part of base32 either)

Ah, but there are enough confusable pairs left anyway… nm ec nh vvw rv S5 dq hk ft 1f 1t qg pa yx… (ofc always depending on your handwriting - with printed text it shouldn’t be a problem either way…)

If you really would want to prevent character mixups you’d go with hex

edit/PS - about standards and encodings/readability

PS: @bochaco I do agree with what you said about standards - and without someone starting them, no new standards would appear… But I can only support standards that ‘make sense’, imo… base32z to me just looks like a random definition by one guy who had an idea… The real problem with Bitcoin keys, for example, is that they use upper and lower case, and oO0(Q) / Il1 look very similar depending on the font you choose… Beyond that it’s mostly about handwritten stuff, and base32/base32z seem to solve the super-problematic characters I mentioned there similarly well; when it comes to handwriting there would only be marginally different results in readability. The real solution that would make a difference would be reducing the character set to 16…

… The difference in readability between base32 and base32z looks to me more like a philosophical question, and therefore I would definitely go with the more widespread one… (the base32z description is missing any proof of what it claims, imo…)

And that’s also why I’m really bored of this topic by now, tbh… At the end of the day I don’t care too much which encoding you choose - I will just work with what I get (and implement base58 + base32 (+ safeBrowser) + hex for Python - it doesn’t make sense to loop such a little task through the API and have to take care of forwarded errors returned from Rust…). And if many others share my opinion, you will get asked the same questions again and again, and it’s you who will need to defend your decision again and again :wink: I just want one clear definition that is consistent in itself, and if I get that, it’s okay for my world :wink:

I said what I had to say about this topic, and if you consider implementing block codes for data validation/correction, that is a way simpler task than it looks when reading the Wikipedia article (really pretty simple) and I can support you in implementing them too if you want - but I personally wouldn’t aim for too much, because it becomes a pretty big overhead for little benefit if you want error correction…


I always have a hard time with 1,lowercase l and capital I. None of these please!


… yes, you really need the right font for them to be easily distinguishable…

But that’s never a problem with base-32 encodings - it’s either all uppercase or all lowercase

(base32 using l+i / L+I and base32z using 1+i / 1+I… - none uses all 3 of l, i and 1)