Three possible uses for IRIs

🗣️ From: William Orr (will (a) worrbase.com)
📅 Sent: 2020-12-08 11:46
📧 Message 13 of 32

Philip Linde writes:

> On Mon, 07 Dec 2020 20:06:27 -0800
> "Emma Humphries" <ech at emmah.net> wrote:
>
>> I'm perplexed that "ease of programming" is considered more important
>> than "ease of adoption."
>
> Consider "ease of programming" and in particular stability a subset of
> "ease of adoption". There are numerous client and server
> implementations because it is easy to implement, and easy to maintain
> because the protocol is relatively stable even in these early stages.
> The different software allows people with different goals to adopt the
> protocol, and helps in weeding out shortcomings of clarity in the
> specification by analysis of the subtle differences between
> implementations.
>
>> You mention that not every language supports the libraries needed for
>> internationalized URLs.
>>
>> What does that lose the project vs. accessibility and broader adoption
>> by non-English-speaking users for who Gemini would be a boon with limited
>> bandwidth and hardware?
>
> It seems more likely that a change to this end would hurt adoption.
> Numerous pieces of existing Gemini software would immediately be
> invalidated. Not all of them will be updated to accommodate the change.
> I could perhaps see a more pressing need for the change if internet
> users worldwide weren't already used to transliteration. It's such a
> small part as well. UTF-8 is acceptable (and default) in text/gemini
> documents, and the text content of a capsule can indeed be written in
> any of the scripts supported by Unicode.

Hey,

I'm new to this list, and a new Gemini user, but this topic is fairly 
important to me. It's discouraging to see a lot of fear-mongering around 
this topic already.

Some points that have come up a few times already in this thread as well 
as the IDN thread that I think are worth addressing:

1. Homograph attacks

Stephane has already mentioned in a different response that homograph 
attacks are fairly rare. I don't have the knowledge to say whether or not 
that's accurate, but I can speak to how they're mitigated.

In general, browsers will render the domain in the URI bar if all of the
characters in each section belong to the same script. As an example,
https://?pple.com will not render correctly in Firefox in the URI bar, but
https://?????.com/ will render correctly (neither domain exists, if you
want to check).

The other half of this comes down to domain registrars not allowing 
registrations of domains with homographs (depends on the TLD, of course).

What this comes down to is that Gemini clients, if they wish to mitigate
this type of attack, should apply the same algorithm as web browsers.
Again, given the preference for client certs for authenticating sessions,
it doesn't seem like this attack would have dire consequences anyway.
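
A minimal sketch of the "same script per section" check described above.
Python's stdlib does not expose the Unicode Script property, so this
approximates it with the first word of each character's Unicode name
(e.g. "LATIN", "CYRILLIC"); that heuristic and the function names are
assumptions made for illustration. Real browsers apply more rules than
this, so treat it as a starting point rather than their actual algorithm.

```python
import unicodedata


def script_of(ch):
    # Rough stand-in for the Unicode Script property: the first word of
    # the character name, e.g. "LATIN" or "CYRILLIC". This is a heuristic.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def label_is_single_script(label):
    # Digits and hyphens are shared between scripts, so ignore them.
    scripts = {script_of(ch) for ch in label if not (ch.isdigit() or ch == "-")}
    return len(scripts) <= 1


def host_is_displayable(host):
    # Show the Unicode form only if every dot-separated label is
    # written in a single script; otherwise a client could fall back
    # to showing the ASCII (punycode) form.
    return all(label_is_single_script(label) for label in host.split("."))
```

For instance, a host whose first label mixes a Cyrillic "а" (U+0430) with
Latin letters fails the check, while a label written entirely in Cyrillic
passes.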

I also think I saw someone mention that they're worried about it from the
IRI side as well? That attack doesn't seem like much of a realistic case,
since if they direct you to a different page on the same server, you're,
well, still on the same server. This only becomes problematic in the case
of shared hosting of untrusted tenants.

2. Normalization

There's been a bit of fear-mongering about normalization, which I can
totally understand, since the Unicode technical reports and the 4
normalization forms look intimidating at first glance.

However, as pointed out in a few RFCs, NFC is more or less the only 
normalization form that you need to worry about in *most* circumstances. 
Typed URIs should be normalized in NFC, both on server-side and 
client-side. When resolving files to the filesystem, the filename should 
be normalized to NFC. (This all assumes that your fs supports Unicode paths.)
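
As a small illustration of the NFC step, here is a sketch using Python's
stdlib unicodedata.normalize. The function name normalize_path and the
example path are made up for illustration; the point is only that a
decomposed and a precomposed spelling of the same path compare equal
once both are in NFC.

```python
import unicodedata


def normalize_path(path):
    # Bring a typed or requested path into NFC before comparing it
    # with anything else (for example, names on the filesystem).
    return unicodedata.normalize("NFC", path)


# "e" followed by COMBINING ACUTE ACCENT (two code points)
decomposed = "/cafe\u0301.gmi"
# precomposed LATIN SMALL LETTER E WITH ACUTE (one code point)
composed = "/caf\u00e9.gmi"

# The raw strings differ, but their NFC forms are identical.
same_after_nfc = normalize_path(decomposed) == normalize_path(composed)
```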

NFKC becomes more relevant in the case that you want to implement
something like search, or find in page, or something. You may want a user
to be able to type in something like 'e' and have their find include
everything whose NFKC form is basically an 'e' (see the full set here:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANFKC_Casefold%3De%3A%5D&g=&i=).
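
A rough sketch of that kind of find, again in Python. It approximates
NFKC_Casefold by applying NFKC and then str.casefold(), which is close
to but not exactly the Unicode-defined mapping; the helper names fold
and matching_lines are hypothetical.

```python
import unicodedata


def fold(text):
    # Approximation of NFKC_Casefold: compatibility-normalize, then
    # case-fold. Good enough for a find-in-page style comparison.
    return unicodedata.normalize("NFKC", text).casefold()


def matching_lines(needle, lines):
    # Return the original lines whose folded form contains the
    # folded search term.
    folded_needle = fold(needle)
    return [line for line in lines if folded_needle in fold(line)]
```

With this, searching for 'e' also matches a fullwidth "ｅ" (U+FF45) or
an uppercase "E", but not a precomposed "é", which NFKC leaves as a
distinct character.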

3. Language support

Normalization is generally supported across different languages pretty easily.

Python has it in its stdlib: 
https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

Golang has support: https://pkg.go.dev/golang.org/x/text/unicode/norm

Rust: https://unicode-rs.github.io/unicode-normalization/unicode_normalization/index.html

C gets its support through the venerable libicu library (you're already
using libs for TLS):
https://unicode-org.github.io/icu/userguide/transforms/normalization/

I will say that I don't know of any explicit IRI-handling libraries, nor 
do I know what the state of support is in different URI-handling 
libraries, but it will be something I play with as I work on gemini 
projects. I'm happy to share my experiences when I have more of them. :)
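
In the absence of a dedicated IRI library, one way to experiment is to
map an IRI to a plain URI with the stdlib alone. The sketch below is an
assumption-laden illustration, not a complete implementation: it uses
Python's built-in "idna" codec (which implements the older IDNA 2003
rules) for the host and UTF-8 percent-encoding for the path and query.
The function name iri_to_uri and the example address are hypothetical.

```python
import unicodedata
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    # Normalize to NFC first, as suggested for typed addresses.
    parts = urlsplit(unicodedata.normalize("NFC", iri))

    # Host: convert each label to its ASCII (punycode) form.
    # Note: the stdlib codec follows IDNA 2003, not IDNA 2008.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Path and query: percent-encode non-ASCII as UTF-8 bytes.
    # "%" is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")

    return urlunsplit((parts.scheme, host, path, query, ""))
```

For example, iri_to_uri("gemini://bücher.example/café.gmi") yields
"gemini://xn--bcher-kva.example/caf%C3%A9.gmi".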

-

To address some non-technical points, I don't think that starting a new 
protocol and then deciding to ignore internationalization is necessarily 
the right way to go. In a lot of cases, internationalization sucks because 
of legacy support, and gemini doesn't *have* a legacy to stay compatible
with. As I understand it, that's why TLS is mandatory, even
though it arguably locks out some retro systems from being able to use it.

Personally, I'd like to see the spec say something about how this is 
handled before any type of freeze takes place.

--
worr