What is required to be IRI compliant?

🗣️ From: William Orr (will (a) worrbase.com)
📅 Sent: 2020-12-28 12:12
📧 Message 7 of 16
Hey,

Normalization is not so much turning e'crire to écrire, but handling the 
multiple representations of the word 'écrire'.

For example, the first character can be represented by more than one 
sequence of Unicode codepoints.

é can either be U+00E9 or it can also be the sequence of U+0065 U+0301 (e 
plus what's called a combining character). Both should render as é, and 
the input method is free to produce whichever form.

Normalization is the process of looking for all of these synonyms for 
characters, and standardizing them to the same set of codepoints. If you 
don't normalize, you could have a case where one user gets the intended 
host for écrire.hostname and another user gets an NXDOMAIN, all depending 
on the sequence of bytes their input method produced.
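
To make the mismatch concrete, here is a small sketch in C. It only 
compares the UTF-8 bytes of the two spellings of "écrire"; it does not 
normalize anything. The names composed, decomposed and same_bytes are 
made up for this example.

```c
#include <assert.h>
#include <string.h>

/* "écrire" with a precomposed é (U+00E9), encoded as UTF-8.
 * The literal is split so that "c" is not read as part of the \x escape. */
static const char composed[] = "\xC3\xA9" "crire";

/* "écrire" as e (U+0065) followed by a combining acute accent (U+0301). */
static const char decomposed[] = "e\xCC\x81" "crire";

/* Returns 1 when the two strings are byte-for-byte identical, else 0. */
static int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

same_bytes(composed, decomposed) returns 0 even though both strings 
display as the same word, which is exactly why a plain strcmp() on an 
un-normalized hostname or path can miss.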

Server-side, you probably only need to normalize the request path after 
doing percent decoding, since you can't always trust that a client 
normalized the request path correctly.
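
The percent decoding step could look roughly like the sketch below. It is 
only an illustration and not taken from any existing server: 
percent_decode and hex_value are hypothetical names, and rejecting a 
decoded NUL byte is my own assumption about what a server would want. The 
decoded bytes would then be handed to a normalization library.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode %XX escapes from `in` into `out`, which holds `outsz` bytes
 * including the terminating NUL. Returns the decoded length, or -1 on a
 * malformed escape, a decoded NUL byte, or an output buffer that is too
 * small.
 */
static long percent_decode(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        unsigned char c = (unsigned char)*in;

        if (c == '%') {
            /* Only look at in[2] once in[1] is known to be a hex digit,
             * so we never read past the end of the string. */
            int hi = hex_value((unsigned char)in[1]);
            int lo = (hi < 0) ? -1 : hex_value((unsigned char)in[2]);

            if (hi < 0 || lo < 0)
                return -1;
            c = (unsigned char)(hi * 16 + lo);
            if (c == '\0')
                return -1; /* refuse an embedded NUL */
            in += 3;
        } else {
            in += 1;
        }
        if (n + 1 >= outsz)
            return -1; /* no room for this byte plus the NUL */
        out[n++] = (char)c;
    }
    if (outsz == 0)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

For instance, decoding "/%C3%A9crire.gmi" yields the bytes of 
"/écrire.gmi" with the precomposed é, and that decoded path is what you 
would normalize before touching the filesystem.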

To do normalization in C, the best lib that I know of is libicu. 
http://site.icu-project.org/

There are different types of normalization, but imo the only kind that 
server authors should care about is NFC, since it accomplishes the goal of 
standardizing the set of bytes you're looking up, while also keeping the 
characters composed in a way that makes sense to display (in like logs and 
stuff). Here's a technical report on all of the different normalization 
forms for more reading: https://www.unicode.org/reports/tr15/

Hope this helps!

28 Dec. 2020 12:59:33 Solene Rapenne <solene at perso.pw>:

> On Mon, 28 Dec 2020 12:41:15 +0100
> "Solderpunk" <solderpunk at posteo.net>:
> 
>> On Mon Dec 28, 2020 at 12:15 PM CET, Solene Rapenne wrote:
>> 
>>> Requests such as the following are working well:
>>> 
>>> - gemini://ho?t/? ?.gmi
>>> - gemini://?//??.gmi
>>> 
>>> Honestly, I am very surprised it works?
>> 
>> Me too! Are you using a third party library to parse URIs/IRIs, or did
>> you implement it yourself? People have acted like there is no easy
>> availability of reliable libraries for this kind of thing in C. If that
>> is false, it would be very good to know.
>> 
>> To be fair, for a server, in addition to being able to parse the request
>> IRI, there is also possibly the need to normalise it, e.g. the
>> server's idea of its domain name might involve two separate characters
>> (a basic vowel plus an accent symbol, say) while the request's version
>> uses a single combined character (or the other way around). We might
>> spec that one form is required, but robustness would require checking.
>> It might be that *this* is the really hard requirement, rather than
>> simply parsing.
>> 
>> (servers seem to get off lighter than clients, as they don't e.g. need
>> to do DNS lookups or resolve relative URLs - which, by the way, seems to
>> be the correct terminology, not "absolutise" as I've confused people
>> with earlier)
>> 
>> Cheers,
>> Solderpunk
>> 
>> The impression I've received from other people here is that
>> parsing an IRI in C is prohibitively difficult.
>>> in my code
>>> everything are simple char arrays (in C), but it does.
>>> 
>>> Are there other specifics handling required for being
>>> IRI compliant? I'm not sure I understood exactly what
>>> it means.
>> 
> 
> the code doesn't check anything, it only serves what is requested [1].
> 
> I don't understand what you mean by normalizing the request.
> For the hostname, I see no reason to write "écrire.hostname" as
> "e'crire.hostname" if that is what you mean.
> 
> What I see as an issue would be people using Punycode if we go
> with IRIs. That would mean the server has to work out the Punycode
> form of its hostname in order to match a request that uses Punycode.
> 
> A library will certainly be required for that.
> 
> 
> [1]:
> ```
> /*
> * look for the first / after the hostname
> * in order to split hostname and uri
> */
> pos = strchr(request, '/');
> 
> if (pos != NULL) {
>   /* if there is a / found */
>   /* separate hostname and uri */
>   estrlcpy(file, pos, strlen(pos)+1);
>   /* just keep hostname in request */
>   pos[0] = '\0';
> 
>   /*
>    * use a default file if no file are requested this
>    * can happen in two cases gemini://hostname/
>    * gemini://hostname/directory/
>    */
>   if (strlen(file) == 0)
>     estrlcpy(file, "/index.gmi", 11);
>   if (file[strlen(file) - 1] == '/')
>     estrlcat(file, "index.gmi", sizeof(file));
> } else {
>   /*
>    * there are no slash / in the request
>    */
>   estrlcpy(file, "/index.gmi", 11);
> }
> estrlcpy(hostname, request, sizeof(hostname));
> ```