💾 Archived View for gemi.dev › gemini-mailing-list › 000578.gmi captured on 2024-06-16 at 13:47:15. Gemini links have been rewritten to link to archived content

What is required to be IRI compliant?

1. Solene Rapenne (solene (a) perso.pw)

Hello,

I am the author of a Gemini daemon written in C. Following
the discussion about IRIs, I added tests to my project
to try UTF-8 and emoji requests.

Requests such as the following are working well:

- gemini://ho?t/? ?.gmi
- gemini://?//??.gmi

Honestly, I am very surprised it works: in my code
everything is a simple char array (in C), but it does.

Is there any other specific handling required for being
IRI compliant? I'm not sure I understood exactly what
it means.

2. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 12:15, Solene Rapenne <solene at perso.pw> wrote:
> 
> Requests such as the following are working well:
> 
> - gemini://ho?t/? ?.gmi
> - gemini://?//??.gmi
> 
> Honestly, I am very surprised it works

Fantastic. Welcome to the future. You have done it.

And yes, the only major difference between URI and IRI is the character
set allowed in the various URL parts: ASCII vs UTF-8 (± resolution
details, Unicode normalization, and other minor annoyances).

Very well done. Full credits.

3. Solderpunk (solderpunk (a) posteo.net)

On Mon Dec 28, 2020 at 12:15 PM CET, Solene Rapenne wrote:

> Requests such as the following are working well:
>
> - gemini://ho?t/? ?.gmi
> - gemini://?//??.gmi
>
> Honestly, I am very surprised it works

Me too!  Are you using a third party library to parse URIs/IRIs, or did
you implement it yourself?  People have acted like there is no easy
availability of reliable libraries for this kind of thing in C.  If that
is false, it would be very good to know.

To be fair, for a server, in addition to being able to parse the request
IRI, there is also possibly the need to normalise it, e.g. the
server's idea of its domain name might involve two separate characters
(a basic vowel plus an accent symbol, say) while the request's version
uses a single combined character (or the other way around).  We might
spec that one form is required, but robustness would require checking.
It might be that *this* is the really hard requirement, rather than
simply parsing.

(servers seem to get off lighter than clients, as they don't e.g. need
to do DNS lookups or resolve relative URLs - which, by the way, seems to
be the correct terminology, not "absolutise" as I've confused people
with earlier)

Cheers,
Solderpunk

4. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 12:20, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
>> Honestly, I am very surprised it works
> 
> Fantastic. Welcome to the future. You have done it.

In case the "tone of voice" is not clear, yes, you have done it. Most 
earnest congratulations.

5. Solene Rapenne (solene (a) perso.pw)

On Mon, 28 Dec 2020 12:41:15 +0100
"Solderpunk" <solderpunk at posteo.net>:

> On Mon Dec 28, 2020 at 12:15 PM CET, Solene Rapenne wrote:
> 
> > Requests such as the following are working well:
> >
> > - gemini://ho?t/? ?.gmi
> > - gemini://?//??.gmi
> >
> > Honestly, I am very surprised it works  
> 
> Me too!  Are you using a third party library to parse URIs/IRIs, or did
> you implement it yourself?  People have acted like there is no easy
> availability of reliable libraries for this kind of thing in C.  If that
> is false, it would be very good to know.
> 
> To be fair, for a server, in addition to being able to parse the request
> IRI, there is also possibly the need to normalise it, e.g. the
> server's idea of its domain name might involve two separate characters
> (a basic vowel plus an accent symbol, say) while the request's version
> uses a single combined character (or the other way around).  We might
> spec that one form is required, but robustness would require checking.
> It might be that *this* is the really hard requirement, rather than
> simply parsing.
> 
> (servers seem to get off lighter than clients, as they don't e.g. need
> to do DNS lookups or resolve relative URLs - which, by the way, seems to
> be the correct terminology, not "absolutise" as I've confused people
> with earlier)
> 
> Cheers,
> Solderpunk

the code doesn't check anything, it only serves what is requested [1].

I don't understand what you mean by normalizing the request.
For the hostname, I see no reason to write "écrire.hostname" as
"e'crire.hostname", if that is what you mean.

What I see as an issue is people using punycode if we go with IRIs.
The server would then have to compare the punycode form of its
hostname against requests that use punycode.

A library will certainly be required for that.


[1]:
```
/*
 * look for the first / after the hostname
 * in order to split hostname and uri
 */
pos = strchr(request, '/');

if (pos != NULL) {
	/* a / was found: separate hostname and uri */
	/* bound the copy by the destination size, not the source length */
	estrlcpy(file, pos, sizeof(file));
	/* just keep hostname in request */
	pos[0] = '\0';

	/*
	 * use a default file if no file is requested; this
	 * can happen in two cases: gemini://hostname/ and
	 * gemini://hostname/directory/
	 */
	if (strlen(file) == 0)
		estrlcpy(file, "/index.gmi", sizeof(file));
	if (file[strlen(file) - 1] == '/')
		estrlcat(file, "index.gmi", sizeof(file));
} else {
	/*
	 * there is no slash / in the request
	 */
	estrlcpy(file, "/index.gmi", sizeof(file));
}
estrlcpy(hostname, request, sizeof(hostname));
```

6. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 12:59, Solene Rapenne <solene at perso.pw> wrote:
> 
> For the hostname, I see no reason to write "écrire.hostname" as
> "e'crire.hostname", if that is what you mean.

The crux of the issue with Unicode, and UTF-8 by extension, is that the
same character, say, the e with an acute accent "é", can be represented in
different ways, encoding wise, while being visually identical.

This is what people mean by "normalization": everyone needs to agree how
to encode "é" the same way, so everyone understands what "é" is.

This is a solved problem. But it introduces additional work.

7. William Orr (will (a) worrbase.com)

Hey,

Normalization is not so much turning e'crire to écrire, but handling the
multiple representations of the word 'écrire'.

For example, the first character can be represented by multiple sets of
unicode codepoints.

é can either be U+00E9 or it can also be the sequence of U+0065 U+0301 (e
plus what's called a combining character). Both should render visibly as
é, and the input method is free to produce whichever form.

Normalization is the process of looking for all of these synonyms for
characters, and standardizing them to the same set of codepoints. If you
don't normalize, you could have a case where one user gets the intended
host for écrire.hostname and another user gets an NXDOMAIN, all depending
on the sequence of bytes their input method produced.
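
As a minimal illustration of the point above (not code from any server discussed in this thread), the two spellings of "écrire" can be written out as UTF-8 bytes and compared the way a plain char-array server would compare them. The names composed, decomposed and same_bytes are hypothetical.

```c
#include <string.h>

/* "écrire" with the precomposed é (U+00E9), encoded as UTF-8. */
static const char composed[] = "\xC3\xA9" "crire";

/* "écrire" as e (U+0065) followed by the combining acute accent (U+0301). */
static const char decomposed[] = "e\xCC\x81" "crire";

/* Byte-wise equality, which is all strcmp() can offer. */
static int same_bytes(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

same_bytes(composed, decomposed) returns 0 even though both strings display as the same word, which is why a hostname or path lookup can miss unless both sides are normalized to the same form first.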

Server-side, you probably only need to normalize the request path after 
doing percent decoding, since you can't always trust that a client 
normalized the request path correctly.

To do normalization in C, the best lib that I know of is libicu. 
http://site.icu-project.org/

There are different types of normalization, but imo the only kind that 
server authors should care about is NFC, since it accomplishes the goal of 
standardizing the set of bytes you're looking up, while also keeping the 
characters composed in a way that makes sense to display (in like logs and 
stuff). Here's a technical report on all of the different normalization 
forms for more reading: https://www.unicode.org/reports/tr15/

Hope this helps!

8. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 13:04, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> This is what people mean by "normalization": everyone needs to agree
> how to encode "é" the same way, so everyone understands what "é" is.

See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization for an introduction.

This is not an abstract problem, as it can lead to unwelcome outcomes; see
the "Errors due to normalization differences" section of that article. Or
simply to misunderstandings.

Furthermore, if you really want to go all the way, you need to validate
the UTF-8 byte sequences as well.

In the same way as there are different ways to represent the very same 
character in Unicode, there are various ways to encode that character in 
Unicode Transformation Format (UTF). Some of them malicious.

See https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling for an introduction.
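
A rough sketch of what such validation could look like in plain C, assuming the request is available as a byte buffer with a known length. The function name utf8_is_valid is hypothetical; it rejects overlong encodings, UTF-16 surrogates and code points above U+10FFFF, but it does no normalization.

```c
#include <stddef.h>

/*
 * Return 1 if the len bytes at s are well-formed UTF-8, else 0.
 * Rejects overlong forms, surrogates (U+D800..U+DFFF) and
 * code points above U+10FFFF.
 */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t need;		/* continuation bytes expected */
		unsigned long cp, min;	/* code point and its smallest legal value */

		if (c < 0x80) {		/* plain ASCII */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (len - i <= need)
			return 0;	/* sequence is truncated */
		for (size_t k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return 0;	/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the Unicode range */
		i += need + 1;
	}
	return 1;
}
```

The overlong check matters for a file server: the two bytes 0xC0 0xAF are an illegal encoding of "/", and a decoder that accepted them could let a path separator slip past checks made on the raw bytes.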

In your case, considering your stack, you can choose to ignore such 
potential issues, or delegate them to some external libraries.

Your choice, ultimately.

In any case, well done getting it going with a minimal amount of work.
This is the way.

9. Philip Linde (linde.philip (a) gmail.com)

On Mon, 28 Dec 2020 12:59:12 +0100
Solene Rapenne <solene at perso.pw> wrote:

> I don't understand what you mean by normalizing the request.
> For the hostname, I see no reason to write "écrire.hostname" as
> "e'crire.hostname", if that is what you mean.

There are sometimes multiple ways to represent the same characters in
Unicode. For example "é" in the composed form is just U+00E9, and in
the decomposed form it's U+0065, U+0301. These are visually
indistinguishable and take the same meaning in Unicode, but whether
encoded as UTF-8 or UTF-32, a byte-by-byte or even code point-by-code
point comparison will not show that they are equivalent.

Therefore the Unicode consortium defines a process called normalization
to either "fully compose" characters (turn sequences like U+0065,
U+0301 into their composed form, U+00E9) or "fully decompose" (which
works the other way around). To support this you'd need a database that
the Unicode consortium distributes.

If you are using GLib (or are ready to use GLib) it has functions for
this.

> What I see as an issue is people using punycode if we go with IRIs.
> The server would then have to compare the punycode form of its
> hostname against requests that use punycode.
> 
> A library will certainly be required for that.

Again, if you're willing to use GLib, it has functions for this:
g_hostname_to_ascii() and g_hostname_to_unicode(), available through glib.h.

-- 
Philip

10. Solderpunk (solderpunk (a) posteo.net)

On Mon Dec 28, 2020 at 12:59 PM CET, Solene Rapenne wrote:

> the code doesn't check anything, it only serves what is requested [1].

Okay, I just read the complete Vger code, and see that the URL handling
is indeed pretty minimal.  E.g. domain-based virtual hosting would break
if a request included an explicit port (which would be interpreted as
part of the hostname), and filenames with spaces in them cannot be
served (because the %20 encoding of the space character in the URL path
is not reversed by the server), and queries (which make no sense for
static content and should be ignored) are not separated from the path.
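
The three gaps listed above can be sketched in a few lines of plain C. This is an illustration only, not Vger's code or its eventual patch: the function names are hypothetical, the input is assumed to be the request with the "gemini://" scheme already stripped, IPv6 literal hosts are not handled, and the decoded path is not checked for ".." segments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (unsigned char)tolower(c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Decode %XX escapes from src into dst (dstsize bytes, NUL included).
 * Returns 0 on success, -1 on a malformed escape, an encoded NUL
 * byte, or when dst is too small.
 */
static int percent_decode(const char *src, char *dst, size_t dstsize)
{
	size_t n = 0;

	while (*src != '\0') {
		unsigned char c = (unsigned char)*src;

		if (c == '%') {
			int hi = hex_value((unsigned char)src[1]);
			int lo = (hi < 0) ? -1 : hex_value((unsigned char)src[2]);

			if (hi < 0 || lo < 0)
				return -1;
			c = (unsigned char)(hi * 16 + lo);
			if (c == 0)
				return -1;
			src += 3;
		} else {
			src++;
		}
		if (n + 1 >= dstsize)
			return -1;
		dst[n++] = (char)c;
	}
	if (dstsize == 0)
		return -1;
	dst[n] = '\0';
	return 0;
}

/*
 * Split "host[:port]/path[?query]" in place: drop the query, copy the
 * decoded path into path, and cut the port so req holds only the host.
 */
static int split_request(char *req, char *path, size_t pathsize)
{
	char *query = strchr(req, '?');
	char *slash, *colon;

	if (query != NULL)
		*query = '\0';		/* queries are ignored for static files */

	slash = strchr(req, '/');
	if (percent_decode(slash != NULL ? slash : "/", path, pathsize) != 0)
		return -1;
	if (slash != NULL)
		*slash = '\0';

	colon = strrchr(req, ':');
	if (colon != NULL)
		*colon = '\0';		/* drop an explicit port */
	return 0;
}
```

With a writable buffer holding "example.org:1965/my%20file.gmi?q=1", split_request() leaves "example.org" in the buffer and "/my file.gmi" in path. Because decoding happens before any file lookup, a production server would still need to reject ".." segments afterwards.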

I don't mean this as criticism, obviously for hobby servers something
like this is perfectly workable.  I guess the concern is that we ideally
want even "production grade" servers to be quite easy to write.  Such a
server would need to handle all the things above and more.  Few people
want to do all that parsing by hand, so would use a library, and because
URI parsing is so widely used the theory is it's never too hard to find
a library for your favourite language with your favourite license.  This
is why there are already plenty of production grade (or close to it)
servers and clients out there.  But apparently many URI parsing
libraries have never been upgraded to also do IRIs.  I had hoped Vger's
surprising IRI compatibility meant there was easily available C code for
doing this after all, but it seems not.

So, while this is still very happy news, it's no rebuttal of the
argument that switching to IRIs would make writing a solid server in C
(or other languages without extensive and modern standard libraries, C
is not really special here) into a considerably more difficult
undertaking.

Cheers,
Solderpunk

11. Solene Rapenne (solene (a) perso.pw)

On Mon, 28 Dec 2020 14:10:36 +0100
"Solderpunk" <solderpunk at posteo.net>:

> On Mon Dec 28, 2020 at 12:59 PM CET, Solene Rapenne wrote:
> 
> > the code doesn't check anything, it only serves what is requested [1].  
> 
> Okay, I just read the complete Vger code, and see that the URL handling
> is indeed pretty minimal.  E.g. domain-based virtual hosting would break
> if a request included an explicit port (which would be interpreted as
> part of the hostname), and filenames with spaces in them cannot be
> served (because the %20 encoding of the space character in the URL path
> is not reversed by the server), and queries (which make no sense for
> static content and should be ignored) are not separated from the path.
 
Indeed, a request with an explicit port would currently break; a patch
will be published soon. I have had very little feedback about the daemon
and it seems to work for most people, so yes, there are still issues (or
unhandled special cases) to be found, but I'd like vger to be production
grade.

> I don't mean this as criticism, obviously for hobby servers something
> like this is perfectly workable.  I guess the concern is that we ideally
> want even "production grade" servers to be quite easy to write.  Such a
> server would need to handle all the things above and more.  Few people
> want to do all that parsing by hand, so would use a library, and because
> URI parsing is so widely used the theory is it's never too hard to find
> a library for your favourite language with your favourite license.  This
> is why there are already plenty of production grade (or close to it)
> servers and clients out there.  But apparently many URI parsing
> libraries have never been upgraded to also do IRIs.  I had hoped Vger's
> surprising IRI compatibility meant there was easily available C code for
> doing this after all, but it seems not.
> 
> So, while this is still very happy news, it's no rebuttal of the
> argument that switching to IRIs would make writing a solid server in C
> (or other languages without extensive and modern standard libraries, C
> is not really special here) into a considerably more difficult
> undertaking.
> 
> Cheers,
> Solderpunk

I'll take a look at libicu as suggested. I think it's widely available
and certainly already installed for most people because it must be a very
common dependency, so I wouldn't consider it bloat to use libicu.

If I get something done with libicu, I'll share it here if someone
is interested. For other languages, it would still be possible to
use an FFI to call the C libicu from your favorite programming language.

Obviously, supporting IRIs is far more complicated than answering a
request containing special characters. That's why I asked on the mailing
list: I lack knowledge in this field and I couldn't imagine all cases.

12. Solderpunk (solderpunk (a) posteo.net)

On Mon Dec 28, 2020 at 1:12 PM CET, William Orr wrote:

> Normalization is the process of looking for all of these synonyms for
> characters, and standardizing them to the same set of codepoints. If you
> don't normalize, you could have a case where one user gets the intended
> host for écrire.hostname and another user gets an NXDOMAIN, all
> depending on the sequence of bytes their input method produced.

...and actually, now that I think about it, this issue is not specific to
IRI support, is it?  Even if we followed the web's lead and declared
that Gemini requests and text/gemini links must contain ASCII-only URLs,
and people have to do punycoding of non-ASCII hostnames and
percent-encoding of UTF-8 representations of non-ASCII paths, it's still
possible for the server and client to have different ideas about how a
hostname or path are represented, right?  With one using a composed form
and the other a decomposed form?  Whether you send a UTF-8 string as-is
or first punycode and/or percent-encode it so it's valid ASCII is
totally orthogonal to that question.  Or have I missed something
important?

Cheers,
Solderpunk

13. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 28, 2020, at 14:54, Solderpunk <solderpunk at posteo.net> wrote:
> 
> ...and actually, now that I think about it, this issue is not specific to
> IRI support, is it?

Correct. Unicode normalization is an issue in and of itself.

Of course, as demonstrated by Solene, there are various levels of
compliance: from simple but functional, to bulletproof, industrial grade, and complex.

It's a spectrum. Like this mailing list :D

14. William Orr (will (a) worrbase.com)

Yes, that's absolutely the case even if IRIs are not used. Hopefully URI 
libraries and IDNA libraries handle this correctly by doing normalization 
before percent encoding/punycoding, but I haven't checked any implementations personally.

Normalization should come up in other contexts as well that would be 
common in gemini, like "find in page," search indexing, etc. Those cases 
may even make use of other normalization schemes like NFKC.

15. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 28, 2020 at 12:12:57PM +0000,
 William Orr <will at worrbase.com> wrote 
 a message of 129 lines which said:

> There are different types of normalization, but imo the only kind
> that server authors should care about is NFC,

And also because there is an Internet standard about Unicode network
format and it mandates NFC. RFC 5198
<gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5198.txt>

16. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 28, 2020 at 02:54:14PM +0100,
 Solderpunk <solderpunk at posteo.net> wrote 
 a message of 22 lines which said:

> it's still possible for the server and client to have different
> ideas about how a hostname or path are represented, right?  With one
> using a composed form and the other a decomposed form?

Not right. RFC 5891, the standard for IDN, states that "By the time a
string enters the IDNA registration process as described in this
specification, it MUST be in Unicode and in Normalization Form C (NFC
[Unicode-UAX15])." So, at least the hostname has no problem.

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5891.txt