πŸ’Ύ Archived View for gemi.dev β€Ί gemini-mailing-list β€Ί 000524.gmi captured on 2024-05-12 at 16:08:35. Gemini links have been rewritten to link to archived content


IDN with Gemini?

1. Stephane Bortzmeyer (stephane (a) sources.org)

The specification
<gemini://gemini.circumlunar.space/docs/specification.gmi> seems
silent about IDN (Internationalized Domain Names, domains in Unicode,
see RFC 5890). The spec mentions URI (not 3986) but not IRI
(Internationalized Resource Identifiers, RFC 3987).

Therefore, it is not clear what servers and clients should do (send an
IRI, or accept IRI but convert it to URI or something else). A test
with some clients seem to indicate it does not work (tested at
<gemini://gémeaux.bortzmeyer.org/>):


- Amfora claims the domain name does not exist (it does exist):
  "Failed to connect to the server: dial tcp: lookup
  gémeaux.bortzmeyer.org: no such host."

- The server (Gemserv) fails to match either using the UTF-8 form
  (U-label) or the Punycode one (A-label),

- Bombadillo says "Found "?", expected EOF"

What is the normal behaviour?

Link to individual message.

2. John Cowan (cowan (a) ccil.org)

On Fri, Dec 4, 2020 at 8:53 AM Stephane Bortzmeyer <stephane at sources.org>
wrote:

> Therefore, it is not clear what servers and clients should do (send an
> IRI, or accept IRI but convert it to URI or something else).

It seems clear from the behavior of web browsers that the Right Thing is to
convert all IDNs to Punycode before putting them on the wire.  By the same
token, all non-ASCII characters in other parts should be UTF-8 encoded and
then %-encoded before transmission.  This applies both to IRIs entered by
hand and IRIs appearing in links.
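A minimal sketch of that conversion, assuming a Python client (the domain is the one tested earlier in this thread; the path is a hypothetical example):

```python
# Convert an IRI to a plain-ASCII URI: hostname to Punycode (A-labels),
# everything else UTF-8 encoded and then %-encoded.
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    parts = urlsplit(iri)
    # Each DNS label becomes its ASCII A-label (IDNA/Punycode).
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Non-ASCII in the path/query becomes UTF-8 bytes, then percent-escapes.
    return urlunsplit((parts.scheme, host, quote(parts.path, safe="/%"),
                       quote(parts.query, safe="=&%"), parts.fragment))

print(iri_to_uri("gemini://gémeaux.bortzmeyer.org/français"))
# gemini://xn--gmeaux-bva.bortzmeyer.org/fran%C3%A7ais
```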


> * Amfora claims the domain name does not exist (it does exist),
> "Failed to connect to the server: dial tcp: lookup
> gémeaux.bortzmeyer.org <http://xn--gmeaux-bva.bortzmeyer.org>: no such
> host."
>

I'm pretty sure this is because no punycoding is being done in the DNS, and
it's probably getting the UTF-8 encoding instead of "
xn--gmeaux-bva.bortzmeyer.org".  When I ask Lagrange to connect to the
punycoded form explicitly, your server does not recognize it as "self" and
replies with "Proxy Request Refused".

I can't account for the behavior of the other servers easily.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
One art / There is / No less / No more
To do / All things / With sparks / Galore   --Douglas Hofstadter

Link to individual message.

3. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 04, 2020 at 09:46:40AM -0500,
 John Cowan <cowan at ccil.org> wrote 
 a message of 93 lines which said:

> It seems clear from the behavior of web browsers that the Right Thing is to
> convert all IDNs to Punycode before putting them on the wire.

I disagree. The fact that HTTP does it that way does not mean that
everyone else should do the same. IMHO, the right behaviour would be:

- parse the IRI and extract the domain name
- convert it to Punycode
- do the DNS lookup
- connect to the IP address and send the IRI as request

This leaves open interesting issues such as Unicode normalization, but
it would be more natural for a new protocol, free of HTTP legacy.

Link to individual message.

4. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> -   parse the IRI and extract the domain name
> -   convert it to Punycode
> -   do the DNS lookup
> -   connect to the IP address and send the IRI as request

I feel like this is probably the most intuitive method. Only
use punycoding when it's a necessity, like for DNS lookups.

What about link lines though? I think that clients should
accept both punycoded and Unicode domains in links. Convert
all links' domain to punycode for DNS, then convert all links'
domain to Unicode for sending. That seems a bit complicated,
but from an author perspective it makes sense to support both.
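The flow under discussion (Punycode strictly as a DNS-lookup detail, the Unicode form kept on the wire) can be sketched like this; a rough illustration of the proposal only, with the actual connection step left as a comment:

```python
# Hypothetical client routine: Punycode is used only to resolve the
# name; the request sent over the wire keeps the Unicode (IDN) form.
from urllib.parse import urlsplit

def prepare_request(iri: str) -> tuple[str, str]:
    """Return (name_for_dns_lookup, gemini_request_line)."""
    host = urlsplit(iri).hostname
    dns_name = host.encode("idna").decode("ascii")   # A-labels for DNS
    # A real client would now do socket.getaddrinfo(dns_name, 1965),
    # open a TLS connection, and send the request line below.
    return dns_name, iri + "\r\n"

dns_name, request = prepare_request("gemini://gémeaux.bortzmeyer.org/")
# dns_name: "xn--gmeaux-bva.bortzmeyer.org"
# request:  "gemini://gémeaux.bortzmeyer.org/\r\n"
```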

makeworld

Link to individual message.

5. Stephane Bortzmeyer (stephane (a) sources.org)

On Fri, Dec 04, 2020 at 06:36:00PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 15 lines which said:

> I feel like this is probably the most intuitive method. Only
> use punycoding when it's a necessity, like for DNS lookups.

I don't know what the process is for proposing and discussing changes
to the Gemini specification (or for following the points under
discussion).

In the meantime, I've summarized the discussion here:

gemini://gemini.bortzmeyer.org/gemini/idn.gmi

Link to individual message.

6. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> I don't know what is the process for proposing and discussing changes
> in Gemini specification (or to follow the points in discussion).

It's basically what we're doing. At some point Solderpunk will
hopefully chime in and make a permanent change.

> In the mean time, I've summarized the discussion here:
>
> gemini://gemini.bortzmeyer.org/gemini/idn.gmi

Thanks for this. I forgot about certificates. I feel like for maximum
compatibility, clients should support both the punycoded and Unicode
version of the domain in certs. Anyone disagree?

As for Unicode normalization, I feel like that's complex, annoying, and
hopefully out-of-scope. There should be one Unicode string for each domain
only, and I really don't want to have to deal with anything else.


Cheers
makeworld

Link to individual message.

7. Stephane Bortzmeyer (stephane (a) sources.org)

On Sun, Dec 06, 2020 at 05:11:48PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 21 lines which said:

> As for Unicode normalization, I feel like that's complex, annoying,
> and hopefully out-of-scope. There should be one Unicode string for
> each domain only, and I really don't want to have to deal with
> anything else.

Well, OK, so let's settle on NFC for everybody? (Since this is what
RFC 5198 says.)
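What NFC does, concretely: the decomposed spelling of an accented letter is folded into the precomposed code point, so the two ways of typing "é" compare equal after normalization. A small stdlib illustration:

```python
# NFC folds "e" + U+0301 COMBINING ACUTE ACCENT into the single
# precomposed code point U+00E9, so both spellings of "é" compare equal.
import unicodedata

decomposed = "ge\u0301meaux"   # e + combining acute accent (two code points)
precomposed = "g\u00e9meaux"   # precomposed é (one code point)

assert decomposed != precomposed                     # raw strings differ
assert unicodedata.normalize("NFC", decomposed) == precomposed
```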

Link to individual message.

8. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > As for Unicode normalization, I feel like that's complex, annoying,
> > and hopefully out-of-scope. There should be one Unicode string for
> > each domain only, and I really don't want to have to deal with
> > anything else.
>
> Well, OK, so let's settle on NFC for everybody? (Since this is what
> RFC 5198 says.)

Do you mean all clients should do NFC? That seems to me like it would
make Gemini quite a bit more complex. I feel like NFC should be on the
user, if that's possible. Also I wonder how often this issue actually
occurs: are users really typing U+0065 U+0301 (e + combining acute)
instead of U+00E9 (é)?

makeworld

Link to individual message.

9. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Sunday, December 6, 2020 6:05 PM, A. E. Spencer-Reed <easrng at gmail.com> wrote:

> > are users really typing U+0065 U+0301 (e + combining acute) instead of U+00E9 (é)?
>
> In links, probably not, but maybe in the address bar. However, I was
> under the impression that precomposed characters are no longer
> supposed to be used, am I horribly wrong?

I don't know about that, but I hope that's true. It helps reinforce my
idea that normalization is way out of scope for Gemini clients.


makeworld

(Note I've CC'ed the gemini list, I think you forgot to Reply All in your message)

Link to individual message.

10. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

I'm not sure if the first email is the right one to reply to in
this case, but I've summarized the suggestions of this thread here:

https://github.com/makeworld-the-better-one/go-gemini/issues/10

I hope this makes it easier for other client authors to figure out
what to do, as well as for Solderpunk to make an official decision.


Cheers,
makeworld

Link to individual message.

11. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Sunday, December 6, 2020 7:40 PM, A. E. Spencer-Reed <easrng at gmail.com> wrote:

> > Do you mean all clients should do NFC? That seems to me like it
> > would make Gemini quite a bit more complex.
>
> Isn't that usually handled by the standard library anyway?

I suppose, yeah. Many languages might need to import a package, but it won't be
something the programmer is doing themselves, just like TLS.

I'm just wary of bringing in another large dependency, and I know Unicode to
be something complex, and something that will require updates. I would very
much like to hear Solderpunk's opinion on this.

makeworld

(You forgot to use Reply-All again, I've CC'ed the list.)

Link to individual message.

12. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > > Do you mean all clients should do NFC? That seems to me like it
> > > would make Gemini quite a bit more complex.
> >
> > Isn't that usually handled by the standard library anyway?
>
> I suppose, yeah. Many languages might need to import a package, but it won't be
> something the programmer is doing themselves, just like TLS.
>
> I'm just wary of bringing in another large dependency, and I know Unicode to
> be something complex, and something that will require updates. I would very
> much like to hear Solderpunk's opinion on this.

Having looked into it[1], it doesn't look that complicated, for Go
anyway. Perhaps it should be recommended for clients, but not required,
while the other things, like punycoding and sending the IDN to the
server, would be required.


makeworld

1: https://github.com/makeworld-the-better-one/go-gemini/issues/10#issuecomment-739604051

Link to individual message.

13. bie (bie (a) 202x.moe)

> I feel like this is probably the most intuitive method. Only
> use punycoding when it's a necessity, like for DNS lookups.
> 
> What about link lines though? I think that clients should
> accept both punycoded and Unicode domains in links. Convert
> all links' domain to punycode for DNS, then convert all links'
> domain to Unicode for sending. That seems a bit complicated,
> but from an author perspective it makes sense to support both.

Allowing IRIs is a *really big and breaking change*.

Right now, checking whether a gemini request is valid can be done really
easily, even in a language like C and with no external dependencies.

Converting the IRI to a URI in the client and having the server
configured with the punycode seems like a much cleaner, simpler and even
more robust solution to me.

bie

Link to individual message.

14. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

bie wrote:
> Allowing IRIs is a really big and breaking change.

Stephane Bortzmeyer mentioned IRIs in an earlier email in this thread.
I think that was probably a mistake, and if not, then I don't support it.
I should have caught it at the time but I didn't, sorry. All I have been
talking about the entire time, in this thread and in the GitHub issue[1]
you quote from, is IDNs -- Internationalized Domain Names.

You're right that using IRIs over URIs would be a big change, and a bad
one. I'm only talking about converting and messing with domains. In my
opinion, to keep things simple, no client should deal with IRIs at all.

I hope that sets the record straight. I'm only talking about domains.
Thanks for allowing me to clarify that.

>> What about link lines though?

This quote from the issue has been removed now. It was in reference to how
Amfora should work internally, not what low-level clients should do. I hope the
issue is more clear now. No IRIs! :)

Cheers,
makeworld

1: https://github.com/makeworld-the-better-one/go-gemini/issues/10

Link to individual message.

15. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 00:27, colecmac at protonmail.com wrote:
> 
> (Note I've CC'ed the gemini list, I think you forgot to Reply All in your message)

Sigh.

Link to individual message.

16. CΓ΄me Chilliet (come (a) chilliet.eu)

It's 2020, can we please be allowed to use french in our links?

It makes no sense I'd need to know two weird translitteration schemes by 
heart before I can link to ?checs.fr/fran?ais

I get that we need to encode spaces and other special delimiter 
characters, but other than that, what's the rationnal in limiting to ascii?
MCMic

Link to individual message.

17. bie (bie (a) 202x.moe)

On Mon, Dec 07, 2020 at 09:24:08AM +0100, Côme Chilliet wrote:
> It's 2020, can we please be allowed to use French in our links?
> 
> It makes no sense that I'd need to know two weird transliteration
> schemes by heart before I can link to échecs.fr/français
> 
> I get that we need to encode spaces and other special delimiter
> characters, but other than that, what's the rationale in limiting to ASCII?
> MCMic

There is one really good reason - it won't work well with existing
servers and clients.

Most servers (and especially servers that follow the spec) currently
only accept requests that provide a valid URI - so a request that
contains something outside the set of valid 84 characters should not be
accepted. Asking for servers to start accepting IRIs is a big change,
and a breaking change in my opinion, one that adds a lot of complexity
for very little value added.

Allowing such links in text/gemini, but asking clients to handle the
percent-encoding in the background has a similar problem - it goes
against what every single client is doing now.

The best solution, in my opinion, is to stick to URIs. If someone really
wants to be able to type links like your example into their text/gemini
files, there's even a solution for that: create a server that processes
the .gmi files on the fly and sends punycode/percent-encoded links to
the client.

bie

Link to individual message.

18. CΓ΄me Chilliet (come (a) chilliet.eu)

On Monday, 7 December 2020 at 10:29:34 CET, bie wrote:
> > I get that we need to encode spaces and other special delimiter
> > characters, but other than that, what's the rationale in limiting to ASCII?
> > MCMic
> 
> There is one really good reason - it won't work well with existing
> servers and clients.
> 
> Most servers (and especially servers that follow the spec) currently
> only accept requests that provide a valid URI - so a request that
> contains something outside the set of valid 84 characters should not be
> accepted. Asking for servers to start accepting IRIs is a big change,
> and a breaking change in my opinion, one that adds a lot of complexity
> for very little value added.

I have to disagree that using my own language to name files and pages on
my own server is of "very little value".

Servers already have to output UTF-8, why not accept UTF-8 in the input?

> Allowing such links in text/gemini, but asking clients to handle the
> percent-encoding in the background has a similar problem - it goes
> against what every single client is doing now.

I am not asking clients to handle percent-encoding; I expect my server to
receive the UTF-8 I put in the link.
I just tried: my server handles
gemini://gemlog.lanterne.chilliet.eu/français-test.gmi with no problem. (I
wrote the server myself, but I did not put any effort into supporting
this; I had never tried it before.)
Most clients will percent-encode the request, but with Lagrange, if I enter
it like this in the address bar, my server does receive the request not
percent-encoded and reacts well. (It also reacts well if percent-encoded,
of course, since it decodes to the same name.)

> The best solution, in my opinion, is to stick to URIs. If someone really
> wants to be able to type links like your example into their text/gemini
> files, there's even a solution for that - create as server that processes
> the .gmi files on the fly and sends punycode/percent-encoded links to
> the client.

Gemini has taken steps in the right direction by defaulting to UTF-8 and
specifying that there is no default value for lang. It would make a lot of
sense to accept UTF-8 in requests as well and not arbitrarily limit them
to ASCII just because of web history.

I understand that punycode will have to be used for the DNS lookup, but
that's the DNS specification's concern, not Gemini's responsibility. I
fail to see why the Gemini request should be punycoded, or percent-encoded
except for special delimiter characters.

MCMic

Link to individual message.

19. Stephane Bortzmeyer (stephane (a) sources.org)

On Sun, Dec 06, 2020 at 05:30:16PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 15 lines which said:

> I feel like NFC should be on the user, if that's possible. Also I
> wonder how often this issue actually occurs, are users really typing
> U+0065 U+0301 (e + combining acute) instead of U+00E9 (é)?

Users don't input Unicode code points! They input characters (using
various methods), and, behind the scene, the input methods they use
produce code points. This is typically not under the control of the
user.

Link to individual message.

20. Stephane Bortzmeyer (stephane (a) sources.org)

On Sun, Dec 06, 2020 at 11:27:11PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 18 lines which said:

> > However, I was under the impression that precomposed characters
> > are no longer supposed to be used, am I horribly wrong?
> 
> I don't know about that, but I hope that's true.

Quite the contrary. RFC 5198 mandates NFC, which maps many characters
to the precomposed form.

Link to individual message.

21. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 03:24:25AM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 28 lines which said:

> Stephane Bortzmeyer mentioned IRIs in an earlier email in this thread.
> I think that was probably a mistake,

No, it wasn't. But it's true that there are two technically different
issues, the domain name and the path, which, unfortunately, may require
different treatments.

From the point of view of users, I believe it will be hard to explain
that Unicode characters are allowed in the domain name but not in the
path, or vice-versa.

Link to individual message.

22. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 09:24:08AM +0100,
 C?me Chilliet <come at chilliet.eu> wrote 
 a message of 8 lines which said:

> It's 2020, can we please be allowed to use french in our links?

And it is even more important for people who use scripts like Arabic,
Chinese, Devanagari, etc.

> what's the rationnal in limiting to ascii?

Well, properly handling IRIs requires changing the specification, and
also changing software. Since one of the points of Gemini is to be
simple to implement, this certainly requires consideration. However, it
is an issue similar to the TLS one: TLS is big and complicated
(certainly even more so than Unicode) and yet Gemini *requires* it. After
all, it will typically be handled by a library, not by the guy or gal
who writes yet another Gemini server. Same thing for Unicode.

Link to individual message.

23. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 06:29:34PM +0900,
 bie <bie at 202x.moe> wrote 
 a message of 29 lines which said:

> There is one really good reason - it won't work well with existing
> servers and clients.

This is a bad reason since the specification is not stabilized yet and
Gemini is basically very experimental. Nothing is cast in stone and we
don't have to maintain compatibility.

> for very little value added.

I strongly disagree. If Gemini is only for the world elites who speak
English, it is much less interesting.

Link to individual message.

24. bie (bie (a) 202x.moe)

> > for very little value added.
> 
> I strongly disagree. If Gemini is only for the world elites who speak
> English, it is much less interesting.

It's not, though.

I've got servers running on international domain names, and the majority
of the pages I'm serving have Japanese characters in the paths.
This works *today* in every single gemini client I've tried, because the
paths are valid URIs (percent-encoded) and the domains work with
punycode.

Nice clients can show decoded paths and decoded domains, but the beauty
of the current approach is that they don't have to.

bie

Link to individual message.

25. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 11:47:19AM +0100,
 Stephane Bortzmeyer <stephane at sources.org> wrote 
 a message of 14 lines which said:

> No, it wasn't. But it's true that there are two technically different
> issues, the domain name and the path which, unfortunately, may require
> different treatments.
> 
> From the point of view of users, I believe it will be hard to explain
> that Unicode characters are allowed in the domain name but not in the
> path, or vice-versa.

An example of an IRI issue with the Lagrange client
<https://github.com/skyjake/lagrange/issues/73>

Link to individual message.

26. marc (marcx2 (a) welz.org.za)

Hi

> > It's 2020, can we please be allowed to use french in our links?
> 
> And it is even more important for people who use scripts like arabic,
> chinese, devanageri, etc.

I have yet to be convinced that Unicode URLs are a good thing.

And I say that as a native speaker of a language which
includes glyphs which aren't in US ASCII.

A URL is an address, in the same way that a phone
number or an IP address is an address. Ideally these are globally
unique, unambiguous and representable everywhere.
This address scheme should be independent of any localisation.

We don't insist that phone numbers are rendered in roman
numerals either. My dialing prefix isn't +XXVII. The
gemini:// prefix isn't tweeling:// in dutch.

Using Unicode in addresses balkanises this global space into
separate little domains, with subtle ambiguities (is the
Cyrillic С the same as a Latin C? who knows?), reducing
security, and making crossover harder. If somebody points
me at a URL in kanji or Ethiopian, I would have great
difficulty remembering, never mind recreating it, even if the
photo there is useful to the rest of the world. If you
are saying "what about the guy from Ethiopia?", well, I suspect he
would have trouble with kanji too... without a common
denominator this is an N^2 problem.

I appreciate that many languages are in decline and even
facing extinction - but interacting with the internet requires
a jargon or specialisation anyway, in the same way that botanists
invoke latin names, mathematicians write about eigenvectors
and brain surgeons talk about the hippocampus, all regardless
of which languages they speak at home.

TLDR: the words after the gemini => link can be Unicode; the
link itself should not be.

regards

marc

Link to individual message.

27. bie (bie (a) 202x.moe)

> > And it is even more important for people who use scripts like arabic,
> > chinese, devanageri, etc.
> 
> I am to be convinced that unicode URLs are a good thing.
> 
> And I say that as a native speaker of a language which
> includes glyphs which aren't in US ASCII.
> 
> An URL is an address, in the same way that a phone
> number or an IP is an address. Ideally these are globally
> unique, unambiguous and representable everywhere.
> This address scheme should be independent of a localisation.
> 
> We don't insist that phone numbers are rendered in roman
> numerals either. My dialing prefix isn't +XXVII. The
> gemini:// prefix isn't tweeling:// in dutch.
> 
> Using unicode in addresses balkanises this global space into
> separate little domains, with subtle ambiguities (is the
> Cyrillic С the same as a Latin C? who knows?), reducing
> security, and making crossover harder. If somebody points
> me at an url in kanji or ethiopian, I would have great
> difficulty remembering nevermind recreating it, even if the 
> photo there is useful to the rest of the world. If you 
> are saying what about the guy from Ethiopia - well, I suspect he
> would have trouble with kanji too... without a common
> denominator this is an N^2 problem.
> 
> I appreciate that many languages are in decline and even
> facing extinction - but interacting with the internet requires
> a jargon or specialisation anyway, in the same way that botanists
> invoke latin names, mathematicians write about eigenvectors
> and brain surgeons talk about the hippocampus, all regardless
> of which languages they speak at home.
> 
> TLDR: The words after the gemini => link can be unicode, the
> link itself should not.

I mostly agree with this in the sense that the protocol and text/gemini
should stick to URLs that are URI-safe (nothing outside the safe
80-something characters).

That said, I don't think there's anything wrong with a friendly client
showing percent-decoded unicode representations of a path or
punycode-decoded representations of an international domain name in the
address bar or anywhere else in the interface.

In the same vein, if a server wants to be extra friendly to gmi file
authors, it can, like I suggested earlier, allow users to name and link
to files in Unicode, but percent-encode everything before sending it
over the wire. I actually implemented this in my personal Gemini server
today, and it was a trivial change (especially compared to what I'd
have to do to properly validate IRIs...), allowing me to write "=> 雑念/
雑念" and have it sent to the client as "=> %e9%9b%91%e5%bf%b5/ 雑念".
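That server-side rewrite can be sketched roughly like this (a hypothetical helper, assuming link lines use a single space between URL and label):

```python
# Percent-encode the URL part of a text/gemini "=>" line before it goes
# over the wire; the human-readable label after the URL is left alone.
from urllib.parse import quote

def encode_link_line(line: str) -> str:
    if not line.startswith("=>"):
        return line                      # not a link line: pass through
    rest = line[2:].lstrip()
    url, _, label = rest.partition(" ")  # assumes a single-space separator
    encoded = quote(url, safe="/%:?#")   # keep URI delimiters intact
    return "=> " + encoded + ((" " + label) if label else "")

print(encode_link_line("=> 雑念/ 雑念"))
# => %E9%9B%91%E5%BF%B5/ 雑念
```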

bie

Link to individual message.

28. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Monday, December 7, 2020 5:44 AM, Stephane Bortzmeyer <stephane at sources.org> wrote:

> On Sun, Dec 06, 2020 at 11:27:11PM +0000,
> colecmac at protonmail.com colecmac at protonmail.com wrote
> a message of 18 lines which said:
>
> > > However, I was under the impression that precomposed characters
> > > are no longer supposed to be used, am I horribly wrong?
> >
> > I don't know about that, but I hope that's true.
>
> Quite the contrary. RFC 5198 mandates NFC, which maps many characters
> to the precomposed form.

Yep, you're right. I misread and thought this was being said about
the decomposed form.


makeworld

Link to individual message.

29. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

Not sure exactly where to jump in, so I'm gonna share my thoughts here.
I think having IRIs would be nice, and I feel bad that non-English
authors are currently second-classed in this manner. But from the
beginning Gemini has been about being simple -- not for authors, but
for programmers. It was intended to be implementable in a weekend,
in a few hundred lines, ideally without even needing libraries outside
your language's stdlib.

Supporting IRIs is *not* simple. For example, in Python it requires a
third-party library[1], and in Go I wasn't even able to find one. This
means that in many programming languages, no one would be able to even
begin writing a Gemini client before writing a library that parses and
conforms to the complex specification that is IRIs.

Secondly, this would be a large breaking change for Gemini. Even if IRIs
were supported in all programming languages, I don't think making breaking
changes to Gemini is feasible at this point. Things are too set, and
attempting to do this would break the ecosystem.

Lower down in the thread, Stephane Bortzmeyer mentions:

> From the point of view of users, I believe it will be hard to explain
> that Unicode characters are allowed in the domain name but not in the
> path, or vice-versa.

This is true and unfortunate. My proposal[2] is only about domain names,
and so this would have to be explained to users. But as I've outlined
above, using IRIs would be virtually impossible, and so I think supporting
IDNs in link lines is the best we can give non-English authors.


1: https://stackoverflow.com/a/12565315/7361270
2: https://github.com/makeworld-the-better-one/go-gemini/issues/10


Thanks,
makeworld

Link to individual message.

30. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 14:09, bie <bie at 202x.moe> wrote:
> 
> That said, I don't think there's anything wrong with a friendly client
> showing percent-decoded unicode representations of a path or
> punycode-decoded representations of an international domain name in the
> address bar or anywhere else in the interface.
> 
> In the same vein, if a server wants to be extra friendly to gmi file
> authors, it can, like I suggested earlier, allow users to name and link
> to files in unicode, but percent-encode everything before sending it to
> over the wire.

This.

It's the job of the internationally minded client and server to do the 
proper legwork for the end user so the over-the-wire format is correct.

No need to change anything in the protocol itself, but rather an 
opportunity for clients and servers to distinguish themselves.

Alternatively: 

Unidecode! 
https://interglacial.com/tpj/22/
Sean M. Burke, Winter, 2001

Link to individual message.

31. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Monday, December 7, 2020 8:56 AM, <colecmac at protonmail.com> wrote:

> Not sure exactly where to jump in, so I'm gonna share my thoughts here.
> <snip>

One last thing to add to this email:

Now that I've outlined the issues with IRIs, could we get back to talking
about the original IDN idea? Does anyone have issues with this proposal[1]?
I'm hoping to consolidate everything there, and Solderpunk can look at that
and make his decision.


1: https://github.com/makeworld-the-better-one/go-gemini/issues/10


Cheers,
makeworld

Link to individual message.

32. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 08:19:07PM +0900,
 bie <bie at 202x.moe> wrote 
 a message of 17 lines which said:

> I've got servers running on international domain names, and the majority
> of the pages I'm serving have Japanese characters in the paths.
> This works *today* in every single gemini client I've tried,

I don't know which ones you tried but Amfora, AV-98, Bombadillo and
Lagrange all fail on such names.

Link to individual message.

33. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 01:30:41PM +0100,
 marc <marcx2 at welz.org.za> wrote 
 a message of 45 lines which said:

> An URL is an address, in the same way that a phone number or an IP
> is an address. Ideally these are globally unique, unambiguous and
> representable everywhere.  This address scheme should be independent
> of a localisation.

This theory, in the world of domain names, is wrong. RFC 2277 says
that "protocol elements" (basically, the things the user does not see
such as the MIME type text/gemini) do not have to be
internationalized. Everything else ("text", says the RFC) must be
internationalized, simply because the world is like that, with
multiple scripts and languages. Now, identifiers, like domain names,
are a complicated case, since they are both protocol elements and
text. But, since they are widely visible (in advertisments, business
cards, etc), I believe they should be internationalized, too.

> Using unicode in addresses balkanises this global space

The English-speaking space is not a global space: it is the space of a
minority of the world population.

> subtle ambiguities (is the Cyrillic С the same as a Latin C? who
> knows?),

There is no ambiguity: U+0421 (Cyrillic С) is different from U+0043 (Latin C).

> reducing security,

That's false. I am still waiting to see an actual phishing email with
Unicode. Most of the time, the phisher does not even bother to use a
realistic URL; they advertise <http://evil.example/famousbank> and it
works (few people check URLs).

Anyway, the goal of Gemini is not to do online banking, so this is not
really an issue.

> If somebody points me at an url in kanji or ethiopian, I would have
> great difficulty remembering nevermind recreating it,

It is safe to assume that a URL in Ethiopian is for people who speak
the relevant language, so it is not a problem.

> without a common denominator this is an N^2 problem.

There is no common denominator (unless someone decided that everybody
must use English, but I don't remember such a decision).

> but interacting with the internet requires
> a jargon or specialisation anyway, in the same way that botanists
> invoke latin names, mathematicians write about eigenvectors
> and brain surgeons talk about the hippocampus, all regardless
> of which languages they speak at home.

OK, then let's all use Hangul for URLs. (It's a nice script, very
regular, so it is convenient for computer programs.)

Link to individual message.

34. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 03:07:49PM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 31 lines which said:

> It's the job of the internationally minded client and server to do
> the proper legwork for the end user so the over-the-wire format is
> correct.
 
> No need to change anything in the protocol itself, but rather an
> opportunity for clients and servers to distinguish themselves.

This is what gives us the current situation. There is no
interoperability because each client and server did it in a different
way, or not at all.

Since Gemini (for good reasons) has no User-Agent, no negotiation of
options, we must specify clearly how Unicode is handled or the
geminispace won't be safe for Unicode.

Link to individual message.

35. bie (bie (a) 202x.moe)

> > I've got servers running on international domain names, and the majority
> > of the pages I'm serving have Japanese characters in the paths.
> > This works *today* in every single gemini client I've tried,
> 
> I don't know which ones you tried but Amfora, AV-98, Bombadillo and
> Lagrange all fail on such names.

You cut off the important part of my reply, which specifies that the
names are percent-encoded or punycoded... here are some examples, all
work in Amfora, AV-98 and Lagrange (probably Bombadillo too):

gemini://blekksprut.net/%e6%97%a5%e5%b8%b8%e9%91%91%e8%b3%9e/ (a
friendly client could choose to display this as
gemini://blekksprut.net/ζ—₯εΈΈι‘‘θ³ž/)

gemini://xn--td2a.jp/ (a friendly client could choose to display this as
gemini://蛸.jp/)

Even in the simplest of user-agents, these URIs work, and more advanced
clients can choose to display them in more user-friendly ways.

bie

Link to individual message.

36. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 16:34, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> This is what gives us the current situation. There is no
> interoperability because each client and server did it in a different
> way, or not at all.

As pointed out -and demonstrated- by <bie> multiple times, all is good... 
as long as one bothers to properly encode everything :)

And of course, clients and servers may want to go the extra length to 
facilitate the encoding for the end users.

Link to individual message.

37. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 16:32, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> The english-speaking space is not a global space

No, but the internet plumbing is de facto US-ASCII. 

This doesn't have to be a problem if one doesn't make it so.

"How can you govern a planet which has 1,965,246 varieties of encoding?" 
-- Charlie de la Gaule

Link to individual message.

38. Solene Rapenne (solene (a) perso.pw)

On Mon, 7 Dec 2020 17:33:23 +0100
Petite Abeille <petite.abeille at gmail.com>:

> > On Dec 7, 2020, at 16:32, Stephane Bortzmeyer <stephane at sources.org> wrote:
> > 
> > The english-speaking space is not a global space  
> 
> No, but the internet plumbing is de facto US-ASCII. 

If you don't start somewhere, that will never improve.
 
> This doesn't have to be a problem if one doesn't make it so.
> 
> "How can you govern a planet which has 1,965,246 varieties of encoding?" 
> -- Charlie de la Gaule

I'm pretty sure it was about cheeses in the original quote :)

Link to individual message.

39. CΓ΄me Chilliet (come (a) chilliet.eu)

Hi,

Some thoughts on answers on the topic of unicode links. (I will focus on 
unicode in the path rather than in the domain here.)

First, I wanted to point out that almost no one uses them on the French 
Web. Some used that as an argument against having unicode in URIs, but I 
think no one uses them because of the punycode and percent-encoding weirdness.

I read part of RFC 3987 (IRI) and part of RFC 3986 (URI) and still do 
not understand what the horrible added complexity you are talking about is.
Could people asserting that IRI is a complex hell impossible to implement point 
to the exact problems with IRI?

Here is the life cycle of a link in a page:

1 - The author writes it
2 - The server saves it
3 - A client requests the page from the server
4 - The server sends it
5 - The client displays it
6 - The user clicks it
7 - The client resolves the hostname
8 - The client sends it as a request to the server
9 - The server fetches the associated page

I think we can safely assume that the author will not write percent 
encoding without help.

So, with bie's suggestion that clients and servers help by 
percent-encoding, while the author/user only has to deal with unicode, it means:
1 - somewhere between step 1 and step 4, the server has to percent-encode the link
2 - somewhere between step 4 and step 5, the client needs to decode it
3 - in step 8, either the client stored the encoded link or has to re-encode it 
again; if someone copies/pastes, it has to be re-encoded too
4 - in step 9, the server needs to decode it to get the real target path

If we just use the utf-8 path all along, points 1 through 3 are not needed. 
Point 4 still is, because some links will still be percent-encoded and the 
server needs to understand them.

> Petite Abeille <petite.abeille at gmail.com>:
> No, but the internet plumbing is de facto US-ASCII. 

If this is true, why bother with responses in utf-8?

Regarding the breaking-change argument, I think it is a bit weak: testing 
shows there is no consistency in how different clients/servers handle unicode currently.

> bie:
> I actually implemented this in my personal gemini server
> today, and it was a trivial change (especially when compared to what I'd
> have to do to properly validate IRIs...), allowing me to write "=> ι›‘εΏ΅/
> 雑念" and have it sent to the client as "=> %e9%9b%91%e5%bf%b5/ 雑念".

If you are all so comfortable with links that look like 
"%e9%9b%91%e5%bf%b5", let's go the whole way and percent-encode ascii as well.
Let's see how long before you change your mind after using this kind of 
stuff on a daily basis. And at least this would put all languages at the same point.
	
> colecmac at protonmail.com
> Supporting IRIs is *not* simple. For example, in Python it requires a
> third-party library[1], and in Go I wasn't even able to find one. This
> means that in many programming languages, no one would be able to even
> begin writing a Gemini client before writing a library that parses and
> conforms to the complex specification that is IRIs.

On the server I wrote in PHP, getting a request in UTF-8 worked without me 
doing anything for it. Not accepting IRI would actually require me to 
add extra code to reject it.

In these languages, does it mean they are actively checking for non-ascii 
characters? Or are they using a string format which is not compatible with utf-8?
They need to speak UTF-8 for the response anyway.
I get that *validating* an IRI might be hard, but is it the job of the 
server to validate it? I just use whatever input is thrown at me and suppose it is valid.
(Note that these are real, non-rhetorical questions; I'm not trying to deny 
that handling IRI would be hard, I'm trying to understand why.)

(On a more general note, I guess everyone understood English is not my 
mother tongue; sorry if I'm being rude or something like that, I'm not 
trying to be. I just really believe using utf-8 here would be better, but I 
understand there are complex technical questions involved.)

MCMic

Link to individual message.

40. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

CΓ΄me's reply here asserts that a client would never need to parse
IRIs, and so there's no added complexity. Just copy the IRI from the
link line, do DNS, and send the IRI to the server. But this is not
true, a client would need to do parsing.

What parsing would a client have to do?

- Extracting the domain, so it can be punycoded for DNS lookups
- Resolving relative IRIs would require parsing the current IRI,
   and the provided one, and combining them. You cannot just copy it
   to make the request.
- When receiving an input status code on a page that already has a
   query string, the IRI has to be parsed to detect that there is a
   query string, and then remove and replace it with the new input of
   the user.
- Extracting the path to get a name for downloading files
- Etc.

There are many reasons why a client would need to be able to parse an
IRI, the relative link one and DNS one being the most important.
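Two of those tasks can be sketched with a stock URI parser (Python's
urllib shown here purely as an illustration; the host and query strings
are made up):

```python
import urllib.parse

url = "gemini://example.org/search?old%20terms"

# Extracting the host, so it can be punycoded for DNS lookups:
parts = urllib.parse.urlsplit(url)
assert parts.hostname == "example.org"

# Handling a 1x (input) response on a URL that already has a query
# string: drop the old query and substitute the user's new input.
new_input = urllib.parse.quote("new terms")
new_url = parts._replace(query=new_input).geturl()
assert new_url == "gemini://example.org/search?new%20terms"
```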

This would then require IRI parsing libraries, and as I have explained
earlier, these don't exist in likely many programming languages, and
when they do, they are third-party.

For this reason, as well as the previously stated reason of this being
a large breaking change, I can't support a switch to IRIs.

IDNs, on the other hand... :)

Cheers,
makeworld

Link to individual message.

41. Adnan Maolood (me (a) adnano.co)


On Mon Dec 7, 2020 at 7:30 AM EST, marc wrote:
> Using unicode in addresses balkanises this global space into
> separate little domains, with subtle ambiguities (is the
> cyrilic C the same as a latin - C, who knows ?), reducing
> security, and making crossover harder.

I don't think that using unicode in addresses would decrease security
because of the way that Gemini handles client authentication. Since
client certificates are limited to certain domains and paths, the
certificate will never be applied to the wrong domain, even if it looks
the same to the user.

Link to individual message.

42. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 18:01, Solene Rapenne <solene at perso.pw> wrote:
> 
>> No, but the internet plumbing is de facto US-ASCII. 
> 
> If you don't start somewhere, that will never improve.

Would it make it less controversial if we refer to it as ISO-IR-006 encoding? :D

Link to individual message.

43. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 18:35, C?me Chilliet <come at chilliet.eu> wrote:
> 
> If this is true, why bother with responses in utf-8?

The response has a textual part at times, which is UTF-8 encoded. Assuming 
a 2x response code, the content itself is defined by its content-type, 
which can be anything, in any encoding, following any relevant 
convention. UTF-8 (aka Universal Coded Character Set Transformation Format, 
8-bit) itself is an encoding of Unicode. There is no such thing as plain text in 2020.

But this is about URLs, no? 

As long as gemini follows established standards, then one must deal with 
encodings as defined by those standards.

Not sure why this is controversial. The tooling exists. No one writes 
UTF-8 by hand. Ditto for URL encoding/decoding.

Use the Tools, Luke.

P.S.
There is perhaps a bit of a sleight of hand running through gemini's 
rhetoric about how "simple" everything is. But nothing is *that* simple 
once one looks at the details. The rabbit hole runs deep. Rome was not 
built in one day. Nor are gemini's foundations. 

P.P.S.
For entertainment purpose, the DNS RFC dependency graph [pdf]:
https://emaillab.jp/wp/wp-content/uploads/2017/11/RFC-DNS.pdf

Link to individual message.

44. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 18:35, C?me Chilliet <come at chilliet.eu> wrote:
> 
> First, I wanted to point out that almost no one uses them on the French
> Web. Some used that as an argument against having unicode in URIs, but I
> think no one uses them because of the punycode and percent encoding weirdness.

Your very own email address is a good example of where tooling makes a difference.

It nicely reads as CΓ΄me Chilliet <come at chilliet.eu> -with accent 
circonflexe & all- but of course is ISO-IR-006 encoded under the hood as 
=?ISO-8859-1?Q?C=F4me?= Chilliet <come at chilliet.eu>.

I suspect you didn't type the encoding by hand, nor think about it twice. 
It "just" works :)

Link to individual message.

45. Scot (gmi1 (a) scotdoyle.com)

On 12/7/20 12:00 PM, colecmac at protonmail.com wrote:
> What parsing would a client have to do?
>
> - Extracting the domain, so it can be punycoded for DNS lookups
>

Can we be sure gemini host resolution will always use the global DNS?

Section 4 of RFC 6055 cautions against assuming that all name resolution
is using the global DNS and therefore that querying with punycode
domain names will succeed:

   It is inappropriate for an application that calls a general-purpose
   name resolution library to convert a name to an A-label unless the
   application is absolutely certain that, in all environments where the
   application might be used, only the global DNS that uses IDNA
   A-labels actually will be used to resolve the name.

Conversely, querying with utf8 domain names fails on Ubuntu 20.04
using systemd-resolved [1].

Some languages/libraries such as Python convert utf8 requests to
punycode silently before submitting the request to the resolver [2].



[1] C program fails without punycode conversion
#include <netdb.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
void show_ip(char *name) {
   struct hostent *entry;
   entry = gethostbyname(name);
   if (entry) {
     printf("name '%s' has ip address\n", entry->h_name);
     printf("ip: %s\n\n", inet_ntoa(*(struct in_addr *)entry->h_addr));
   } else {
     printf("error querying '%s': %s\n", name, hstrerror(h_errno));
   }
}
int main() {
   show_ip("xn--td2a.jp");
   show_ip("蛸.jp");
   return 0;
}


[2] Python program succeeds with *implicit* punycode conversion
import socket
def show_ip(name):
   print("name '%s' has ip '%s'" % (name, (socket.gethostbyname(name))))
show_ip('xn--td2a.jp')
show_ip('蛸.jp')

Link to individual message.

46. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 19:00, colecmac at protonmail.com wrote:
> 
> IDNs, on the other hand... :)

The "Internationalized Domain Names (IDN) FAQ" makes for entertaining reading:

https://unicode.org/faq/idn.html

Special mention of our very own StΓ©phane Bortzmeyer under "Doesn't the 
removal of symbols and punctuation in IDNA2008 help security?":

Le hameçonnage n'a pas de rapport avec les IDN ("Phishing has nothing to do with IDNs")
https://www.bortzmeyer.org/idn-et-phishing.html

(short answer: no)

All encrypted in French sadly :P

Happy hameçonnage.

Fun, fun, fun.

Link to individual message.

47. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

On Monday, December 7, 2020 4:00 PM, Scot <gmi1 at scotdoyle.com> wrote:

> On 12/7/20 12:00 PM, colecmac at protonmail.com wrote:
>
> > What parsing would a client have to do?
> >
> > -   Extracting the domain, so it can be punycoded for DNS lookups
>
> Can we be sure gemini host resolution will always use the global DNS?
>
> Section 4 of RFC 6055 cautions against assuming that all name resolution
> is using the global DNS and therefore that querying with punycode
> domain names will succeed:
>
>    It is inappropriate for an application that calls a general-purpose
>   name resolution library to convert a name to an A-label unless the
>   application is absolutely certain that, in all environments where the
>   application might be used, only the global DNS that uses IDNA
>   A-labels actually will be used to resolve the name.

That's interesting, thanks for sharing. However, it seems obvious to me
that punycoding is a necessity, since the global DNS system won't work
without it. I've worked with offline mesh network systems, but never
had to handle Unicode domain names. However, all of our stack was software
that was intended to work on the Internet, as well as any other network.
Standard DNS servers, standard OS and stdlib DNS resolvers, etc. So
punycoding would be the right way to do it in that network too.

Despite what this RFC says, I don't see what situation would actually
completely fail on punycoded domains. I guess the spec could mandate trying
with punycode first, then Unicode, but that seems needless to me. Do you
have an example of a system/network that fails on punycode?

> Conversely, querying with utf8 domain names fails on Ubuntu 20.04
> using systemd-resolved [1].

Yep, that's what I meant when I called it a necessity.

> Some languages/libraries such as Python convert utf8 requests to
> punycode silently before submitting the request to the resolver [2].

That's pretty handy, but it doesn't change my advice. The spec can state
that all domains must be punycoded for DNS, and maybe your library
will handle that or not. Even if an unaware Pythonista manually punycodes
the domain, nothing bad will happen when the library tries again.
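That claim is easy to check, at least with Python's built-in IDNA codec
(a sketch, not a normative statement about every library): an already
punycoded, all-ASCII name passes through the ToASCII step unchanged, so
a double conversion is harmless.

```python
# IDNA ToASCII on an all-ASCII label is a pass-through, so punycoding
# twice yields the same wire form as punycoding once.
assert "蛸.jp".encode("idna") == b"xn--td2a.jp"
assert "xn--td2a.jp".encode("idna") == b"xn--td2a.jp"
```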

Cheers,
makeworld

Link to individual message.

48. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 7, 2020, at 18:35, C?me Chilliet <come at chilliet.eu> wrote:
> 
> (On a more general note, I guess everyone understood English is not my
> mother tongue, sorry if I'm being rude or something like that, I'm not
> trying to. I just really believe using utf-8 here would be better, but I
> understand there are complex technical questions involved)

(hopefully) this space operates under the so-called "Crocker's Rules"*:

(perhaps) worthwhile quoting in full:

Declaring yourself to be operating by "Crocker's Rules" means that other 
people are allowed to optimize their messages for information, not for 
being nice to you.  Crocker's Rules means that you have accepted full 
responsibility for the operation of your own mind - if you're offended, 
it's your fault.  Anyone is allowed to call you a moron and claim to be 
doing you a favor.  (Which, in point of fact, they would be.  One of the 
big problems with this culture is that everyone's afraid to tell you 
you're wrong, or they think they have to dance around it.)  Two people 
using Crocker's Rules should be able to communicate all relevant 
information in the minimum amount of time, without paraphrasing or social 
formatting.  Obviously, don't declare yourself to be operating by 
Crocker's Rules unless you have that kind of mental discipline.
Note that Crocker's Rules does not mean you can insult people; it means 
that other people don't have to worry about whether they are insulting 
you.  Crocker's Rules are a discipline, not a privilege.  Furthermore, 
taking advantage of Crocker's Rules does not imply reciprocity.  How could 
it?  Crocker's Rules are something you do for yourself, to maximize 
information received - not something you grit your teeth over and do as a favor.

"Crocker's Rules" are named after Lee Daniel Crocker.

http://sl4.org/crocker.html
https://en.wikipedia.org/wiki/Lee_Daniel_Crocker


Link to individual message.

49. Sean Conner (sean (a) conman.org)

It was thus said that the Great CΓ΄me Chilliet once stated:
> Hi,
> 
> Some thoughts on answers on the topic of unicode links. (I will focus on
> unicode in path rather than in domain here).
> 
> First, I wanted to point out that almost no-one uses them on the french
> Web. Some used that as an argument against having unicode in URIs, but I
> think no one uses them because of the punycode and percent encoding
> weirdness.
> 
> I read part of the RFC 3987 (IRI) and part of RFC 3986 (URI) and still do
> not understand what is the horrible added complexity you are talking
> about. Could people asserting IRI is a complex hell impossible to
> implement point to the exact problems with IRI?

  I'm reading through RFC-3987, and sections 4 and 5 give me pause.  Section
4 relates to bidirectional IRIs (right-to-left languages).  This is mostly a
client issue (I think) with the displaying of such.

  Section 5 is the scarier of the two---normalization and comparison, and
would most likely affect servers than clients (again, I think).  There are
two examples given:

	http://www.example.org/rΓ©sumΓ©.html
	http://www.example.org/reΜsumeΜ.html

  The first uses a precomposed character and the second uses a combining
character.  I'm looking at the Unicode normalization standard [1], and the
first thing that struck me was I had *not* thought of the order of multiple
combining characters.  Oh, there's also Hangul and conjoining jamo.  And
then ... well, I'll spare the horrors of that 32k document, but the upshot
is---yes, that's yet *another* library I have to track down (and update as
the Unicode standard is regularly updated).

  Also, related question---what's the filename on the server?

  The "horrible added complexity" is not RFC-3987 per se, but the "horrible
added complexity" of Unicode normalization that is required.  Is that a
valid excuse?  Perhaps not.  But there *is* the issue that a lot of people
are having with Python 3 and filenames.  If you hit a filename that isn't
UTF-8, the Python 3 script breaks badly.  Yes, there *is* a link in my
mind between these two issues but I'm not sure I can verbalize it coherently
at this time.  Perhaps "I will focus on unicode in the path" reminded me of
the Python 3 issue.
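The normalization problem described above can be seen directly with
Python's unicodedata module (a sketch of the rΓ©sumΓ© example from RFC
3987):

```python
import unicodedata

precomposed = "r\u00e9sum\u00e9.html"    # uses U+00E9, precomposed Γ©
combining = "re\u0301sume\u0301.html"    # uses e + U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal...
assert precomposed != combining
# ...until both are normalized to the same form (NFC here).
assert unicodedata.normalize("NFC", combining) == precomposed
```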

> Regarding the breaking change argument, I think it is a bit weak, testing
> shows there is no consistency in how different clients/servers handles
> unicode currently.

 ...

> (Note that these are real non-rethorical questions, I?m not trying to deny
> that handling IRI would be hard, I?m trying to understand why)

  Methinks you inadvertently answered your own question---Unicode is *not*
easy [1][2][3][4].

  -spc

[1]	https://www.unicode.org/reports/tr15/tr15-50.html

[2]	https://www.unicode.org/reports/tr9/tr9-42.html

[3]	https://www.unicode.org/reports/tr14/tr14-45.html

[4]	Among others.  The full current standard:

	http://www.unicode.org/versions/Unicode13.0.0/

Link to individual message.

50. CΓ΄me Chilliet (come (a) chilliet.eu)

Le lundi 7 dΓ©cembre 2020, 19:00:02 CET colecmac at protonmail.com a Γ©crit :
> CΓ΄me's reply here asserts that a client would never need to parse
> IRIs, and so there's no added complexity. Just copy the IRI from the
> link line, do DNS, and send the IRI to the server. But this is not
> true, a client would need to do parsing.
> 
> What parsing would a client have to do?
> 
> - Extracting the domain, so it can be punycoded for DNS lookups

True, thanks for pointing that out.

> - Resolving relative IRIs would require parsing the current IRI,
>    and the provided one, and combining them. You cannot just copy it
>    to make the request.

Also true, but it should be the same value that is extracted for DNS.

> - When receiving an input status code on a page that already has a
>    query string, the IRI has to be parsed to detect that there is a
>    query string, and then remove and replace it with the new input of
>    the user.

Good to know, I did not think of query string situation.

> - Extracting the path to get a name for downloading files
> - Etc.
> 
> There are many reasons why a client would need to be able to parse an
> IRI, the relative link one and DNS one being the most important.
> 
> This would then require IRI parsing libraries, and as I have explained
> earlier, these don't exist in likely many programming languages, and
> when they do, they are third-party.

 From what you said on irc, the situation is different between URI and IRI 
because most languages have URI parsing either in their stdlib or in a 
well tested known library.
But, if no project uses IRI, of course no one will write a library for it; 
this is a chicken-and-egg situation here.

Also, for the purpose of a client, it seems to me the parsing needed 
(domain and query extraction) is only to search for the first "/" and the 
last "?", and maybe some minor tweaks on the scheme (which does not 
contain unicode; I will leave the scheme alone, promise).

Note: Just tried 
gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69 in 
lagrange, it does work.

CΓ΄me

Link to individual message.

51. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

Glad to hear that you realize how it's more complex than you originally
thought.

> > -   Resolving relative IRIs would require parsing the current IRI,
> >     and the provided one, and combining them. You cannot just copy it
> >     to make the request.
> >
>
> Also true, but it should be the same value that is extracted for DNS.

No. I'm referring to things like this:

=> /docs/
=> example.gmi
=> dir/test/foo.gmi
=> //gus.guru/

These are all relative in some way, and they must be resolved in reference
to the IRI (or for Gemini right now, the URI) of the current page. This is not
the same as the domain that was extracted for DNS, and requires a full
parser.
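To illustrate, here is how those references resolve against a made-up
base URL using Python's urllib; note that the stdlib's urljoin ignores
schemes it does not know about, so gemini has to be registered first,
which is itself a small taste of the hidden complexity:

```python
import urllib.parse

# urljoin only applies RFC 3986 relative resolution to schemes it knows
# about, so register gemini before using it.
urllib.parse.uses_relative.append("gemini")
urllib.parse.uses_netloc.append("gemini")

base = "gemini://example.org/dir/page.gmi"
assert urllib.parse.urljoin(base, "/docs/") == "gemini://example.org/docs/"
assert urllib.parse.urljoin(base, "example.gmi") == "gemini://example.org/dir/example.gmi"
assert urllib.parse.urljoin(base, "dir/test/foo.gmi") == "gemini://example.org/dir/dir/test/foo.gmi"
assert urllib.parse.urljoin(base, "//gus.guru/") == "gemini://gus.guru/"
```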

> From what you said on irc, the situation is different between URI and IRI
> because most languages have URI parsing either in their stdlib or in a well
> tested known library. But, if no project use IRI, of course no one will
> write a library for it, this is a chicken and egg situation here.

Yep, it's a shame. But we must live with it, and so URIs are the way forward.

> Also, for the purpose of a client, it seems to me the parsing needed
> (domain and query extraction) is only to search for the first "/" and the last
> "?", and some minor tweaks on the scheme maybe (which does not contain unicode,
> I will leave the scheme alone, promise).

It's always more complex than that. I'm a bit too tired to go dig into the RFCs
to prove it right now, but I would not trust software that just matches some
characters instead of compliantly parsing things in their entirety. This method
would make Gemini more complex and easily introduce bugs. If we use URIs, we don't
have to resort to this.

> Note: Just tried gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69
> in lagrange, it does work.

Works in Amfora too, and note that also the server software (Molly Brown) is accepting
and parsing it correctly into a file path. But that's expected, because it's
perfectly valid to percent-encode ASCII in a URL path.

Cheers,
makeworld

Link to individual message.

52. Sean Conner (sean (a) conman.org)

It was thus said that the Great CΓ΄me Chilliet once stated:
> Le lundi 7 dΓ©cembre 2020, 19:00:02 CET colecmac at protonmail.com a Γ©crit :
> > 
> > This would then require IRI parsing libraries, and as I have explained
> > earlier, these don't exist in likely many programming languages, and
> > when they do, they are third-party.
> 
> From what you said on irc, the situation is different between URI and IRI
> because most languages have URI parsing either in their stdlib or in a
> well tested known library. But, if no project use IRI, of course no one
> will write a library for it, this is a chicken and egg situation here.

  I'm looking at RFC-3987 [1] and the changes from RFC-3986 [2] are minimal,
and it would be easy to modify my own URI parsing library [3] (which is
based directly off the BNF of RFC-3986) but that only gets me so far.  The
other issue is Unicode normalization and punycode support, both of which I
would have to track down existing libraries or (and I shudder to think this)
write my own.

> Also, for the purpose of a client, it seems to me the parsing needed
> (domain and query extraction) is only to search for the first "/" and the
> last "?", and some minor tweaks on the scheme maybe (which does not
> contain unicode, I will leave the scheme alone, promise).

  And then do some Unicode normalization to match how filenames are stored
on your server:

	http://www.example.org/rΓ©sumΓ©.html
	http://www.example.org/reΜsumeΜ.html

  -spc

[1]	https://tools.ietf.org/html/rfc3987

[2]	https://tools.ietf.org/html/rfc3986

[3]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

Link to individual message.

53. Philip Linde (linde.philip (a) gmail.com)

On Mon, 07 Dec 2020 18:13:17 GMT
"Adnan Maolood" <me at adnano.co> wrote:

> I don't think that using unicode in addresses would decrease security
> because of the way that Gemini handles client authentication. Since
> client certificates are limited to certain domains and paths, the
> certificate will never be applied to the wrong domain, even if it looks
> the same to the user.

Security might also mean knowing that I'm not unintendedly divulging
any details of my browsing habits to some unknown third party, or
knowing that no one can impersonate your server and pages to mislead
your readers. Neither of those necessarily involve client certificates
at all but are a real possibility when multiple code points can
represent similar or identical glyphs.

I can see how IRI and IDN may be a good idea in terms of including
languages with bigger or altogether different alphabets or
non-alphabets, but from the perspective of an implementer, it does add
a lot of complexity and opens up to homograph attacks in an area where
ASCII transliteration is already the norm. Some browsers deal with
homograph attacks by displaying punycode directly based on some basic
heuristic (e.g. when a hostname contains both cyrillic and latin codes).

I don't know much about IRI. Web browsers for example sort of skipped on
this standard in favor of the WHATWG URL spec.

Personally, I think some concessions need to be made to maintain the
simplicity of the protocol. The currently mandated standard is
(relatively) short and simple to implement, and transliteration is
already pervasive in the area of internet names and URIs. Octet
encoded ASCII does have the nice property that there are no homographs,
there's no normalization, there's no bidirectional text etc., and there
is no database of rules that has to be applied to handle these things.

That said, I think a lot can be improved on the client side without
changing the standard. Clients can optionally do the ToASCII/ToUnicode
dance and correspondingly automatically percent encode input and display
"un-percented" paths in some circumstances. The standard only specifies
what needs to be sent to the server to request a resource, and what
text/gemini documents need to contain to produce a link. This opens up
a lot of quality of life improvements on the user interface level.

RFC 4690 is a good read on the topic of IDNs.

-- 
Philip

Link to individual message.

54. Scot (gmi1 (a) scotdoyle.com)

On 12/7/20 3:46 PM, colecmac at protonmail.com wrote:
> On Monday, December 7, 2020 4:00 PM, Scot<gmi1 at scotdoyle.com>  wrote:
>
>> On 12/7/20 12:00 PM,colecmac at protonmail.com  wrote:
>>
>>> What parsing would a client have to do?
>>>
>>> -   Extracting the domain, so it can be punycoded for DNS lookups
>> Can we be sure gemini host resolution will always use the global DNS?
>>
>> Section 4 of RFC 6055 cautions against assuming that all name resolution
>> is using the global DNS and therefore that querying with punycode
>> domain names will succeed:
>>
>>    It is inappropriate for an application that calls a general-purpose
>>    name resolution library to convert a name to an A-label unless the
>>    application is absolutely certain that, in all environments where the
>>    application might be used, only the global DNS that uses IDNA
>>    A-labels actually will be used to resolve the name.
> ... Do you have an example of a system/network that fails on punycode?
>
Yes, an organization's internal network resolver or a user's local
resolver could reply to utf8 queries but not punycode queries.

For example, adding the line:

   10.99.99.1   蛸蛸.jp

to /etc/hosts on Ubuntu 20.04 with resolver systemd-resolved
and running the test program [1] gives this output:

   error querying 'xn--td2aa.jp': Unknown server error

   name '蛸蛸.jp' has ip address 10.99.99.1


[1]
#include <netdb.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int show_ip(char *name) {
  struct hostent *entry;
  entry = gethostbyname(name);
  if (entry) {
    printf("name '%s' has ip address %s\n\n", entry->h_name,
           inet_ntoa(*((struct in_addr *)entry->h_addr)));
  } else {
    printf("error querying '%s': %s\n\n", name, hstrerror(h_errno));
  }
  return entry != NULL;
}

int main() {
  show_ip("xn--td2aa.jp");  /* A-label (Punycode) form */
  show_ip("??.jp");         /* U-label (UTF-8) form */
  return 0;
}

55. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> > > ... Do you have an example of a system/network that fails on punycode?
>
> Yes, an organization's internal network resolver or a user's local
> resolver could reply to utf8 queries but not punycode queries.
>
> For example, adding the line:
>
>     10.99.99.1    ??.jp
>
> to /etc/hosts on Ubuntu 20.04 with resolver systemd-resolved
> and running the test program [1] gives this output:
>
>     error querying 'xn--td2aa.jp': Unknown server error
>
>     name '??.jp' has ip address 10.99.99.1

Thanks for the example, although it seems very contrived to me. Firefox
punycodes the domain as soon as you put it into the address bar, for
example, so any network that wants to support web browsing must use
a punycoded version. I'm sure there are many other pieces of software
that do the same.

Your example doesn't really convince me that a Gemini browser is going
to encounter a situation where doing a lookup using the punycoded domain
name will be the wrong thing to do. It's not literally impossible for
that to be the case, but I don't really see it being an issue at all.


makeworld

56. Stephane Bortzmeyer (stephane (a) sources.org)

On Tue, Dec 08, 2020 at 01:18:07AM +0100,
 Philip Linde <linde.philip at gmail.com> wrote 
 a message of 69 lines which said:

> homograph attacks

Homograph attacks are basically a good way to make an English-speaking
audience laugh when you show them funny Unicode problems (I've seen
that several times in several meetings: other people's languages and
scripts are always funny). No bad guy uses them in real life,
probably because users typically never check the URI or IRI.

And they exist in ASCII, too (goog1e.com...)

> Some browsers deal with homograph attacks by displaying punycode
> directly based on some basic heuristic (e.g. when a hostname
> contains both cyrillic and latin codes).

Which is awful for the UX. Note that such mangling is never done for
ASCII, which clearly shows a provincial bias toward English.

> Octet encoded ASCII does have the nice property that there are no
> homographs, there's no normalization,

This is not true. Since percent-encoding encodes bytes, there are
still several ways to represent "the same" string of characters and
therefore normalization remains an issue.
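To make this concrete, here is a sketch with Python's standard library: the visually identical string "cafΓ©" percent-encodes to two different byte sequences depending on its Unicode normalization form:

```python
import unicodedata
from urllib.parse import quote

nfc = unicodedata.normalize("NFC", "cafe\u0301")  # precomposed: 'caf' + U+00E9
nfd = unicodedata.normalize("NFD", "caf\u00e9")   # decomposed: 'cafe' + U+0301
print(quote(nfc))  # caf%C3%A9
print(quote(nfd))  # cafe%CC%81
```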

> RFC 4690 is a good read on the topic of IDNs.

No, it is a one-sided anti-internationalization rant.

57. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 06:00:02PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 32 lines which said:

> What parsing would a client have to do?
...
> This would then require IRI parsing libraries, and as I have explained
> earlier, these don't exist in likely many programming languages, and
> when they do, they are third-party.

For Python (a common programming language), this is not true: the
standard library's urlparse has no problem:

% ./test-urlparse.py gemini://gΓ©meaux.bortzmeyer.org:8965/cafΓ©\?foo=bar
Host name: gΓ©meaux.bortzmeyer.org
Port: 8965
Path: /cafΓ©
Query: foo=bar

% cat test-urlparse.py
#!/usr/bin/env python3

import sys
import urllib.parse

for url in sys.argv[1:]:
    components = urllib.parse.urlparse(url)
    print("Host name: %s" % components.hostname)
    if components.port is not None:
        print("Port: %s" % components.port)
    print("Path: %s" % components.path)
    if components.query != "":
        print("Query: %s" % components.query)

58. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 09:46:06PM +0000,
 colecmac at protonmail.com <colecmac at protonmail.com> wrote 
 a message of 49 lines which said:

> Despite what this RFC says, I don't see what situation would actually
> completely fail on punycoded domains. I guess the spec could mandate trying
> with punycode first, then Unicode, but that seems needless to me. Do you
> have an example of a system/network that fails on punycode?

mDNS (used in Apple's Bonjour). Despite its name, it has little to do
with DNS, and it requires UTF-8 (and does not use Punycode).

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6762.txt

59. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 07, 2020 at 06:37:51PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 69 lines which said:

> The "horrible added complexity" is not RFC-3987 per se, but the
> "horrible added complexity" of Unicode normalization that is
> required. [...]  Methinks you inadvertently answered your own
> question---Unicode is *not* easy

It would be hard to claim that Unicode is easy :-) But, to be fair,
the complexity is in human scripts (for instance the
lowercase/uppercase distinction, which creates a lot of
problems). Unicode just reflects the complexity of human writing systems.

60. Philip Linde (linde.philip (a) gmail.com)

On Tue, 8 Dec 2020 11:29:24 +0100
Stephane Bortzmeyer <stephane at sources.org> wrote:

> For Python (a common programming language), this is not true, standard
> library's urlparse has no problem:

Similar results in Go:


--- code
package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	for _, arg := range os.Args[1:] {
		u, err := url.Parse(arg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q %q %q\n", u.Hostname(), u.Path, u.RawQuery)
	}
}
---

However, this still leaves the problem of punycoding and, worse,
normalization, to some other piece of code. In Go, normalization is in
the golang.org/x/text package; ToASCII/ToUnicode implementations are in
golang.org/x/net/idna.

Not sure if Python will normalize by default.
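For what it's worth, a quick sketch suggests Python's urlparse does not normalize either; the string passes through unchanged:

```python
import unicodedata
from urllib.parse import urlparse

nfd = unicodedata.normalize("NFD", "caf\u00e9")      # 'cafe' + combining acute
path = urlparse("gemini://example.org/" + nfd).path
print(path == "/caf\u00e9")  # False: no normalization was applied
print(path == "/" + nfd)     # True: passed through unchanged
```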

-- 
Philip

61. Gary Johnson (lambdatronic (a) disroot.org)

As yet another data point, Java's standard library contains a class
(java.net.URI) that correctly parses URIs with non-ASCII characters in
their paths and query params, but it chokes when they are in the domain
name.

Therefore, URIs like this should work fine with Space Age:

gemini://gemeaux.bortzmeyer.org:8965/cafΓ©?fooΓ©y=barΓ©y

But this is a non-starter:

gemini://gΓ©meaux.bortzmeyer.org:8965/cafΓ©?fooΓ©y=barΓ©y

It looks like there is an incomplete and poorly documented
implementation of RFC 3987 (IRI) and RFC 3986 (URI) in Apache Jena
(https://jena.apache.org/documentation/notes/iri.html), but it's a
rather heavyweight addition to an otherwise very concise Gemini server.

I'll keep an eye on this thread to see what the community ultimately
decides to do about IRI/IDN.

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

62. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

So far Python, Go, and Java libs have been mentioned as sort
of working with IRI.

It's cool that the Python and Go ones seem to work, but I wouldn't
trust them, because they aren't intended to support IRIs. The
term IRI appears nowhere in either of their docs, so there
could easily be subtle bugs. The Java one is literally called
URI, and as Gary Johnson explained, it has issues.

However, these issues are all irrelevant in the face of two things:

- This is the IDN thread; there's a separate thread for IRI :)
- IRIs would be a breaking change to Gemini, which is neither
  feasible nor a good idea.


Cheers,
makeworld

63. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Tue, Dec 08, 2020 at 01:18:07AM +0100,
>  Philip Linde <linde.philip at gmail.com> wrote 
>  a message of 69 lines which said:
> 
> > homograph attacks
> 
> Homograph attacks are basically a good way to make an english-speaking
> audience laugh when you show them funny Unicode problems (I've seen
> that several times in several meetings: the languages and scripts of
> other people are always funny). No bad guy use them in real life,
> probably because users typically never check the URI or IRI.

  True, there's no need currently for homograph attacks if other, simpler
means are available.

> And they exist with ASCII, too (goog1e.com...)

  True.  But a more concerning attack is bitsquatting [1], a much harder
attack to thwart.  Is it widely used?  Hard to say, actually.

> > Some browsers deal with homograph attacks by displaying punycode
> > directly based on some basic heuristic (e.g. when a hostname
> > contains both cyrillic and latin codes).
> 
> Which is awful for the UX. Note that such mangling is never done for
> ASCII, which clearly shows a provincial bias toward english.
> 
> > Octet encoded ASCII does have the nice property that there are no
> > homographs, there's no normalization,
> 
> This is not true. Since percent-encoding encodes bytes, there are
> still several ways to represent "the same" string of characters and
> therefore normalization remains an issue.

  Yes, but by "normalization" they mean precomposed characters (like
"\u{00E9}") vs. combining characters (like "e\u{0301}"), along with the
ordering of consecutive combining characters.
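A minimal sketch of those two forms, using Python's unicodedata:

```python
import unicodedata

precomposed = "\u00e9"  # 'Γ©' as a single code point
combining = "e\u0301"   # 'e' followed by a combining acute accent
print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```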

> > RFC 4690 is a good read on the topic of IDNs.
> 
> No, it is a one-sided anti-internationalization rant.

  Aside from "internationalization is hard", what's so bad about the
document?  Remember, they *are* (or *were*) trying to retrofit
internationalization onto protocols that were never designed for it.

  -spc

[1]	http://www.dinaburg.org/bitsquatting.html

64. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Mon, Dec 07, 2020 at 09:46:06PM +0000,
>  colecmac at protonmail.com <colecmac at protonmail.com> wrote 
>  a message of 49 lines which said:
> 
> > Despite what this RFC says, I don't see what situation would actually
> > completely fail on punycoded domains. I guess the spec could mandate trying
> > with punycode first, than Unicode, but that seems needless to me. Do you
> > have an example of a system/network that fails on punycode?
> 
> mDNS (used in Apple's Bonjour). Despite its name, it has little to do
> with DNS, and it requires UTF-8 (and does not use Punycode).

  I was curious about this, having written a DNS library [1].  Saying it
has "little to do with DNS" when it is called "Multicast DNS", uses the
same encoding scheme as DNS, and covers a portion of the DNS namespace is
a bit uncharitable (in my opinion).  It *is* DNS, over UDP---it just uses
a special IP address and a different port.

  I was also surprised that UTF-8 characters *are* possible in DNS packets
[2].  I was, however, a bit disappointed that "gΓ©meaux.bortzmeyer.org" and
"xn--gmeaux-bva.bortzmeyer.org" didn't exist.

  -spc

[1]	https://github.com/spc476/SPCDNS

[2]	And I was happy to see my library could successfully deal with such,
	even though I wasn't consciously aware of doing so.

65. Stephane Bortzmeyer (stephane (a) sources.org)

On Tue, Dec 08, 2020 at 05:20:38PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 29 lines which said:

>   I was also surprised that UTF-8 characters *are* possible in DNS packets
> [2].

It has been possible from the beginning, and it was said explicitly in
RFC 2181, twenty-three years ago: "any binary string whatever can be
used as the label of any resource record".

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc2181.txt

> I was, however, a bit disappointed that "gΓ©meaux.bortzmeyer.org" and
> "xn--gmeaux-bva.bortzmeyer.org" didn't exist.

Then it means your DNS resolver is broken because
xn--gmeaux-bva.bortzmeyer.org (the A-label, the Punycode form) is in
the DNS.

% dig AAAA +noidnout xn--gmeaux-bva.bortzmeyer.org  

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Debian <<>> AAAA +noidnout xn--gmeaux-bva.bortzmeyer.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18694
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 4, AUTHORITY: 7, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;xn--gmeaux-bva.bortzmeyer.org. IN AAAA

;; ANSWER SECTION:
xn--gmeaux-bva.bortzmeyer.org. 86400 IN	CNAME radia.bortzmeyer.org.
xn--gmeaux-bva.bortzmeyer.org. 86400 IN	RRSIG CNAME 8 3 86400 (
				20201218215750 20201204120249 10731 bortzmeyer.org.
				gKtCZZKsjTLdFsSKYtgvz1S+pRkZbxweG+6XOxVhJgYG
				gRzfWB8lhjSPaQ6BNK6YyGQreonObF1R43MDY5oQ66ti
				hNOfPp3/gz4wm5eAy3uzFi7xiwclshsLd0yZEaOPsTo6
				fYKfRp5XCG/yZOg85YdZxJB9LK9q+RIyOycGmI0= )
radia.bortzmeyer.org.	86400 IN AAAA 2001:41d0:302:2200::180
...

66. marc (marcx2 (a) welz.org.za)

Hello again

> > An URL is an address, in the same way that a phone number or an IP
> > is an address. Ideally these are globally unique, unambiguous and
> > representable everywhere. This address scheme should be independent
> > of a localisation.
> >
> > We don't insist that phone numbers are rendered in roman
> > numerals either. My dialing prefix isn't +XXVII. The
> > gemini:// prefix isn't tweeling:// in dutch.
> 
> This theory, in the world of domain names, is wrong. RFC 2277 says...

Your reliance on one RFC as an authority while rejecting
another RFC as "a one-sided anti-internationalization rant"
does not strike me as being consistent.

> > reducing security,
> 
> That's false. I am still waiting to see an actual phishing email with
> Unicode. Most of the time, the phisher does not even bother to have a
> realistic URL; they advertise <http://evil.example/famousbank> and it
> works (few people check the URL).
> 
> Anyway, the goal of Gemini is not to do online banking, so this is not
> really an issue.

There exists a neat quote by a certain B. Russell on
people who are so very sure of themselves.

The Gemini spec fixes the URL length in octets. Various
ways of encoding internationalised data may make it possible
for a bad guy to shrink and grow URLs in unexpected ways
and clobber this buffer.

The interaction between filesystems, archiving software or
protocol gateways generates many more aliasing problems.

> Now, identifiers, like domain names,
> are a complicated case, since they are both protocol elements and
> text. But, since they are widely visible (in advertisments, business
> cards, etc), I believe they should be internationalized, too.

Imagine a slightly different world where people don't exchange
business cards, but a small amount of sheet music - their own
personal jingle (retrofuturism, right?). It turns out sheet
music is annotated in Italian - it can say things like
"forte" or "pianissimo". Would you go around and angrily
cross out those words to replace them with your local language?

> > subtle ambiguities (is the Cyrillic C the same as a Latin C, who
> > knows ?),
> 
> There is no ambiguity: U+0421 is different from U+0043.

There are various insults starting with a Latin C. Rewriting
them to start with a Cyrillic C doesn't make them any
less insulting.

> > Using unicode in addresses balkanises this global space
> 
> The English-speaking space is not a global space: it is the space of a
> minority of the world population.

[WARNING: wall of text ahead]

I think here we are heading to core of the argument...
of what a language is. And it is a big split
that many don't know how to articulate:

Some see language as a core part of their identity
(who they are) - others see language as a tool for
communicating (a protocol).

I think tying one's identity to a nation/ethnicity and
its language sets one up for conflict, both internally
(who one is) and externally (between states). It
is also silly - languages actually evolve quite
rapidly and leave significant imprints on each
other, while people migrate (or get conquered, sadly).

So I think it is better *not* to view English as
the property of a particular ethnicity, but as
a popular communications protocol - an earlier
protocol might have been Latin, which left significant
influences on English - and if Mandarin (or Hindi,
or whatever) ends up displacing English in turn, then I
expect there to be many traces of English left there
too.

It is easy to envy native English speakers - to think
they have it easier. But that is not true - being
multilingual is a real advantage, in so many ways:
being able to speak an extra language, for instance,
is a major protective factor against dementia... and
every extra language one learns makes it easier
to learn the next. Bible scholars have no issues
acquiring a decent grasp of Hebrew and Ancient Greek;
philosophers might try to read Immanuel Kant in German.

Most of us have arbitrarily tied our identities to
a nation state and thus to that nation's language - it
really doesn't have to be that way.

regards

marc

67. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 9, 2020, at 13:20, marc <marcx2 at welz.org.za> wrote:
> 
> Most of us have arbitrarily tied our identities to
> a nation state and thus to that nation's language - it
> really doesn't have to be that way.
> 

Amen to that. International provincialism of sorts. 

"That gibberish he talked was Cityspeak, gutter talk, a mishmash of 
Japanese, Spanish, German, what have you. I didn't really need a 
translator. I knew the lingo, every good cop did. But I wasn't going to 
make it easier for him."

-- Rick Deckard

Nevertheless, it would be nice to be able to type UTF-8 directly in Gemini 
requests & text/gemini links and have everything magically work without 
much ado. That would be progress for once.

Word of the week:
https://en.wikipedia.org/wiki/Retrofuturism

68. Sean Conner (sean (a) conman.org)

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Tue, Dec 08, 2020 at 05:20:38PM -0500,
>  Sean Conner <sean at conman.org> wrote 
>  a message of 29 lines which said:
> 
> > I was, however, a bit disappointed that "g?meaux.bortzmeyer.org" and
> > "xn--gmeaux-bva.bortzmeyer.org" didn't exist.
> 
> Then it means your DNS resolver is broken because
> xn--gmeaux-bva.bortzmeyer.org (the A-label, the Punycode form) is in
> the DNS.

  Then I'm going to say this was operator error, because I was able to
look up xn--gmeaux-bva.bortzmeyer.org.

  -spc

---
