👾 Archived View for gemi.dev › gemini-mailing-list › 000524.gmi captured on 2024-08-19 at 01:02:19. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-12-28)
-=-=-=-=-=-=-
The specification <gemini://gemini.circumlunar.space/docs/specification.gmi> seems silent about IDN (Internationalized Domain Names, domain names in Unicode, see RFC 5890). The spec mentions URIs (though not RFC 3986) but not IRIs (Internationalized Resource Identifiers, RFC 3987). Therefore, it is not clear what servers and clients should do (send an IRI, or accept an IRI but convert it to a URI, or something else). A test with some clients seems to indicate it does not work (tested at <gemini://gémeaux.bortzmeyer.org/>):
On Fri, Dec 4, 2020 at 8:53 AM Stephane Bortzmeyer <stephane at sources.org> wrote:

> Therefore, it is not clear what servers and clients should do (send an IRI, or accept an IRI but convert it to a URI, or something else).

It seems clear from the behavior of web browsers that the Right Thing is to convert all IDNs to Punycode before putting them on the wire. By the same token, all non-ASCII characters in other parts should be UTF-8 encoded and then %-encoded before transmission. This applies both to IRIs entered by hand and IRIs appearing in links.

> * Amfora claims the domain name does not exist (it does exist),
> "Failed to connect to the server: dial tcp: lookup
> gémeaux.bortzmeyer.org: no such host."

I'm pretty sure this is because no punycoding is being done in the DNS, and it's probably getting the UTF-8 encoding instead of "xn--gmeaux-bva.bortzmeyer.org". When I ask Lagrange to connect to the punycoded form explicitly, your server does not recognize it as "self" and replies with "Proxy Request Refused". I can't account for the behavior of the other servers easily.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
One art / There is / No less / No more
To do / All things / With sparks / Galore  --Douglas Hofstadter
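The "convert everything before it hits the wire" recipe John describes can be sketched with Python's standard library (the host and path below come from the thread; `to_wire_url` is a made-up helper name, not anything from a real client):

```python
# Sketch of the approach above: punycode the host (Python's stdlib
# "idna" codec, which implements IDNA 2003) and percent-encode the
# UTF-8 bytes of the rest of the URL.
import urllib.parse

def to_wire_url(host: str, path: str) -> str:
    ascii_host = host.encode("idna").decode("ascii")
    ascii_path = urllib.parse.quote(path, safe="/")
    return f"gemini://{ascii_host}{ascii_path}"

print(to_wire_url("gémeaux.bortzmeyer.org", "/français"))
# gemini://xn--gmeaux-bva.bortzmeyer.org/fran%C3%A7ais
```

Both transformations are reversible, so a client can still show the Unicode form in its UI while sending only ASCII.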
On Fri, Dec 04, 2020 at 09:46:40AM -0500, John Cowan <cowan at ccil.org> wrote a message of 93 lines which said:

> It seems clear from the behavior of web browsers that the Right Thing is to
> convert all IDNs to Punycode before putting them on the wire.

I disagree. Just because HTTP does it that way does not mean that everyone else should. IMHO, the right behaviour would be:
> - parse the IRI and extract the domain name
> - convert it to Punycode
> - do the DNS lookup
> - connect to the IP address and send the IRI as request

I feel like this is probably the most intuitive method. Only use punycoding when it's a necessity, like for DNS lookups.

What about link lines though? I think that clients should accept both punycoded and Unicode domains in links. Convert all links' domains to punycode for DNS, then convert all links' domains to Unicode for sending. That seems a bit complicated, but from an author's perspective it makes sense to support both.

makeworld
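Stephane's four-step flow, where punycode is used only for the DNS lookup and the request carries the raw IRI, might look roughly like this in Python (a sketch only; `split_for_lookup` is a hypothetical helper, and a real client would still open a TLS connection with the resolved address):

```python
# Sketch of the flow above: the IRI goes over the wire as raw UTF-8;
# punycode is used only to get a name the DNS can resolve.
import urllib.parse

def split_for_lookup(iri: str):
    parts = urllib.parse.urlsplit(iri)  # urlsplit tolerates non-ASCII
    dns_name = parts.hostname.encode("idna").decode("ascii")
    wire_request = iri.encode("utf-8") + b"\r\n"
    return dns_name, wire_request

name, request = split_for_lookup("gemini://gémeaux.bortzmeyer.org/")
# name is the xn-- form (suitable for socket.getaddrinfo);
# request still contains the é, exactly as this proposal wants.
```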
On Fri, Dec 04, 2020 at 06:36:00PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 15 lines which said:

> I feel like this is probably the most intuitive method. Only
> use punycoding when it's a necessity, like for DNS lookups.

I don't know what the process is for proposing and discussing changes to the Gemini specification (or for following the points under discussion). In the meantime, I've summarized the discussion here:

gemini://gemini.bortzmeyer.org/gemini/idn.gmi
> I don't know what the process is for proposing and discussing changes
> to the Gemini specification (or for following the points under discussion).

It's basically what we're doing. At some point Solderpunk will chime in, hopefully, and make a permanent change.

> In the mean time, I've summarized the discussion here:
>
> gemini://gemini.bortzmeyer.org/gemini/idn.gmi

Thanks for this. I forgot about certificates. I feel like for the most compatibility, clients should support both the punycoded and Unicode versions of the domain in certs. Anyone disagree?

As for Unicode normalization, I feel like that's complex, annoying, and hopefully out-of-scope. There should be one Unicode string for each domain only, and I really don't want to have to deal with anything else.

Cheers,
makeworld
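The "accept both forms in certs" idea could be sketched like this, assuming Python's stdlib `idna` codec (`host_matches` is a hypothetical helper, not from any real client; a real implementation would plug into the client's TOFU/TLS verification):

```python
# Sketch of "accept both forms": compare the requested host against
# each certificate name in both Unicode and punycoded spellings.
def host_matches(requested: str, cert_names: list[str]) -> bool:
    forms = {requested, requested.encode("idna").decode("ascii")}
    for name in cert_names:
        candidates = {name}
        try:
            candidates.add(name.encode("idna").decode("ascii"))
        except UnicodeError:
            pass  # a malformed name in the cert simply never matches
        if forms & candidates:
            return True
    return False
```

Since already-ASCII labels pass through the `idna` codec unchanged, this matches regardless of which spelling the server put in its certificate.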
On Sun, Dec 06, 2020 at 05:11:48PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 21 lines which said:

> As for Unicode normalization, I feel like that's complex, annoying,
> and hopefully out-of-scope. There should be one Unicode string for
> each domain only, and I really don't want to have to deal with
> anything else.

Well, OK, so let's settle on NFC for everybody? (Since this is what RFC 5198 says.)
> > As for Unicode normalization, I feel like that's complex, annoying,
> > and hopefully out-of-scope. There should be one Unicode string for
> > each domain only, and I really don't want to have to deal with
> > anything else.
>
> Well, OK, so let's settle on NFC for everybody? (Since this is what
> RFC 5198 says.)

Do you mean all clients should do NFC? That seems to me like it would make Gemini quite a bit more complex. I feel like NFC should be on the user, if that's possible. Also I wonder how often this issue actually occurs: are users really typing U+0065 U+0301 (é, decomposed) instead of U+00E9 (é)?

makeworld
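For reference, the two spellings being discussed really are distinct code point sequences until normalized; NFC (which RFC 5198 mandates) maps the decomposed form to the precomposed one. In Python:

```python
# The two spellings of "é" under discussion: different strings until
# normalized; NFC picks the precomposed form, NFD the decomposed one.
import unicodedata

decomposed = "e\u0301"   # U+0065 U+0301: 'e' + combining acute accent
precomposed = "\u00e9"   # U+00E9: 'é' as one code point

assert decomposed != precomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```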
On Sunday, December 6, 2020 6:05 PM, A. E. Spencer-Reed <easrng at gmail.com> wrote:

> > are users really typing U+0065 U+0301 (é, decomposed) instead of U+00E9 (é)?
>
> In links, probably not, but maybe in the address bar. However, I was
> under the impression that precomposed characters are no longer
> supposed to be used, am I horribly wrong?

I don't know about that, but I hope that's true. It helps reinforce my idea that normalization is way out of scope for Gemini clients.

makeworld

(Note I've CC'ed the gemini list, I think you forgot to Reply All in your message)
I'm not sure if the first email is the right one to reply to in this case, but I've summarized the suggestions of this thread here: https://github.com/makeworld-the-better-one/go-gemini/issues/10 I hope this makes it easier for other client authors to figure out what to do, as well as for Solderpunk to make an official decision. Cheers, makeworld
On Sunday, December 6, 2020 7:40 PM, A. E. Spencer-Reed <easrng at gmail.com> wrote:

> > Do you mean all clients should do NFC? That seems to me like it would
> > make Gemini quite a bit more complex.
>
> Isn't that usually handled by the standard library anyway?

I suppose, yeah. Many languages might need to import a package, but it won't be something the programmer is doing themselves, just like TLS.

I'm just wary of bringing in another large dependency, and I know Unicode to be something complex, and something that will require updates. I would very much like to hear Solderpunk's opinion on this.

makeworld

(You forgot to use Reply-All again, I've CC'ed the list.)
> > > Do you mean all clients should do NFC? That seems to me like it would
> > > make Gemini quite a bit more complex.
> >
> > Isn't that usually handled by the standard library anyway?
>
> I suppose, yeah. Many languages might need to import a package, but it won't be
> something the programmer is doing themselves, just like TLS.
>
> I'm just wary of bringing in another large dependency, and I know Unicode to
> be something complex, and something that will require updates. I would very
> much like to hear Solderpunk's opinion on this.

Having looked into it[1], it doesn't look that complicated, for Go anyway. Perhaps it should be recommended for clients, but not required, while the other things, like punycoding and sending the IDN to the server, would be required.

makeworld

1: https://github.com/makeworld-the-better-one/go-gemini/issues/10#issuecomment-739604051
> I feel like this is probably the most intuitive method. Only
> use punycoding when it's a necessity, like for DNS lookups.
>
> What about link lines though? I think that clients should
> accept both punycoded and Unicode domains in links. Convert
> all links' domains to punycode for DNS, then convert all links'
> domains to Unicode for sending. That seems a bit complicated,
> but from an author's perspective it makes sense to support both.

Allowing IRIs is a *really big and breaking change*. Right now, checking whether a gemini request is valid can be done really easily, even in a language like C and with no external dependencies. Converting the IRI to a URI in the client and having the server configured with the punycode seems like a much cleaner, simpler and even more robust solution to me.

bie
bie wrote:

> Allowing IRIs is a really big and breaking change.

Stephane Bortzmeyer mentioned IRIs in an earlier email in this thread. I think that was probably a mistake, and if not, then I don't support it. I should have caught it at the time but I didn't, sorry. All I have been talking about the entire time, in this thread and in the GitHub issue[1] you quote from, is IDNs -- Internationalized Domain Names. You're right that using IRIs over URIs would be a big change, and a bad one. I'm only talking about converting and messing with domains. In my opinion, to keep things simple, no client should deal with IRIs at all.

I hope that sets the record straight. I'm only talking about domains. Thanks for allowing me to clarify that.

>> What about link lines though?

This quote from the issue has been removed now. It was in reference to how Amfora should work internally, not what low-level clients should do. I hope the issue is clearer now. No IRIs! :)

Cheers,
makeworld

1: https://github.com/makeworld-the-better-one/go-gemini/issues/10
> On Dec 7, 2020, at 00:27, colecmac at protonmail.com wrote: > > (Note I've CC'ed the gemini list, I think you forgot to Reply All in your message) Sigh.
It's 2020, can we please be allowed to use French in our links?

It makes no sense that I'd need to know two weird transliteration schemes by heart before I can link to échecs.fr/français.

I get that we need to encode spaces and other special delimiter characters, but other than that, what's the rationale for limiting to ASCII?

MCMic
On Mon, Dec 07, 2020 at 09:24:08AM +0100, Côme Chilliet wrote:

> It's 2020, can we please be allowed to use French in our links?
>
> It makes no sense that I'd need to know two weird transliteration schemes by heart before I can link to échecs.fr/français.
>
> I get that we need to encode spaces and other special delimiter characters, but other than that, what's the rationale for limiting to ASCII?
> MCMic

There is one really good reason - it won't work well with existing servers and clients.

Most servers (and especially servers that follow the spec) currently only accept requests that provide a valid URI - so a request that contains something outside the set of 84 valid characters should not be accepted. Asking servers to start accepting IRIs is a big change, and a breaking change in my opinion, one that adds a lot of complexity for very little value added.

Allowing such links in text/gemini, but asking clients to handle the percent-encoding in the background, has a similar problem - it goes against what every single client is doing now.

The best solution, in my opinion, is to stick to URIs. If someone really wants to be able to type links like your example into their text/gemini files, there's even a solution for that - create a server that processes the .gmi files on the fly and sends punycode/percent-encoded links to the client.

bie
On Monday, 7 December 2020 at 10:29:34 CET, bie wrote:

> > I get that we need to encode spaces and other special delimiter characters, but other than that, what's the rationale for limiting to ASCII?
> > MCMic
>
> There is one really good reason - it won't work well with existing
> servers and clients.
>
> Most servers (and especially servers that follow the spec) currently
> only accept requests that provide a valid URI - so a request that
> contains something outside the set of 84 valid characters should not be
> accepted. Asking servers to start accepting IRIs is a big change,
> and a breaking change in my opinion, one that adds a lot of complexity
> for very little value added.

I have to disagree that using my own language to name files and pages on my own server is of "very little value". Servers already have to output UTF-8, why not accept UTF-8 in the input?

> Allowing such links in text/gemini, but asking clients to handle the
> percent-encoding in the background has a similar problem - it goes
> against what every single client is doing now.

I am not asking clients to handle percent-encoding, I expect my server to receive the UTF-8 I put in the link. I just tried: my server handles gemini://gemlog.lanterne.chilliet.eu/français-test.gmi with no problem. (I coded the server myself, but I did not put any effort into supporting this; I never tried it before). Most clients will percent-encode the request, but with Lagrange, if I enter it like this in the address bar, my server does receive the request not percent-encoded and reacts well. (It also reacts well if percent-encoded, of course, since it decodes to the same name).

> The best solution, in my opinion, is to stick to URIs. If someone really
> wants to be able to type links like your example into their text/gemini
> files, there's even a solution for that - create a server that processes
> the .gmi files on the fly and sends punycode/percent-encoded links to
> the client.
Gemini has taken steps in the right direction by defaulting to UTF-8 and specifying that there is no default value for lang. It would make a lot of sense to accept UTF-8 in requests as well and not arbitrarily limit them to ASCII, just because of web history.

I understand that punycode will have to be used for the DNS lookup, but that's on the DNS specification, not Gemini's responsibility. But I fail to see why the Gemini request should be punycoded, or percent-encoded except for special delimiter characters.

MCMic
On Sun, Dec 06, 2020 at 05:30:16PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 15 lines which said:

> I feel like NFC should be on the user, if that's possible. Also I
> wonder how often this issue actually occurs, are users really typing
> U+0065 U+0301 (é, decomposed) instead of U+00E9 (é)?

Users don't input Unicode code points! They input characters (using various methods), and, behind the scenes, the input methods they use produce code points. This is typically not under the control of the user.
On Sun, Dec 06, 2020 at 11:27:11PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 18 lines which said: > > However, I was under the impression that precomposed characters > > are no longer supposed to be used, am I horribly wrong? > > I don't know about that, but I hope that's true. Quite the contrary. RFC 5198 mandates NFC, which maps many characters to the precomposed form.
On Mon, Dec 07, 2020 at 03:24:25AM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 28 lines which said:

> Stephane Bortzmeyer mentioned IRIs in an earlier email in this thread.
> I think that was probably a mistake,

No, it wasn't. But it's true that there are two technically different issues, the domain name and the path, which, unfortunately, may require different treatments.

From the point of view of users, I believe it will be hard to explain that Unicode characters are allowed in the domain name but not in the path, or vice versa.
On Mon, Dec 07, 2020 at 09:24:08AM +0100, Côme Chilliet <come at chilliet.eu> wrote a message of 8 lines which said:

> It's 2020, can we please be allowed to use French in our links?

And it is even more important for people who use scripts like Arabic, Chinese, Devanagari, etc.

> what's the rationale for limiting to ASCII?

Well, properly handling IRIs requires changing the specification, and also changing software. One of the points of Gemini being to be simple to implement, this certainly requires consideration. However, it is an issue similar to the TLS one: TLS is big and complicated (certainly even more than Unicode) and yet Gemini *requires* it. After all, it will typically be handled by a library, not by the guy or gal who writes yet another Gemini server. Same thing for Unicode.
On Mon, Dec 07, 2020 at 06:29:34PM +0900, bie <bie at 202x.moe> wrote a message of 29 lines which said:

> There is one really good reason - it won't work well with existing
> servers and clients.

This is a bad reason, since the specification is not stabilized yet and Gemini is basically very experimental. Nothing is cast in stone and we don't have to maintain compatibility.

> for very little value added.

I strongly disagree. If Gemini is only for the world elites who speak English, it is much less interesting.
> > for very little value added. > > I strongly disagree. If Gemini is only for the world elites who speak > english, it is much less interesting. It's not, though. I've got servers running on international domain names, and the majority of the pages I'm serving have Japanese characters in the paths. This works *today* in every single gemini client I've tried, because the paths are valid URIs (percent-encoded) and the domains work with punycode. Nice clients can show decoded paths and decoded domains, but the beauty of the current approach is that they don't have to. bie
On Mon, Dec 07, 2020 at 11:47:19AM +0100, Stephane Bortzmeyer <stephane at sources.org> wrote a message of 14 lines which said: > No, it wasn't. But it's true that there are two technically different > issues, the domain name and the path which, unfortunately, may require > different treatments. > > From the point of view of users, I believe it will be hard to explain > that Unicode characters are allowed in the domain name but not in the > path, or vice-versa. An example of an IRI issue with the Lagrange client <https://github.com/skyjake/lagrange/issues/73>
Hi

> > It's 2020, can we please be allowed to use French in our links?
>
> And it is even more important for people who use scripts like Arabic,
> Chinese, Devanagari, etc.

I have yet to be convinced that Unicode URLs are a good thing. And I say that as a native speaker of a language which includes glyphs which aren't in US-ASCII.

A URL is an address, in the same way that a phone number or an IP is an address. Ideally these are globally unique, unambiguous and representable everywhere. This address scheme should be independent of any localisation.

We don't insist that phone numbers are rendered in Roman numerals either. My dialing prefix isn't +XXVII. The gemini:// prefix isn't tweeling:// in Dutch.

Using Unicode in addresses balkanises this global space into separate little domains, with subtle ambiguities (is the Cyrillic С the same as a Latin C, who knows?), reducing security, and making crossover harder. If somebody points me at a URL in Kanji or Ethiopic, I would have great difficulty remembering it, never mind recreating it, even if the photo there is useful to the rest of the world. If you are saying "what about the guy from Ethiopia" - well, I suspect he would have trouble with Kanji too... without a common denominator this is an N^2 problem.

I appreciate that many languages are in decline and even facing extinction - but interacting with the internet requires a jargon or specialisation anyway, in the same way that botanists invoke Latin names, mathematicians write about eigenvectors and brain surgeons talk about the hippocampus, all regardless of which languages they speak at home.

TLDR: The words after the gemini => link can be Unicode, the link itself should not.

regards

marc
> TLDR: The words after the gemini => link can be Unicode, the
> link itself should not.

I mostly agree with this, in the sense that the protocol and text/gemini should stick to URLs that are URI-safe (nothing outside the safe 80-something characters).
That said, I don't think there's anything wrong with a friendly client showing percent-decoded Unicode representations of a path or punycode-decoded representations of an international domain name in the address bar or anywhere else in the interface.

In the same vein, if a server wants to be extra friendly to gmi file authors, it can, like I suggested earlier, allow users to name and link to files in Unicode, but percent-encode everything before sending it over the wire. I actually implemented this in my personal gemini server today, and it was a trivial change (especially when compared to what I'd have to do to properly validate IRIs...), allowing me to write "=> 雑念/ 雑念" and have it sent to the client as "=> %e9%9b%91%e5%bf%b5/ 雑念".

bie
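A rewrite of link lines like the one bie describes can be sketched in a few lines of Python (this is not bie's actual implementation, just the idea: percent-encode the URL portion of a `=>` line before it goes over the wire, and leave the human-readable label untouched):

```python
# Sketch of the server-side rewrite described above. The URL part of a
# "=>" link line gets percent-encoded; the label stays as-is.
import urllib.parse

def rewrite_link_line(line: str) -> str:
    if not line.startswith("=>"):
        return line  # only link lines need rewriting
    body = line[2:].strip()
    parts = body.split(maxsplit=1)  # URL, then the optional label
    url = urllib.parse.quote(parts[0], safe=":/?#[]@!$&'()*+,;=")
    label = " " + parts[1] if len(parts) > 1 else ""
    return "=> " + url + label

print(rewrite_link_line("=> 雑念/ 雑念"))
# => %E9%9B%91%E5%BF%B5/ 雑念
```

The `safe` set keeps the URI's reserved delimiters intact so only non-ASCII (and other unsafe) bytes get encoded.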
On Monday, December 7, 2020 5:44 AM, Stephane Bortzmeyer <stephane at sources.org> wrote: > On Sun, Dec 06, 2020 at 11:27:11PM +0000, > colecmac at protonmail.com colecmac at protonmail.com wrote > a message of 18 lines which said: > > > > However, I was under the impression that precomposed characters > > > are no longer supposed to be used, am I horribly wrong? > > > > I don't know about that, but I hope that's true. > > Quite the contrary. RFC 5198 mandates NFC, which maps many characters > to the precomposed form. Yep, you're right. I misread and thought this was being said about the decomposed form. makeworld
Not sure exactly where to jump in, so I'm gonna share my thoughts here.

I think having IRIs would be nice, and I feel bad that currently non-English authors are second-classed in this manner. But from the beginning Gemini has been about being simple -- not for authors, but for programmers. It was intended to be implementable in a weekend, in a few hundred lines, ideally without even needing libraries outside your language's stdlib.

Supporting IRIs is *not* simple. For example, in Python it requires a third-party library[1], and in Go I wasn't even able to find one. This means that in many programming languages, no one would be able to even begin writing a Gemini client before writing a library that parses and conforms to the complex specification that is IRIs.

Secondly, this would be a large breaking change for Gemini. Even if IRIs were supported in all programming languages, I don't think making breaking changes to Gemini is feasible at this point. Things are too set, and attempting to do this would break the ecosystem.

Lower down in the thread, Stephane Bortzmeyer mentions:

> From the point of view of users, I believe it will be hard to explain
> that Unicode characters are allowed in the domain name but not in the
> path, or vice-versa.

This is true and unfortunate. My proposal[2] is only about domain names, and so this would have to be explained to users. But as I've outlined above, using IRIs would be virtually impossible, and so I think supporting IDNs in link lines is the best we can give non-English authors.

1: https://stackoverflow.com/a/12565315/7361270
2: https://github.com/makeworld-the-better-one/go-gemini/issues/10

Thanks,
makeworld
> On Dec 7, 2020, at 14:09, bie <bie at 202x.moe> wrote: > > That said, I don't think there's anything wrong with a friendly client > showing percent-decoded unicode representations of a path or > punycode-decoded representations of an international domain name in the > address bar or anywhere else in the interface. > > In the same vein, if a server wants to be extra friendly to gmi file > authors, it can, like I suggested earlier, allow users to name and link > to files in unicode, but percent-encode everything before sending it to > over the wire. This. It's the job of the internationally minded client and server to do the proper legwork for the end user so the over-the-wire format is correct. No need to change anything in the protocol itself, but rather an opportunity for clients and servers to distinguish themselves. Alternatively: Unidecode! https://interglacial.com/tpj/22/ Sean M. Burke, Winter, 2001
On Monday, December 7, 2020 8:56 AM, <colecmac at protonmail.com> wrote: > Not sure exactly where to jump in, so I'm gonna share my thoughts here. > <snip> One last thing to add to this email: Now that I've outlined the issues with IRIs, could we get back to talking about the original IDN idea? Does anyone have issues with this proposal[1]? I'm hoping to consolidate everything there, and Solderpunk can look at that and make his decision. 1: https://github.com/makeworld-the-better-one/go-gemini/issues/10 Cheers, makeworld
On Mon, Dec 07, 2020 at 08:19:07PM +0900, bie <bie at 202x.moe> wrote a message of 17 lines which said: > I've got servers running on international domain names, and the majority > of the pages I'm serving have Japanese characters in the paths. > This works *today* in every single gemini client I've tried, I don't know which ones you tried but Amfora, AV-98, Bombadillo and Lagrange all fail on such names.
On Mon, Dec 07, 2020 at 01:30:41PM +0100, marc <marcx2 at welz.org.za> wrote a message of 45 lines which said:

> A URL is an address, in the same way that a phone number or an IP
> is an address. Ideally these are globally unique, unambiguous and
> representable everywhere. This address scheme should be independent
> of any localisation.

This theory, in the world of domain names, is wrong. RFC 2277 says that "protocol elements" (basically, the things the user does not see, such as the MIME type text/gemini) do not have to be internationalized. Everything else ("text", says the RFC) must be internationalized, simply because the world is like that, with multiple scripts and languages. Now, identifiers, like domain names, are a complicated case, since they are both protocol elements and text. But, since they are widely visible (in advertisements, business cards, etc.), I believe they should be internationalized, too.

> Using Unicode in addresses balkanises this global space

The English-speaking space is not a global space: it is the space of a minority of the world population.

> subtle ambiguities (is the Cyrillic С the same as a Latin C, who knows?),

There is no ambiguity, U+0421 is different from U+0043.

> reducing security,

That's false. I still wait to see an actual phishing email with Unicode. Most of the time, the phisher does not even bother to have a realistic URL; they advertise <http://evil.example/famousbank> and it works (few people check URLs). Anyway, the goal of Gemini is not to do online banking, so this is not really an issue.

> If somebody points me at a URL in Kanji or Ethiopic, I would have
> great difficulty remembering it, never mind recreating it,

It is safe to assume that a URL in Ethiopic is for people who read the relevant script, so it is not a problem.

> without a common denominator this is an N^2 problem.

There is no common denominator (unless someone decided that everybody must use English, but I don't remember such a decision).
> but interacting with the internet requires
> a jargon or specialisation anyway, in the same way that botanists
> invoke Latin names, mathematicians write about eigenvectors
> and brain surgeons talk about the hippocampus, all regardless
> of which languages they speak at home.

OK, then let's all use Hangul for URLs. (It's a nice script, very regular, so it is convenient for computer programs.)
On Mon, Dec 07, 2020 at 03:07:49PM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 31 lines which said:

> It's the job of the internationally minded client and server to do
> the proper legwork for the end user so the over-the-wire format is
> correct.
>
> No need to change anything in the protocol itself, but rather an
> opportunity for clients and servers to distinguish themselves.

This is what gives us the current situation. There is no interoperability, because each client and server did it in a different way, or not at all. Since Gemini (for good reasons) has no User-Agent and no negotiation of options, we must specify clearly how Unicode is handled, or the geminispace won't be safe for Unicode.
> > I've got servers running on international domain names, and the majority
> > of the pages I'm serving have Japanese characters in the paths.
> > This works *today* in every single gemini client I've tried,
>
> I don't know which ones you tried but Amfora, AV-98, Bombadillo and
> Lagrange all fail on such names.

You cut off the important part of my reply, which specifies that the names are percent-encoded or punycoded... here are some examples, all of which work in Amfora, AV-98 and Lagrange (probably Bombadillo too):

gemini://blekksprut.net/%e6%97%a5%e5%b8%b8%e9%91%91%e8%b3%9e/
(a friendly client could choose to display this as gemini://blekksprut.net/日常鑑賞/)

gemini://xn--td2a.jp/
(a friendly client could choose to display this as gemini://蛸.jp/)

Even in the simplest of user agents, these URIs work, and more advanced clients can choose to display them in more user-friendly ways.

bie
> On Dec 7, 2020, at 16:34, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> This is what gives us the current situation. There is no
> interoperability because each client and server did it in a different
> way, or not at all.

As pointed out (and demonstrated) by bie multiple times, all is good... as long as one bothers to properly encode everything :)

And of course, clients and servers may want to go the extra length to facilitate the encoding for the end users.
> On Dec 7, 2020, at 16:32, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> The English-speaking space is not a global space

No, but the internet plumbing is de facto US-ASCII. This doesn't have to be a problem if one doesn't make it so.

"How can you govern a planet which has 1,965,246 varieties of encoding?"
-- Charlie de la Gaule
On Mon, 7 Dec 2020 17:33:23 +0100, Petite Abeille <petite.abeille at gmail.com> wrote:

> > On Dec 7, 2020, at 16:32, Stephane Bortzmeyer <stephane at sources.org> wrote:
> >
> > The English-speaking space is not a global space
>
> No, but the internet plumbing is de facto US-ASCII.

If you don't start somewhere, that will never improve.

> This doesn't have to be a problem if one doesn't make it so.
>
> "How can you govern a planet which has 1,965,246 varieties of encoding?"
> -- Charlie de la Gaule

I'm pretty sure it was about cheeses in the original quote :)
Hi,

Some thoughts on the answers on the topic of Unicode links. (I will focus on Unicode in the path rather than in the domain here.)

First, I wanted to point out that almost no one uses them on the French Web. Some used that as an argument against having Unicode in URIs, but I think no one uses them because of the punycode and percent-encoding weirdness.

I read part of RFC 3987 (IRI) and part of RFC 3986 (URI) and still do not understand what the horrible added complexity is that you are talking about. Could the people asserting that IRIs are a complex hell, impossible to implement, point to the exact problems with IRIs?

Here is the life cycle of a link in a page:

1 - The author writes it
2 - The server saves it
3 - A client requests the page from the server
4 - The server sends it
5 - The client displays it
6 - The user clicks it
7 - The client resolves the hostname
8 - The client sends it as a request to the server
9 - The server fetches the associated page

I think we can safely assume that the author will not write percent-encoding without help. So, with bie's suggestion that clients and servers help by percent-encoding, while the author/user only has to deal with Unicode, it means:

1 - somewhere between step 1 and step 4, the server has to percent-encode the link
2 - somewhere between step 4 and step 5, the client needs to decode it
3 - in step 8, either the client stored the encoded link or it has to re-encode it again; and if someone copy/pastes, they have to re-encode
4 - in step 9, the server needs to decode it to get the real target path

If we just use the UTF-8 path all along, points 1 through 3 are not needed. 4 still is, because some links will still be percent-encoded and the server needs to understand them.

> Petite Abeille <petite.abeille at gmail.com>:
> No, but the internet plumbing is de facto US-ASCII.

If this is true, why bother with responses in UTF-8?
Regarding the breaking change argument, I think it is a bit weak; testing shows there is no consistency in how different clients/servers handle unicode currently.

> bie:
> I actually implemented this in my personal gemini server today, and it
> was a trivial change (especially when compared to what I'd have to do
> to properly validate IRIs...), allowing me to write "=> ??/ ??" and
> have it sent to the client as "=> %e9%9b%91%e5%bf%b5/ ??".

If you are all this comfortable with links that look like "%e9%9b%91%e5%bf%b5", let's go the whole way and percent-encode ascii as well. Let's see how long before you change your mind after using this kind of stuff on a daily basis. And at least this would put all languages at the same point.

> colecmac at protonmail.com
> Supporting IRIs is *not* simple. For example, in Python it requires a
> third-party library[1], and in Go I wasn't even able to find one. This
> means that in many programming languages, no one would be able to even
> begin writing a Gemini client before writing a library that parses and
> conforms to the complex specification that is IRIs.

On the server I wrote in PHP, getting a request in UTF-8 worked without me doing anything for it. Not accepting IRI would actually require me to
Côme's reply here asserts that a client would never need to parse IRIs, and so there's no added complexity: just copy the IRI from the link line, do DNS, and send the IRI to the server. But this is not true; a client would need to do parsing.

What parsing would a client have to do?

- Extracting the domain, so it can be punycoded for DNS lookups
- Resolving relative IRIs would require parsing the current IRI and the provided one, and combining them. You cannot just copy it to make the request.
- When receiving an input status code on a page that already has a query string, the IRI has to be parsed to detect that there is a query string, and then remove and replace it with the new input from the user.
- Extracting the path to get a name for downloading files
- Etc.

There are many reasons why a client would need to be able to parse an IRI, the relative link one and the DNS one being the most important.

This would then require IRI parsing libraries, and as I have explained earlier, these don't exist in likely many programming languages, and when they do, they are third-party. For this reason, as well as the previously stated reason of this being a large breaking change, I can't support a switch to IRIs. IDNs, on the other hand... :)

Cheers,
makeworld
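For reference, here is what those client-side parsing tasks look like with Python's standard urllib (the URL is made up; note that urllib does not know the gemini scheme by default, so it has to be registered first, a trick some Python Gemini clients use):

```python
from urllib.parse import (urlsplit, urlunsplit, urljoin,
                          uses_relative, uses_netloc)

# Teach urllib the gemini scheme so relative resolution works.
uses_relative.append("gemini")
uses_netloc.append("gemini")

base = "gemini://example.org/dir/page.gmi?old=query"

# Domain extraction, e.g. for punycoding before a DNS lookup:
host = urlsplit(base).hostname               # 'example.org'

# Relative reference resolution:
resolved = urljoin(base, "../other.gmi")     # 'gemini://example.org/other.gmi'

# Replacing the query string after an input status code:
parts = urlsplit(base)
with_input = urlunsplit(parts._replace(query="user%20input"))
# 'gemini://example.org/dir/page.gmi?user%20input'
```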
On Mon Dec 7, 2020 at 7:30 AM EST, marc wrote: > Using unicode in addresses balkanises this global space into > separate little domains, with subtle ambiguities (is the > cyrilic C the same as a latin - C, who knows ?), reducing > security, and making crossover harder. I don't think that using unicode in addresses would decrease security because of the way that Gemini handles client authentication. Since client certificates are limited to certain domains and paths, the certificate will never be applied to the wrong domain, even if it looks the same to the user.
> On Dec 7, 2020, at 18:01, Solene Rapenne <solene at perso.pw> wrote:
>
>> No, but the internet plumbing is de facto US-ASCII.
>
> If you don't start somewhere, that will never improve.

Would it make it less controversial if we referred to it as ISO-IR-006 encoding? :D
> On Dec 7, 2020, at 18:35, Côme Chilliet <come at chilliet.eu> wrote:
>
> If this is true, why bother with responses in utf-8?

The response has a textual part at times, which is UTF-8 encoded. Assuming a 2x response code, the content itself is defined by its content-type, which can be anything, in any encoding, following any relevant convention.

UTF-8 (aka Universal Coded Character Set Transformation Format, 8-bit) is itself an encoding of Unicode. There is no such thing as plain text in 2020.

But this is about URLs, no? As long as gemini follows established standards, then one must deal with encodings as defined by those standards. Not sure why this is controversial. The tooling exists. No one writes UTF-8 by hand. Ditto for URL encoding/decoding. Use the Tools, Luke.

P.S. There is perhaps a bit of a sleight of hand running through gemini's rhetoric about how "simple" everything is. But nothing is *that* simple once one looks at the details. The rabbit hole runs deep. Rome was not built in one day. Nor are gemini's foundations.

P.P.S. For entertainment purposes, the DNS RFC dependency graph [pdf]: https://emaillab.jp/wp/wp-content/uploads/2017/11/RFC-DNS.pdf
> On Dec 7, 2020, at 18:35, Côme Chilliet <come at chilliet.eu> wrote:
>
> First, I wanted to point out that almost no one uses them on the French Web. Some used that as an argument against having unicode in URIs, but I think no one uses them because of the punycode and percent-encoding weirdness.

Your very own email address is a good example of where tooling makes a difference. It nicely reads as Côme Chilliet <come at chilliet.eu> -with accent circonflexe & all- but of course is ISO-IR-006 encoded under the hood as =?ISO-8859-1?Q?C=F4me?= Chilliet <come at chilliet.eu>. I suspect you didn't type the encoding by hand, nor think about it twice. It "just" works :)
On 12/7/20 12:00 PM, colecmac at protonmail.com wrote:
> What parsing would a client have to do?
>
> - Extracting the domain, so it can be punycoded for DNS lookups

Can we be sure gemini host resolution will always use the global DNS?

Section 4 of RFC 6055 cautions against assuming that all name resolution is using the global DNS and therefore that querying with punycode domain names will succeed:

   It is inappropriate for an application that calls a general-purpose
   name resolution library to convert a name to an A-label unless the
   application is absolutely certain that, in all environments where the
   application might be used, only the global DNS that uses IDNA
   A-labels actually will be used to resolve the name.

Conversely, querying with utf8 domain names fails on Ubuntu 20.04 using systemd-resolved [1]. Some languages/libraries such as Python convert utf8 requests to punycode silently before submitting the request to the resolver [2].

[1] C program fails without punycode conversion

#include <netdb.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

void show_ip(char *name)
{
    struct hostent *entry = gethostbyname(name);
    if (entry) {
        printf("name '%s' has ip address\n", entry->h_name);
        printf("ip: %s\n\n", inet_ntoa(*(struct in_addr *)entry->h_addr));
    } else {
        printf("error querying '%s': %s\n", name, hstrerror(h_errno));
    }
}

int main(void)
{
    show_ip("xn--td2a.jp");
    show_ip("?.jp");
    return 0;
}

[2] Python program succeeds with *implicit* punycode conversion

import socket

def show_ip(name):
    print("name '%s' has ip '%s'" % (name, socket.gethostbyname(name)))

show_ip('xn--td2a.jp')
show_ip('?.jp')
> On Dec 7, 2020, at 19:00, colecmac at protonmail.com wrote:
>
> IDNs, on the other hand... :)

The "Internationalized Domain Names (IDN) FAQ" makes for entertaining reading:

https://unicode.org/faq/idn.html

Special mention of our very own Stéphane Bortzmeyer under "Doesn't the removal of symbols and punctuation in IDNA2008 help security?":

Le hameçonnage n'a pas de rapport avec les IDN ("Phishing has nothing to do with IDNs")
https://www.bortzmeyer.org/idn-et-phishing.html

(short answer: no)

All encrypted in French sadly :P

Happy hameçonnage. Fun, fun, fun.
On Monday, December 7, 2020 4:00 PM, Scot <gmi1 at scotdoyle.com> wrote:

> On 12/7/20 12:00 PM, colecmac at protonmail.com wrote:
>
> > What parsing would a client have to do?
> >
> > - Extracting the domain, so it can be punycoded for DNS lookups
>
> Can we be sure gemini host resolution will always use the global DNS?
>
> Section 4 of RFC 6055 cautions against assuming that all name resolution
> is using the global DNS and therefore that querying with punycode
> domain names will succeed:
>
>    It is inappropriate for an application that calls a general-purpose
>    name resolution library to convert a name to an A-label unless the
>    application is absolutely certain that, in all environments where the
>    application might be used, only the global DNS that uses IDNA
>    A-labels actually will be used to resolve the name.

That's interesting, thanks for sharing. However, it seems obvious to me that punycoding is a necessity, since the global DNS system won't work without it.

I've worked with offline mesh network systems, but never had to handle Unicode domain names. However, all of our stack was software that was intended to work on the Internet, as well as any other network. Standard DNS servers, standard OS and stdlib DNS resolvers, etc. So punycoding would be the right way to do it in that network too.

Despite what this RFC says, I don't see what situation would actually completely fail on punycoded domains. I guess the spec could mandate trying with punycode first, then Unicode, but that seems needless to me. Do you have an example of a system/network that fails on punycode?

> Conversely, querying with utf8 domain names fails on Ubuntu 20.04
> using systemd-resolved [1].

Yep, that's what I meant when I called it a necessity.

> Some languages/libraries such as Python convert utf8 requests to
> punycode silently before submitting the request to the resolver [2].

That's pretty handy, but it doesn't change my advice.
The spec can state that all domains must be punycoded for DNS, and maybe your library will handle that or not. Even if an unaware Pythonista manually punycodes the domain, nothing bad will happen when the library tries again.

Cheers,
makeworld
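As a sketch of that silent conversion, Python's built-in "idna" codec does the punycoding both ways (using Stephane's test hostname from earlier in the thread):

```python
# ToASCII: what goes into the DNS query; ToUnicode: what the user sees.
host = "gémeaux.bortzmeyer.org"

alabel = host.encode("idna").decode("ascii")
# 'xn--gmeaux-bva.bortzmeyer.org'

ulabel = alabel.encode("ascii").decode("idna")
assert ulabel == host   # the conversion round-trips
```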
> On Dec 7, 2020, at 18:35, Côme Chilliet <come at chilliet.eu> wrote:
>
> (On a more general note, I guess everyone understood english is not my mother tongue, sorry if I'm being rude or something like that, I'm not trying to. I just really believe using utf-8 here would be better, but I understand there are complex technical questions involved)

(hopefully) this space operates under the so-called "Crocker's Rules"*:

(perhaps) worthwhile quoting in full:

Declaring yourself to be operating by "Crocker's Rules" means that other people are allowed to optimize their messages for information, not for being nice to you. Crocker's Rules means that you have accepted full responsibility for the operation of your own mind - if you're offended, it's your fault. Anyone is allowed to call you a moron and claim to be doing you a favor. (Which, in point of fact, they would be. One of the big problems with this culture is that everyone's afraid to tell you you're wrong, or they think they have to dance around it.) Two people using Crocker's Rules should be able to communicate all relevant information in the minimum amount of time, without paraphrasing or social formatting. Obviously, don't declare yourself to be operating by Crocker's Rules unless you have that kind of mental discipline.

Note that Crocker's Rules does not mean you can insult people; it means that other people don't have to worry about whether they are insulting you. Crocker's Rules are a discipline, not a privilege. Furthermore, taking advantage of Crocker's Rules does not imply reciprocity. How could it? Crocker's Rules are something you do for yourself, to maximize information received - not something you grit your teeth over and do as a favor.

"Crocker's Rules" are named after Lee Daniel Crocker.

http://sl4.org/crocker.html
https://en.wikipedia.org/wiki/Lee_Daniel_Crocker
It was thus said that the Great Côme Chilliet once stated:
> Hi,
>
> Some thoughts on answers on the topic of unicode links. (I will focus on
> unicode in path rather than in domain here).
>
> First, I wanted to point out that almost no one uses them on the French
> Web. Some used that as an argument against having unicode in URIs, but I
> think no one uses them because of the punycode and percent encoding
> weirdness.
>
> I read part of RFC 3987 (IRI) and part of RFC 3986 (URI) and still do
> not understand what is the horrible added complexity you are talking
> about. Could people asserting IRI is a complex hell impossible to
> implement point to the exact problems with IRI?

I'm reading through RFC-3987, and sections 4 and 5 give me pause. Section 4 relates to bidirectional IRIs (right-to-left languages). This is mostly a client issue (I think) with the displaying of such. Section 5 is the scarier of the two---normalization and comparison, and would most likely affect servers more than clients (again, I think). There are two examples given:

	http://www.example.org/résumé.html   (é as a precomposed character)
	http://www.example.org/résumé.html   (é as e plus a combining character)

The first uses a precomposed character and the second uses a combining character, so the two render identically but compare unequal. I'm looking at the Unicode normalization standard [1], and the first thing that struck me was that I had *not* thought of the order of multiple combining characters. Oh, there's also Hangul and conjoining jamo. And then ... well, I'll spare you the horrors of that 32k document, but the upshot is---yes, that's yet *another* library I have to track down (and update as the Unicode standard is regularly updated). Also, related question---what's the filename on the server?

The "horrible added complexity" is not RFC-3987 per se, but the "horrible added complexity" of Unicode normalization that is required. Is that a valid excuse? Perhaps not. But there *is* the issue that a lot of people are having with Python 3 and filenames.
If you hit a filename that isn't UTF-8, the Python 3 script breaks badly. Yes, there *is* a link in my mind between these two issues but I'm not sure I can verbalize it coherently at this time. Perhaps "I will focus on unicode in the path" reminded me of the Python 3 issue.

> Regarding the breaking change argument, I think it is a bit weak, testing
> shows there is no consistency in how different clients/servers handle
> unicode currently.

 ...

> (Note that these are real non-rhetorical questions, I'm not trying to deny
> that handling IRI would be hard, I'm trying to understand why)

Methinks you inadvertently answered your own question---Unicode is *not* easy [1][2][3][4].

 -spc

[1] https://www.unicode.org/reports/tr15/tr15-50.html
[2] https://www.unicode.org/reports/tr9/tr9-42.html
[3] https://www.unicode.org/reports/tr14/tr14-45.html
[4] Among others. The full current standard: http://www.unicode.org/versions/Unicode13.0.0/
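The résumé example Sean gives can be reproduced with explicit escapes; Python's standard unicodedata module shows both the comparison problem and the NFC fix:

```python
import unicodedata

precomposed = "r\u00e9sum\u00e9.html"    # é as single code point U+00E9
combining   = "re\u0301sume\u0301.html"  # e followed by U+0301 COMBINING ACUTE ACCENT

assert precomposed != combining          # render the same, compare unequal
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```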
On Monday, 7 December 2020 at 19:00:02 CET, colecmac at protonmail.com wrote:
> Côme's reply here asserts that a client would never need to parse
> IRIs, and so there's no added complexity. Just copy the IRI from the
> link line, do DNS, and send the IRI to the server. But this is not
> true, a client would need to do parsing.
>
> What parsing would a client have to do?
>
> - Extracting the domain, so it can be punycoded for DNS lookups

True, thanks for pointing that out.

> - Resolving relative IRIs would require parsing the current IRI,
> and the provided one, and combining them. You cannot just copy it
> to make the request.

Also true, but it should be the same value that is extracted for DNS.

> - When receiving an input status code on a page that already has a
> query string, the IRI has to be parsed to detect that there is a
> query string, and then remove and replace it with the new input of
> the user.

Good to know, I did not think of the query string situation.

> - Extracting the path to get a name for downloading files
> - Etc.
>
> There are many reasons why a client would need to be able to parse an
> IRI, the relative link one and DNS one being the most important.
>
> This would then require IRI parsing libraries, and as I have explained
> earlier, these don't exist in likely many programming languages, and
> when they do, they are third-party.

From what you said on irc, the situation is different between URI and IRI because most languages have URI parsing either in their stdlib or in a well-tested known library. But if no project uses IRI, of course no one will write a library for it; this is a chicken-and-egg situation.

Also, for the purpose of a client, it seems to me the parsing needed (domain and query extraction) is only to search for the first "/" and the last "?", and maybe some minor tweaks on the scheme (which does not contain unicode; I will leave the scheme alone, promise).
Note: Just tried gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69 in lagrange, it does work.

Côme
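Decoding that test URL's path with Python's urllib confirms it is just /docs/faq.gmi in percent-encoded ASCII clothing:

```python
from urllib.parse import unquote

path = unquote("/%64%6f%63%73/%66%61%71%2e%67%6d%69")
assert path == "/docs/faq.gmi"
```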
Glad to hear that you realize how it's more complex than you originally thought.

> > - Resolving relative IRIs would require parsing the current IRI,
> > and the provided one, and combining them. You cannot just copy it
> > to make the request.
>
> Also true, but it should be the same value that is extracted for DNS.

No. I'm referring to things like this:

=> /docs/
=> example.gmi
=> dir/test/foo.gmi
=> //gus.guru/

These are all relative in some way, and they must be resolved in reference to the IRI (or for Gemini right now, the URI) of the current page. This is not the same as the domain that was extracted for DNS, and requires a full parser.

> From what you said on irc, the situation is different between URI and IRI
> because most languages have URI parsing either in their stdlib or in a well
> tested known library. But, if no project use IRI, of course no one will
> write a library for it, this is a chicken and egg situation here.

Yep, it's a shame. But we must live with it, and so URIs are the way forward.

> Also, for the purpose of a client, it seems to me the parsing needed
> (domain and query extraction) is only to search for the first "/" and the last
> "?", and some minor tweaks on the scheme maybe (which does not contain unicode,
> I will leave the scheme alone, promise).

It's always more complex than that. I'm a bit too tired to go dig into the RFCs to prove it right now, but I would not trust software that just matches some characters instead of compliantly parsing things in their entirety. This method would make Gemini more complex and easily introduce bugs. If we use URIs, we don't have to resort to this.

> Note: Just tried gemini://gemini.circumlunar.space/%64%6f%63%73/%66%61%71%2e%67%6d%69
> in lagrange, it does work.

Works in Amfora too, and note that the server software (Molly Brown) is also accepting and parsing it correctly into a file path. But that's expected, because it's perfectly valid to percent-encode ASCII in a URL path.
Cheers, makeworld
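The four relative link forms makeworld lists above resolve like this with Python's urljoin (the base URL is made up; urllib needs the gemini scheme registered before it will resolve relative references for it):

```python
from urllib.parse import urljoin, uses_relative, uses_netloc

# Teach urllib the gemini scheme so relative resolution applies.
uses_relative.append("gemini")
uses_netloc.append("gemini")

base = "gemini://example.org/dir/page.gmi"

assert urljoin(base, "/docs/") == "gemini://example.org/docs/"
assert urljoin(base, "example.gmi") == "gemini://example.org/dir/example.gmi"
assert urljoin(base, "dir/test/foo.gmi") == "gemini://example.org/dir/dir/test/foo.gmi"
assert urljoin(base, "//gus.guru/") == "gemini://gus.guru/"
```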
It was thus said that the Great Côme Chilliet once stated:
> On Monday, 7 December 2020 at 19:00:02 CET, colecmac at protonmail.com wrote:
> >
> > This would then require IRI parsing libraries, and as I have explained
> > earlier, these don't exist in likely many programming languages, and
> > when they do, they are third-party.
>
> From what you said on irc, the situation is different between URI and IRI
> because most languages have URI parsing either in their stdlib or in a
> well tested known library. But, if no project use IRI, of course no one
> will write a library for it, this is a chicken and egg situation here.

I'm looking at RFC-3987 [1] and the changes from RFC-3986 [2] are minimal, and it would be easy to modify my own URI parsing library [3] (which is based directly off the BNF of RFC-3986), but that only gets me so far. The other issue is Unicode normalization and punycode support, both of which I would have to track down existing libraries for or (and I shudder to think of this) write my own.

> Also, for the purpose of a client, it seems to me the parsing needed
> (domain and query extraction) is only to search for the first "/" and the
> last "?", and some minor tweaks on the scheme maybe (which does not
> contain unicode, I will leave the scheme alone, promise).

And then do some Unicode normalization to match how filenames are stored on your server:

	http://www.example.org/résumé.html   (é as a precomposed character)
	http://www.example.org/résumé.html   (é as e plus a combining character)

 -spc

[1] https://tools.ietf.org/html/rfc3987
[2] https://tools.ietf.org/html/rfc3986
[3] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua
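A server could paper over the two spellings by normalizing the decoded path before touching the filesystem. A sketch (the helper name is made up, and it assumes filenames on disk are stored in NFC, which is not true of every filesystem):

```python
import unicodedata
from urllib.parse import unquote

def request_to_filename(raw_path: str) -> str:
    # Decode percent-escapes, then normalize to NFC before lookup.
    return unicodedata.normalize("NFC", unquote(raw_path))

# Precomposed (%C3%A9 = U+00E9) and combining (%CC%81 = U+0301)
# spellings of "résumé.html" now map to the same name:
a = request_to_filename("/r%C3%A9sum%C3%A9.html")
b = request_to_filename("/re%CC%81sume%CC%81.html")
assert a == b
```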
On Mon, 07 Dec 2020 18:13:17 GMT "Adnan Maolood" <me at adnano.co> wrote:

> I don't think that using unicode in addresses would decrease security
> because of the way that Gemini handles client authentication. Since
> client certificates are limited to certain domains and paths, the
> certificate will never be applied to the wrong domain, even if it looks
> the same to the user.

Security might also mean knowing that I'm not unintendedly divulging any details of my browsing habits to some unknown third party, or knowing that no one can impersonate your server and pages to mislead your readers. Neither of those necessarily involve client certificates at all, but both are a real possibility when multiple code points can represent similar or identical glyphs.

I can see how IRI and IDN may be a good idea in terms of including languages with bigger or altogether different alphabets or non-alphabets, but from the perspective of an implementer, it does add a lot of complexity and opens up to homograph attacks in an area where ASCII transliteration is already the norm. Some browsers deal with homograph attacks by displaying punycode directly based on some basic heuristic (e.g. when a hostname contains both cyrillic and latin code points).

I don't know much about IRI. Web browsers for example sort of skipped on this standard in favor of the WHATWG URL spec.

Personally, I think some concessions need to be made to maintain the simplicity of the protocol. The currently mandated standard is (relatively) short and simple to implement, and transliteration is already pervasive in the area of internet names and URIs. Octet-encoded ASCII does have the nice property that there are no homographs, there's no normalization, there's no bidirectional text etc., and there is no database of rules that have to be applied to handle these things.

That said, I think a lot can be improved on the client side without changing the standard.
Clients can optionally do the ToASCII/ToUnicode dance and correspondingly automatically percent encode input and display "un-percented" paths in some circumstances. The standard only specifies what needs to be sent to the server to request a resource, and what text/gemini documents need to contain to produce a link. This opens up a lot of quality of life improvements on the user interface level. RFC 4690 is a good read on the topic of IDNs.

-- 
Philip
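One such quality-of-life improvement, sketched in Python (the URL is illustrative): keep the ASCII form on the wire, but decode it for display.

```python
from urllib.parse import urlsplit, unquote

# What actually went over the wire:
wire = "gemini://xn--gmeaux-bva.bortzmeyer.org/caf%C3%A9"

parts = urlsplit(wire)
pretty = "gemini://%s%s" % (
    parts.hostname.encode("ascii").decode("idna"),  # ToUnicode for the host
    unquote(parts.path),                            # un-percent the path
)
assert pretty == "gemini://gémeaux.bortzmeyer.org/café"
```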
On 12/7/20 3:46 PM, colecmac at protonmail.com wrote:
> On Monday, December 7, 2020 4:00 PM, Scot <gmi1 at scotdoyle.com> wrote:
>
>> On 12/7/20 12:00 PM, colecmac at protonmail.com wrote:
>>
>>> What parsing would a client have to do?
>>>
>>> - Extracting the domain, so it can be punycoded for DNS lookups
>>
>> Can we be sure gemini host resolution will always use the global DNS?
>>
>> Section 4 of RFC 6055 cautions against assuming that all name resolution
>> is using the global DNS and therefore that querying with punycode
>> domain names will succeed:
>>
>>    It is inappropriate for an application that calls a general-purpose
>>    name resolution library to convert a name to an A-label unless the
>>    application is absolutely certain that, in all environments where the
>>    application might be used, only the global DNS that uses IDNA
>>    A-labels actually will be used to resolve the name.
>
> ... Do you have an example of a system/network that fails on punycode?

Yes, an organization's internal network resolver or a user's local resolver could reply to utf8 queries but not punycode queries.

For example, adding the line:

   10.99.99.1   ??.jp

to /etc/hosts on Ubuntu 20.04 with resolver systemd-resolved and running the test program [1] gives this output:

   error querying 'xn--td2aa.jp': Unknown server error

   name '??.jp' has ip address 10.99.99.1

[1]

#include <netdb.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

void show_ip(char *name)
{
    struct hostent *entry = gethostbyname(name);
    if (entry) {
        printf("name '%s' has ip address %s\n\n", entry->h_name,
               inet_ntoa(*((struct in_addr *)entry->h_addr)));
    } else {
        printf("error querying '%s': %s\n\n", name, hstrerror(h_errno));
    }
}

int main(void)
{
    show_ip("xn--td2aa.jp");
    show_ip("??.jp");
    return 0;
}
> > ... Do you have an example of a system/network that fails on punycode?
>
> Yes, an organization's internal network resolver or a user's local
> resolver could reply to utf8 queries but not punycode queries.
>
> For example, adding the line:
>
>    10.99.99.1   ??.jp
>
> to /etc/hosts on Ubuntu 20.04 with resolver systemd-resolved
> and running the test program [1] gives this output:
>
>    error querying 'xn--td2aa.jp': Unknown server error
>
>    name '??.jp' has ip address 10.99.99.1

Thanks for the example, although it seems very contrived to me. Firefox will punycode the domain right after you put it into the address bar, for example, so any network that wants to support web browsing must use a punycoded version. I'm sure there are many other pieces of software that do the same.

Your example doesn't really convince me that a Gemini browser is going to encounter a situation where doing a lookup using the punycoded domain name will be the wrong thing to do. It's not literally impossible for that to be the case, but I don't really see it being an issue at all.

makeworld
On Tue, Dec 08, 2020 at 01:18:07AM +0100, Philip Linde <linde.philip at gmail.com> wrote a message of 69 lines which said:

> homograph attacks

Homograph attacks are basically a good way to make an english-speaking audience laugh when you show them funny Unicode problems (I've seen that several times in several meetings: the languages and scripts of other people are always funny). No bad guys use them in real life, probably because users typically never check the URI or IRI. And they exist with ASCII, too (goog1e.com...)

> Some browsers deal with homograph attacks by displaying punycode
> directly based on some basic heuristic (e.g. when a hostname
> contains both cyrillic and latin codes).

Which is awful for the UX. Note that such mangling is never done for ASCII, which clearly shows a provincial bias toward english.

> Octet encoded ASCII does have the nice property that there are no
> homographs, there's no normalization,

This is not true. Since percent-encoding encodes bytes, there are still several ways to represent "the same" string of characters, and therefore normalization remains an issue.

> RFC 4690 is a good read on the topic of IDNs.

No, it is a one-sided anti-internationalization rant.
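Stephane's point about percent-encoding can be shown in a couple of lines of Python: the same ASCII path has many distinct wire spellings, so comparison still needs a normalization step.

```python
from urllib.parse import unquote

# Four spellings of the same path: plain, encoded 'd', encoded 'o'
# with lowercase hex, encoded 'o' with uppercase hex.
spellings = ["/docs", "/%64ocs", "/d%6fcs", "/d%6Fcs"]

assert len(set(spellings)) == 4                   # all wire forms differ...
assert {unquote(s) for s in spellings} == {"/docs"}  # ...yet decode identically
```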
On Mon, Dec 07, 2020 at 06:00:02PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 32 lines which said:

> What parsing would a client have to do?

...

> This would then require IRI parsing libraries, and as I have explained
> earlier, these don't exist in likely many programming languages, and
> when they do, they are third-party.

For Python (a common programming language), this is not true, the standard library's urlparse has no problem:

% ./test-urlparse.py gemini://gémeaux.bortzmeyer.org:8965/café\?foo=bar
Host name: gémeaux.bortzmeyer.org
Port: 8965
Path: /café
Query: foo=bar

% cat test-urlparse.py
#!/usr/bin/env python3

import sys
import urllib.parse

for url in sys.argv[1:]:
    components = urllib.parse.urlparse(url)
    print("Host name: %s" % components.hostname)
    if components.port is not None:
        print("Port: %s" % components.port)
    print("Path: %s" % components.path)
    if components.query != "":
        print("Query: %s" % components.query)
On Mon, Dec 07, 2020 at 09:46:06PM +0000, colecmac at protonmail.com <colecmac at protonmail.com> wrote a message of 49 lines which said:

> Despite what this RFC says, I don't see what situation would actually
> completely fail on punycoded domains. I guess the spec could mandate trying
> with punycode first, then Unicode, but that seems needless to me. Do you
> have an example of a system/network that fails on punycode?

mDNS (used in Apple's Bonjour). Despite its name, it has little to do with DNS, and it requires UTF-8 (and does not use Punycode).

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6762.txt
On Mon, Dec 07, 2020 at 06:37:51PM -0500, Sean Conner <sean at conman.org> wrote a message of 69 lines which said:

> The "horrible added complexity" is not RFC-3987 per se, but the
> "horrible added complexity" of Unicode normalization that is
> required. [...] Methinks you inadvertently answered your own
> question---Unicode is *not* easy

It would be hard to claim that Unicode is easy :-) But, to be fair, the complexity is in human scripts (for instance the lowercase/uppercase difference, which creates a lot of problems). Unicode just reflects the complexity of human writing systems.
On Tue, 8 Dec 2020 11:29:24 +0100 Stephane Bortzmeyer <stephane at sources.org> wrote:

> For Python (a common programming language), this is not true, standard
> library's urlparse has no problem:

Similar results in Go:

--- code
package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	for _, arg := range os.Args[1:] {
		u, err := url.Parse(arg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q %q %q\n", u.Hostname(), u.Path, u.RawQuery)
	}
}
---

However, this still leaves the problem of punycoding and, worse, normalization to some other piece of code. In Go, normalization is in the text package. ToASCII/ToUnicode implementations are in golang.org/x/net/idna

Not sure if Python will normalize by default.

-- 
Philip
As yet another data point, Java's standard library contains a class (java.net.URI) that correctly parses URIs with non-ASCII characters in their paths and query params, but it chokes when they are in the domain name.

Therefore, URIs like this should work fine with Space Age:

gemini://gemeaux.bortzmeyer.org:8965/café?foo?y=bar?y

But this is a non-starter:

gemini://gémeaux.bortzmeyer.org:8965/café?foo?y=bar?y

It looks like there is an incomplete and poorly documented implementation of RFC 3987 (IRI) and RFC 3986 (URI) in Apache Jena (https://jena.apache.org/documentation/notes/iri.html), but it's a rather heavyweight addition to an otherwise very concise Gemini server.

I'll keep an eye on this thread to see what the community ultimately decides to do about IRI/IDN.

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
So far Python, Go, and Java libs have been mentioned as sort of working with IRI. It's cool that the Python and Go ones seem to work, but I wouldn't trust them, because they aren't intended to support IRIs. The term IRI appears nowhere in either of their docs, and so there could easily be subtle bugs. The Java one is literally called URI, and as Gary Johnson explained, it has issues.

However, these issues are all irrelevant in the face of two things:

- This is the IDN thread, there's a separate thread for IRI :)
- IRIs would be a breaking change to Gemini, which is not feasible or a good idea.

Cheers,
makeworld
It was thus said that the Great Stephane Bortzmeyer once stated:
> On Tue, Dec 08, 2020 at 01:18:07AM +0100,
> Philip Linde <linde.philip at gmail.com> wrote
> a message of 69 lines which said:
>
> > homograph attacks
>
> Homograph attacks are basically a good way to make an english-speaking
> audience laugh when you show them funny Unicode problems (I've seen
> that several times in several meetings: the languages and scripts of
> other people are always funny). No bad guy use them in real life,
> probably because users typically never check the URI or IRI.

True, there's no need currently for homograph attacks if other, simpler means are available.

> And they exist with ASCII, too (goog1e.com...)

True. But a more concerning attack is bitsquatting [1], a much harder attack to thwart. Is it widely used? Hard to say, actually.

> > Some browsers deal with homograph attacks by displaying punycode
> > directly based on some basic heuristic (e.g. when a hostname
> > contains both cyrillic and latin codes).
>
> Which is awful for the UX. Note that such mangling is never done for
> ASCII, which clearly shows a provincial bias toward english.
>
> > Octet encoded ASCII does have the nice property that there are no
> > homographs, there's no normalization,
>
> This is not true. Since percent-encoding encodes bytes, there are
> still several ways to represent "the same" string of characters and
> therefore normalization remains an issue.

Yes, but by "normalization" they mean precomposed characters (like "\u{00E9}") vs. combining characters (like "e\u{0301}"), along with the ordering of consecutive combining characters.

> > RFC 4690 is a good read on the topic of IDNs.
>
> No, it is a one-sided anti-internationalization rant.

Aside from "internationalization is hard", what's so bad about the document? Remember, they *are* (or *were*) trying to retrofit internationalization into protocols that were never designed for it.

 -spc

[1] http://www.dinaburg.org/bitsquatting.html
It was thus said that the Great Stephane Bortzmeyer once stated:
> On Mon, Dec 07, 2020 at 09:46:06PM +0000,
>  colecmac at protonmail.com <colecmac at protonmail.com> wrote
>  a message of 49 lines which said:
>
> > Despite what this RFC says, I don't see what situation would actually
> > completely fail on punycoded domains. I guess the spec could mandate trying
> > with punycode first, then Unicode, but that seems needless to me. Do you
> > have an example of a system/network that fails on punycode?
>
> mDNS (used in Apple's Bonjour). Despite its name, it has little to do
> with DNS, and it requires UTF-8 (and does not use Punycode).

  I was curious about this, having written a DNS library [1]. Given that
it's called "Multicast DNS", uses the same encoding scheme as DNS, and
covers a portion of the DNS namespace, saying it has "little to do with
DNS" is a bit uncharitable (in my opinion). It *is* DNS, over UDP---it
just uses a special IP address and a different port.

  I was also surprised that UTF-8 characters *are* possible in DNS
packets [2].

  I was, however, a bit disappointed that "gémeaux.bortzmeyer.org" and
"xn--gmeaux-bva.bortzmeyer.org" didn't exist.

  -spc

[1] https://github.com/spc476/SPCDNS

[2] And I was happy to see my library could successfully deal with such,
    even though I wasn't consciously aware of it doing so.
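The difference being argued over here is what bytes end up in the DNS label. mDNS carries raw UTF-8 in its labels, while the public-DNS convention converts names to IDNA A-labels (punycode) first. Both wire forms can be produced with Python's built-in codecs; a small sketch, not tied to any particular resolver:

```python
name = "gémeaux"

# IDNA (punycode) A-label, the form classic DNS tooling expects:
print(name.encode("idna"))    # b'xn--gmeaux-bva'

# Raw UTF-8 bytes, the form mDNS puts directly into its labels:
print(name.encode("utf-8"))   # b'g\xc3\xa9meaux'
```

Both are legal label contents as far as the packet format goes (labels are length-prefixed binary strings), which is why a binary-clean DNS library handles the UTF-8 case without special effort.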
On Tue, Dec 08, 2020 at 05:20:38PM -0500,
 Sean Conner <sean at conman.org> wrote
 a message of 29 lines which said:

> I was also surprised that UTF-8 characters *are* possible in DNS packets
> [2].

It has been possible from the beginning, and it was said explicitly in
RFC 2181, twenty-three years ago: "any binary string whatever can be
used as the label of any resource record".

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc2181.txt

> I was, however, a bit disappointed that "gémeaux.bortzmeyer.org" and
> "xn--gmeaux-bva.bortzmeyer.org" didn't exist.

Then it means your DNS resolver is broken, because
xn--gmeaux-bva.bortzmeyer.org (the A-label, the Punycode form) is in
the DNS.

% dig AAAA +noidnout xn--gmeaux-bva.bortzmeyer.org

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Debian <<>> AAAA +noidnout xn--gmeaux-bva.bortzmeyer.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18694
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 4, AUTHORITY: 7, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;xn--gmeaux-bva.bortzmeyer.org. IN AAAA

;; ANSWER SECTION:
xn--gmeaux-bva.bortzmeyer.org. 86400 IN CNAME radia.bortzmeyer.org.
xn--gmeaux-bva.bortzmeyer.org. 86400 IN RRSIG CNAME 8 3 86400 (
                20201218215750 20201204120249 10731 bortzmeyer.org.
                gKtCZZKsjTLdFsSKYtgvz1S+pRkZbxweG+6XOxVhJgYG
                gRzfWB8lhjSPaQ6BNK6YyGQreonObF1R43MDY5oQ66ti
                hNOfPp3/gz4wm5eAy3uzFi7xiwclshsLd0yZEaOPsTo6
                fYKfRp5XCG/yZOg85YdZxJB9LK9q+RIyOycGmI0= )
radia.bortzmeyer.org. 86400 IN AAAA 2001:41d0:302:2200::180
...
Hello again

> > An URL is an address, in the same way that a phone number or an IP
> > is an address. Ideally these are globally unique, unambiguous and
> > representable everywhere. This address scheme should be independent
> > of a localisation.
> >
> > We don't insist that phone numbers are rendered in roman
> > numerals either. My dialing prefix isn't +XXVII. The
> > gemini:// prefix isn't tweeling:// in dutch.
>
> This theory, in the world of domain names, is wrong. RFC 2277 says...

Your reliance on one RFC as an authority while rejecting another RFC as
"a one-sided anti-internationalization rant" does not strike me as being
consistent.

> > reducing security,
>
> That's false. I still wait to see an actual phishing email with
> Unicode. Most of the time, the phisher does not even bother to have a
> realistic URL, they advertise <http://evil.example/famousbank> and it
> works (few people check URLs).
>
> Anyway, the goal of Gemini is not to do online banking so this is not
> really an issue.

There exists a neat quote by a certain B. Russell on people who are so
very sure of themselves.

The Gemini spec fixes the URL length in octets. Various ways of encoding
internationalised data may make it possible for a bad guy to shrink and
grow URLs in unexpected ways and clobber this buffer. The interaction
between filesystems, archiving software or protocol gateways generates
many more aliasing problems.

> Now, identifiers, like domain names,
> are a complicated case, since they are both protocol elements and
> text. But, since they are widely visible (in advertisments, business
> cards, etc), I believe they should be internationalized, too.

Imagine a slightly different world where people don't exchange business
cards, but a small amount of sheet music - their own personal jingle
(retrofuturism, right ?). It turns out sheet music is annotated in
Italian - I think it can say things like "forte" or "pianissimo".
Would you go around and angrily cross out those words to replace them
with your local language ?

> > subtle ambiguities (is the cyrillic C the same as a latin
> > C, who knows ?),
>
> There is no ambiguity, U+0421 is different from U+0043.

There are various insults starting with the Latin C. Rewriting them to
start with the Cyrillic C doesn't make them any less insulting.

> > Using unicode in addresses balkanises this global space
>
> The english-speaking space is not a global space: it is the space of a
> minority of the world population.

[WARNING: wall of text ahead]

I think here we are heading to the core of the argument... of what a
language is. And it is a big split that many don't know how to
articulate: Some see language as a core part of their identity (who
they are) - others see language as a tool for communicating (a
protocol).

I think tying one's identity to a nation/ethnicity and its language sets
one up for conflict both internally (who one is) and externally (between
states). It is also silly - languages actually evolve quite rapidly and
leave significant imprints on each other, while people migrate (or get
conquered, sadly).

So I think it is better *not* to view english as the property of a
particular ethnicity, but as a popular communications protocol - an
earlier protocol might have been Latin, which left significant
influences on english - and if Mandarin (or Hindi, or whatever) ends up
displacing english in turn, then I expect there to be many traces of
english left there too.

It is easy to envy native english speakers - that they have it easier.
But that is not true - being multilingual is a real advantage, in so
many ways: Being able to speak an extra language, for instance, is a
major protective factor against dementia... and every extra language
one learns makes it easier to learn the next. Bible scholars have no
issue acquiring a decent grasp of Hebrew and Ancient Greek, philosophers
might try to read Immanuel Kant in German.
Most of us have arbitrarily tied our identities to a nation state and
thus that nation's language - it really doesn't have to be that way.

regards

marc
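The point earlier in this message about the spec fixing URL length in octets is easy to quantify: the Gemini spec caps the request URL at 1024 bytes, and UTF-8 plus percent-encoding expands every non-ASCII character, so a request that looks short in characters can silently blow past the limit. A quick sketch (the path content is made up for illustration):

```python
from urllib.parse import quote

path = "é" * 200          # 200 characters as the user typed them
encoded = quote(path)     # é -> UTF-8 0xC3 0xA9 -> "%C3%A9" (6 octets)

print(len(path))          # 200 characters...
print(len(encoded))       # ...but 1200 octets, already past the 1024-octet cap
```

So a client has to check the length *after* encoding, not before, and the growth factor differs per script (up to 4 UTF-8 bytes, hence 12 percent-encoded octets, per character).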
> On Dec 9, 2020, at 13:20, marc <marcx2 at welz.org.za> wrote:
>
> Most of us have arbitrarily tied our identities to
> a nation state and thus that nation's language - it
> really doesn't have to be that way.

Amen to that. International provincialism of a sort.

"That gibberish he talked was Cityspeak, gutter talk, a mishmash of
Japanese, Spanish, German, what have you. I didn't really need a
translator. I knew the lingo, every good cop did. But I wasn't going to
make it easier for him." --Rick Deckard

Nevertheless, it would be nice to be able to type UTF-8 directly in
gemini requests & text/gemini links and have everything magically work
without much ado. That would be progress for once.

Word of the week: https://en.wikipedia.org/wiki/Retrofuturism
It was thus said that the Great Stephane Bortzmeyer once stated:
> On Tue, Dec 08, 2020 at 05:20:38PM -0500,
>  Sean Conner <sean at conman.org> wrote
>  a message of 29 lines which said:
>
> > I was, however, a bit disappointed that "gémeaux.bortzmeyer.org" and
> > "xn--gmeaux-bva.bortzmeyer.org" didn't exist.
>
> Then it means your DNS resolver is broken because
> xn--gmeaux-bva.bortzmeyer.org (the A-label, the Punycode form) is in
> the DNS.

  Then I'm going to say this was operator error because I was able to
look up xn--gmeaux-bva.bortzmeyer.org.

  -spc