💾 Archived View for gemi.dev › gemini-mailing-list › 000532.gmi captured on 2023-11-04 at 12:54:32. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Hi, all. The discussion on IRIs and IDNs is a little intense, and I thought I would take a step back and do some reading on it. I'm not monolingual, but I am ISO-8859-1-lingual, if that makes sense, so some of the issues are new to me. So, there's an overview of all the issues involved here: https://www.w3.org/International/articles/idn-and-iri/. This article (from 2008) goes over the things you need to do to implement support for IRIs, without going too much into the technical details. It makes things look pretty straightforward and cut-and-dried, but... In terms of actual standardization, things are kind of a mess. See this page: https://www.w3.org/International/wiki/IRIStatus. This page brings up the real issues with the standard. It seems like the effort to standardize IRIs in the same framework as URLs, URNs, and URIs fell apart in 2014. The effort was picked up by the HTML5 WHATWG, which has their own "living standard" called URL: http://url.spec.whatwg.org/. The URL standard focuses somewhat on parsing/processing/serializing international URLs, which is useful to us, but it is also *extremely* WWW-centric. It doesn't really take into account non-HTTP(S) URLs, especially ones that are not very web-like, like mailto or schemes where the authority field is not a domain name. Much of the spec focuses on things like how a web browser should represent URLs in the address bar and in text. This *probably* contributes to the lack of IRI-parsing libraries for various languages: there's no standard for them to implement! Given all that... maybe we should just consider our use cases and see what the minimum we have to do is? As I see it, the main requirement is that authors want to be able to use non-ASCII characters in both the domain part and the path part of the links in their documents, and have that work with no problems. IMO this is a *reasonable expectation* for a retrofuturistic protocol like Gemini. 
Now, what does that require of client authors and server authors? What is the *absolute minimum* we can require of client and server authors and have things work? -- +-----------------------------------------------------------+ | Jason F. McBrayer jmcbray at carcosa.net | | A flower falls, even though we love it; and a weed grows, | | even though we do not love it. -- Dogen |
It was thus said that the Great Jason McBrayer once stated:
>
> Now, what does that require of client authors and server authors?
>
> What is the *absolute minimum* we can require of client and server
> authors and have things work?

As I've stated, I've created an IRI parser per RFC-3987 [1] and it was a very minimal change to my original URL parser per RFC-3986 [2]. Basically, it allows any UTF-8 character past codepoint 128 to be used, as is, in the IRI. Languages that have URL parsers may or may not support UTF-8 data. So IRI parsing may or may not be an issue (aside from Unicode normalization) on a per-language basis.

I've also started down the punycode rabbit hole. As Stephane has stated, DNS *can* support UTF-8, but such support isn't wide, nor is it a standard. Punycode was developed to encode UTF-8 with ASCII in a most Byzantine way. It does have an RFC (RFC-3492) and said RFC does contain code for encoding and decoding punycode (but it's in C, and the API is ... not what I would define but it can be worked with).

IDN support, from my experience over the past two days, is *harder* than IRI, although the concern was mostly the other way. I haven't actually *gotten* to the part of converting a domain name to punycode but in general, to convert a domain name:

    for each label [3]:
      if label has non-ASCII characters
        convert to punycode, prepend "xn--" to result

so a domain name like "??.english.s?d?r.???" is converted thusly:

    ??      -> 99zt52a  -> xn--99zt52a
    english -> (no conversion required)
    s?d?r   -> sdr-rlad -> xn--sdr-rlad
    ???     -> wgbh1c   -> xn--wgbh1c

to become "xn--99zt52a.english.xn--sdr-rlad.xn--wgbh1c" (and that last segment is giving my editor fits because it's right-to-left). The example is extreme but it's just there to serve as an example of how to go about it.

So given my experiences so far, I would say the easiest way to deal with all this is to make it a client issue. Hold off on IDN support for now (see below for some more questions about it), but UTF-8 in the path and query should be allowed in text/gemini, but encoded before making a request. A client, given a link like:

=> gemini://gemini.bortzmeyer.org:8965/café?foo=bar Order from the Café

should be able to parse it with the UTF-8 characters, but convert it to:

    gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar

before making the request. At the very least, tools could be developed to encode links in text/gemini before publishing them if no one wants the spec to change at all. I feel that would be the easiest, least breaking, thing to do now.

Making IDN (punycode) mandatory might require a bit more discussion as I'm not sure of the language support. I'm not even sure what name should be in a certificate for an IDN---the full UTF-8 version, or the punycode version, or both? What's currently done in HTTP land about this? (answering this will at least point in a direction, even if we don't want to go that direction).

-spc

[1] https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2] https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[3] The domain name "gemini.conman.org" has three labels, "gemini", "conman" and "org". The term "label" is DNS lingo.
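The client-side conversion described above (accept UTF-8 in a link, percent-encode the path and query before sending the request) might look roughly like the following in Python. This is only a sketch: the function name is hypothetical, the host is deliberately left untouched (IDN handling is a separate question), and the choice of "safe" characters is an assumption rather than anything the Gemini spec says.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path_and_query(iri: str) -> str:
    """Percent-encode non-ASCII characters in the path and query only.

    Hypothetical helper: the host (netloc) and fragment are passed
    through unchanged, and '%' is treated as safe so that links which
    are already encoded are not encoded a second time.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%")      # quote() encodes str as UTF-8
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


link = "gemini://gemini.bortzmeyer.org:8965/caf\u00e9?foo=bar"
print(encode_path_and_query(link))
```

Because '%' is treated as safe, running the helper over an already-encoded link leaves it unchanged, which suits the "encode before publishing" tooling idea as well.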
On Tue, Dec 08, 2020 at 09:49:50PM -0500, Jason McBrayer <jmcbray at carcosa.net> wrote a message of 48 lines which said: [Thanks for the detailed analysis of the issue] > this is a *reasonable expectation* for a retrofuturistic protocol > like Gemini. I love "retrofuturistic" and I hope it will be used in the official specification.
On Wed, Dec 09, 2020 at 12:26:51AM -0500, Sean Conner <sean at conman.org> wrote a message of 73 lines which said: > DNS *can* support UTF-8, but such support isn't wide, nor is it a > standard. Wrong. DNS 2181, which clarifies that "any binary string whatever can be used as the label of any resource record" is part of the Standards Track. The reasons why few people use UTF-8 in domain names are:
Stephane Bortzmeyer <stephane at sources.org> writes: > On Wed, Dec 09, 2020 at 12:26:51AM -0500, > Sean Conner <sean at conman.org> wrote > a message of 73 lines which said: > >> DNS *can* support UTF-8, but such support isn't wide, nor is it a >> standard. > > Wrong. DNS 2181, which clarifies that "any binary string whatever can > be used as the label of any resource record" is part of the Standards > Track. The reasons why few people use UTF-8 in domain names are: How widespread is support in client resolver libraries and in servers, though? It's one thing to say, "yes, the standard is to support non-ASCII names", and another entirely to say "just sending non-ASCII names to your DNS server will work". > There is an implemention of Punycode in every standard library, > whatever your language. Not strictly true. There isn't one in Common Lisp's standard library, for example. There is one in Quicklisp, the widely used package repository though, so that's okay for me. I kind of feel like we should just bite the bullet and admit that a fully-compliant client needs to punycode domains when looking them up, and encode URLs when sending them to the server, and that fully-compliant servers need to decode URLs when resolving them. -- +-----------------------------------------------------------+ | Jason F. McBrayer jmcbray at carcosa.net | | A flower falls, even though we love it; and a weed grows, | | even though we do not love it. -- Dogen |
On Wed, Dec 09, 2020 at 10:08:25AM -0500, Jason McBrayer wrote: [punycode library] > There is one in Quicklisp, Glad to hear that! :) Bye! C.
> On Dec 9, 2020, at 09:38, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> There is an implemention of Punycode in every standard library,
> whatever your language.

There is also GNU Libidn (idn):

    # idn --quiet räksmörgås.se blåbærgrød.no
    xn--rksmrgs-5wao1o.se
    xn--blbrgrd-fxak7p.no

https://www.gnu.org/software/libidn/manual/html_node/Invoking-idn.html
It was thus said that the Great Stephane Bortzmeyer once stated:
> On Wed, Dec 09, 2020 at 12:26:51AM -0500,
>  Sean Conner <sean at conman.org> wrote
>  a message of 73 lines which said:
>
> > It does have an RFC (RFC-3492) and said RFC does contain code for
> > encoding and decoding punycode (but it's in C, and the API is
> > ... not what I would define but it can be worked with).
>
> There is an implemention of Punycode in every standard library,
> whatever your language.
>
> > so a domain name like "??.english.s?d?r.???" is converted thusly:
>
> In Python (but it is as simple in any other language):
>
> >>> print(codecs.encode("??.English.s?d?r.???", encoding="idna"))
> b'xn--99zt52a.English.xn--sdr-rlad.xn--wgbh1c'
>
> (Note that the encodings.idna library of Python standard library is
> limited to IDN v1.)
>
> So, almost nothing to do for the programmer. I don't agree with your
> assessment that IDN is simpler than IRI.

I'm sorry, but the two languages I work in do *not* have an implementation of punycode in their standard library. I *was* able to find code for C (from the RFC, which at least I know will work per the RFC) and could not find one for Lua. There's a reason why I'm having to muck with this. The API I have for C is *not* set up to handle domain names (breaking out the labels, prepending or removing the "xn--", etc.).

It's wonderful that the language you use comes with punycode support in its standard library. Not all languages have that. I'm looking at the list of clients [1] and there's one client written in a language I haven't heard of before (Vala). Other languages used are Nim, Scheme and Tcl. I would be surprised if Vala or Nim have a punycode implementation.

-spc (But hey, write your own client that does everything you want and show us all how easy it is)

[1] gemini://gemini.circumlunar.space/software/
> I'm sorry, but the two languages I work in do *not* have an implementation
> of punycode in their standard library.

Isn't this somewhat irrelevant in this case? It's unfortunate they don't have an implementation, but as it stands right now, most Gemini clients will not handle Unicode domain names at all. Punycoding domains will solve that issue, and languages that don't have it in their stdlib can either use a third-party library, or if that's not possible then those languages will just ignore punycoding entirely. Obviously that's not great for those languages, but it doesn't make sense to me to not have punycoding at all because of that.

At the end of the day, Gemini clients must be allowed to support Unicode domains. Perhaps the term "SHOULD" as defined by RFC2119 should be used in the spec in this case.

makeworld
> On Dec 9, 2020, at 23:17, Sean Conner <sean at conman.org> wrote: > > could not find one for Lua Would this suit? https://github.com/haste/lua-idn/blob/master/idn.lua
It was thus said that the Great Petite Abeille once stated:
>
> > On Dec 9, 2020, at 23:17, Sean Conner <sean at conman.org> wrote:
> >
> > could not find one for Lua
>
> Would this suit?
>
> https://github.com/haste/lua-idn/blob/master/idn.lua

Huh ... turns out I should have searched for 'IDN' instead of 'punycode'. The code is for Lua 5.2---it will take a bit of work to get it working for Lua 5.3 but that still leaves normalization issues. I'm working with the GNU libidn right now because it can do normalization, otherwise, I can get two different IDNs for the same (visually) domain:

    résumé  resume-jxde
    résumé  rsum-bpad

-spc
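The normalization problem above can be reproduced with Python's standard library. The sketch below is illustrative only: the helper name is hypothetical, and NFC is used as a stand-in for "some normalization". IDNA proper prescribes its own preparation steps, which this does not implement.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # each e-acute is one precomposed code point
decomposed = "re\u0301sume\u0301"    # plain 'e' followed by U+0301 COMBINING ACUTE


def normalized_punycode(label: str) -> str:
    """Normalize to NFC, then punycode-encode a single label (no "xn--")."""
    nfc = unicodedata.normalize("NFC", label)
    return nfc.encode("punycode").decode("ascii")


# Without normalization the two visually identical labels encode differently.
print(composed.encode("punycode"), decomposed.encode("punycode"))

# After normalization both give the same result.
print(normalized_punycode(composed), normalized_punycode(decomposed))
```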
> On Dec 10, 2020, at 00:23, Sean Conner <sean at conman.org> wrote:
>
> résumé  resume-jxde
> résumé  rsum-bpad

Talking of which:

Lua Parser for Punycode/IDN Homograph Attack
https://community.rsa.com/community/products/netwitness/blog/2017/04/24/lua-parser-for-punycodeidn-homograph-attack
The lua wiki has various libs for normalization: http://lua-users.org/wiki/LuaUnicode I also mentioned libicu earlier, which is a very commonly used lib for unicode handling, and the wiki even mentions a lua library that provides bindings to libicu (although the bindings may be out of date). https://github.com/unicode-org/icu Hope that helps
> On Dec 10, 2020, at 01:01, William Orr <will at worrbase.com> wrote:
>
> The lua wiki has various libs for normalization: http://lua-users.org/wiki/LuaUnicode

Indeed, ustring sports some normalization routines:

https://github.com/wikimedia/mediawiki-extensions-Scribunto/tree/master/includes/engines/LuaCommon/lualib/ustring

While on the topic, do people validate all UTF-8 coming their way? Ala iconv -f UTF-8 -t UTF-8?
It was thus said that the Great colecmac at protonmail.com once stated: > > I'm sorry, but the two languages I work in donot have an implementation > > of punycode in their standard library. > > Isn't this somewhat irrelevant in this case? It was more a comment about this quote: > There is an implemention of Punycode in every standard library, > whatever your language. There is *not* an implementation of punycode in every standard library, whatever your language. In a lot of currently in-use languages? Probably, but not *all*. > It's unfortunate they don't > have an implementation, but as it stands right now, most Gemini clients > will not handle Unicode domain names at all. And I'm not seeing anyone else trying to update clients to do this, even if in an exploratory nature. What? Are they just waiting for a decree? > Punycoding domains will > solve that issue, and languages that don't have it in their stdlib > can either use a third-party library, or if that's not possible then > those languages will just ignore punycoding entirely. Poking around the GNU libidn documentation, I found Appendix B [1] worrisome because the IDN rabbit hole just got deeper with U+2024 (ONE DOT LEADER) and U+2485 (DIGIT FIVE FULL STOP). Should I worry about it? I don't know. This internationalization stuff is complex and makes me want to throw up hands in the air, scream a bit, and go back to the simplicity of ASCII. In the end, I'll probably just do Unicode normalization, then punycode and call it a day. > Obviously that's > not great for those languages, but it doesn't make sense to me to not > have punycoding at all because of that. > > At the end of the day, Gemini clients must be allowed to support Unicode > domains. Perhaps the term "SHOULD" as defined by RFC2119 should be used > in the spec in this case. So when are you going to update gemget and Amfora to support punycode? I've heard it's easy to do. Or are you waiting for a spec change first? 
-spc (Seriously, I feel like I'm the only one *doing* anything here) [1] https://www.gnu.org/software/libidn/manual/html_node/On-Label-Separators.html
> > It's unfortunate they don't > > have an implementation, but as it stands right now, most Gemini clients > > will not handle Unicode domain names at all. > > And I'm not seeing anyone else trying to update clients to do this, even > if in an exploratory nature. What? Are they just waiting for a decree? > > [snip] > > So when are you going to update gemget and Amfora to support punycode? > I've heard it's easy to do. Or are you waiting for a spec change first? > > -spc (Seriously, I feel like I'm the only one doing anything here) Yes, I am waiting for Solderpunk. On the subject of IDNs, it seems obvious to me that punycode will accepted as the thing to do for DNS, and that Unicode should be sent to the server, but some questions around certs and normalization still remain. I've outlined them here[1]. I suppose you're correct about being the only one doing anything, but I don't feel like it makes sense to do anything yet. The solution is simple code-wise (for IDNs), and so I don't feel the need to experiment, and I'd rather implement this once, in-line with the spec, rather than multiple times if Solderpunk says something different. I guess this is just two different approaches to handling issues with a spec. Gemini is intended to be very strict and not extensible, and is driven by its spec rather than what people end up doing in the wild, like on the Web. I don't think you're doing something wrong or bad by publicly experimenting, but I'd rather not make things more uncertain by implementing something in non-toy/demo clients before it's official. 1: https://github.com/makeworld-the-better-one/go-gemini/issues/10 Cheers, makeworld
It was thus said that the Great colecmac at protonmail.com once stated:
> > > It's unfortunate they don't
> > > have an implementation, but as it stands right now, most Gemini clients
> > > will not handle Unicode domain names at all.
> >
> > And I'm not seeing anyone else trying to update clients to do this, even
> > if in an exploratory nature. What? Are they just waiting for a decree?
> >
> > [snip]
> >
> > So when are you going to update gemget and Amfora to support punycode?
> > I've heard it's easy to do. Or are you waiting for a spec change first?
> >
> > -spc (Seriously, I feel like I'm the only one doing anything here)
>
> Yes, I am waiting for Solderpunk. On the subject of IDNs, it seems obvious
> to me that punycode will accepted as the thing to do for DNS,

And why would Solderpunk choose this if no one has bothered to even look into the possible issues with respect to coding?

> and that
> Unicode should be sent to the server,

And that would be a breaking change to the protocol. Just having the client accept IRIs and send URIs wouldn't change the protocol. But aside from me, NO ONE bothered to even test this out!

> but some questions around certs
> and normalization still remain. I've outlined them here[1].
>
> I suppose you're correct about being the only one doing anything, but I
> don't feel like it makes sense to do anything yet. The solution is simple
> code-wise (for IDNs), and so I don't feel the need to experiment,

And if it's so simple, why not do it? But I get it, you'd rather wait until a yea/nay decision is made. I mean, who wouldn't like

* single digit response codes
* no client certificates
* a link line of [text|url]
* no virtual hosting
* a request format ala gopher (including TABs!)
* no redirection
* no indication of whether pages are actually gone vs not found
* no MIME parameters

That's pretty much what the Gemini spec *would have been* had some people
I've been following along with my own software in the background.

First of all, my domain registrar won't even let me put unicode characters in an A record without automatically converting them to punycode for me.

    café.mozz.us -> xn--caf-dma.mozz.us

Next, my naive python test client just kind of works as-is [0][1]. It will convert unicode DNS names to punycode under the hood before doing the lookup. Any unicode in the URL (IRI?) is left alone because.. why would a client ever muck around with the URL that the user gives them? That sounds like a bad idea to me.

My server (running jetforce) also works as-is. All I had to do was add an entry for "café.mozz.us" as a recognized hostname, and there you go.

```
jetforce-client gemini://café.mozz.us

Welcome to AV-98!

Enjoy your patrol through Geminispace...

?? WELCOME TO MOZZ.US ??
```

Requesting unicode path names also works with no changes on my part

```
jetforce-client gemini://café.mozz.us/files/𝒻𝒶𝓃𝒸𝓎.txt

20 text/plain
This is a test file with unicode characters in the name.?
```

As do quoted path names (the server will unquote the URL before it attempts to load the file)

```
jetforce-client gemini://café.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

20 text/plain
This is a test file with unicode characters in the name.
```

Does this mean my server is already compliant? What else should I try?

- Michael

[0] https://github.com/michael-lazar/jetforce/blob/master/jetforce_client.py

[1] It's nice to finally get a win for python after fighting with TLS for so long
It was thus said that the Great Michael Lazar once stated:
> I've been following along with my own software in the background.

Thank you. Without an implementation it is difficult to see where the landmines are. So, with that said ...

> First of all, my domain registrar won't even let me put unicode characters
> in an A record without automatically converting them to punycode for me.
>
> café.mozz.us -> xn--caf-dma.mozz.us

Okay.

> Next, my naive python test client just kind of works as-is [0][1]. It will
> convert unicode DNS names to punycode under the hood before doing the lookup.
> Any unicode in the URL (IRI?) is left alone because.. why would a
> client ever muck
> around with the URL that the user gives them? That sounds like a bad idea to
> me.

That's debatable. The percent encoding doesn't change the meaning, just the "envelope" so-to-speak.

> My server (running jetforce) also works as-is. All I had to do was add an entry
> for "café.mozz.us" as a recognized hostname, and there you go.

Okay, about that. I modified my own stupid-simple client to support IRIs and to convert the hostname via punycode (finally!). The code changes in the client weren't that large (once I got the punycode module written, it was one line to switch from URI parsing to IRI parsing, one line to add the punycode module, and one line modified to punycode the host when making a connection) but I'm encountering an issue. If I use:

    gemini://café.mozz.us/files/𝒻𝒶𝓃𝒸𝓎.txt

(and send that as the request) it works, and I get the file. But when I go to:

    gemini://xn--caf-dma.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

(and send that as the request) I get an error 53 (no proxy allowed). When I go to:

    gemini://café.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

(and send that as the request) it works as well. I would expect the second example to work along with the first and third examples. They all reference the same resource on the same server.

Another issue that I've thought of is the length of each request---the first is 53 bytes, the second is 99 bytes and the third is 93 bytes. This *could* be an issue with respect to the overall limit of 1024 bytes for a request.

As far as servers go, GLV-1.12556 still uses the URL parser, and would choke on an IRI being given as a request (since it expects non-ASCII characters to be encoded per RFC-3986). That would be an easy fix for me (just switch to the IRI parser) but allowing IRIs would be an actual change to the protocol. I'm just saying.

> Does this mean my server is already compliant? What else should I try?

Perhaps allow "xn--caf-dma.mozz.us" as a hostname?

-spc

> [0] https://github.com/michael-lazar/jetforce/blob/master/jetforce_client.py
> [1] It's nice to finally get a win for python after fighting with TLS
> for so long
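The three request sizes quoted above (53, 99 and 93 bytes) can be checked with a few lines of Python. This is a sketch under two assumptions: the file name is reconstructed by decoding the percent-encoded form given in the thread, and the counts exclude the trailing CRLF. The names `request_size` and `MAX_REQUEST_BYTES` are made up for illustration.

```python
from urllib.parse import quote

MAX_REQUEST_BYTES = 1024  # the overall request limit mentioned in the thread

# File name decoded from %F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E
name = "\U0001d4bb\U0001d4b6\U0001d4c3\U0001d4b8\U0001d4ce"

requests = [
    f"gemini://caf\u00e9.mozz.us/files/{name}.txt",            # raw IRI
    f"gemini://xn--caf-dma.mozz.us/files/{quote(name)}.txt",   # punycode host, encoded path
    f"gemini://caf\u00e9.mozz.us/files/{quote(name)}.txt",     # UTF-8 host, encoded path
]


def request_size(url: str) -> int:
    """Size of the URL on the wire in bytes (UTF-8), excluding CRLF."""
    return len(url.encode("utf-8"))


for url in requests:
    size = request_size(url)
    print(size, "ok" if size <= MAX_REQUEST_BYTES else "too long")
```

Each four-byte character in the path grows to twelve bytes once percent-encoded, which is why a link that fits as an IRI may no longer fit after encoding.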
> On Dec 10, 2020, at 02:16, Sean Conner <sean at conman.org> wrote:
>
> -spc (Seriously, I feel like I'm the only one *doing* anything here)

:))

To be is to do. -Socrates
To do is to be. -Plato
Do-be-do-be-do. -Sinatra

https://quoteinvestigator.com/2015/01/26/doing/
> On Dec 10, 2020, at 05:16, Michael Lazar <lazar.michael22 at gmail.com> wrote:
>
> Next, my naive python test client just kind of works as-is [0][1]. It will
> convert unicode DNS names to punycode under the hood before doing the lookup.

Perhaps of interest:

How do I know when to do a UTF8 or punycode DNS query?
https://stackoverflow.com/questions/16837513/how-do-i-know-when-to-do-a-utf8-or-punycode-dns-query
It was thus said that the Great Petite Abeille once stated:
>
> > On Dec 10, 2020, at 02:16, Sean Conner <sean at conman.org> wrote:
> >
> > -spc (Seriously, I feel like I'm the only one *doing* anything here)
>
> :))
>
> To be is to do. -Socrates
> To do is to be. -Plato
> Do-be-do-be-do. -Sinatra

You forgot:

Do-do-do-do. -Serling

-spc (There is a fifth dimension, beyond that which is known to man ...)
> On Dec 10, 2020, at 09:22, Sean Conner <sean at conman.org> wrote:
>
> Do-do-do-do. -Serling

"It may be said with a degree of assurance that not everything that meets the eye is as it appears."
- Rod Serling, The Twilight Zone
On Wed, Dec 09, 2020 at 08:16:34PM -0500, Sean Conner <sean at conman.org> wrote a message of 51 lines which said: > This internationalization stuff is complex and makes me want to > throw up hands in the air, scream a bit, and go back to the > simplicity of ASCII. ASCII is not simple (think of case-insensitivity) and then only for people whose latin is the first script they learned.
On Wed, Dec 09, 2020 at 11:16:30PM -0500, Michael Lazar <lazar.michael22 at gmail.com> wrote a message of 55 lines which said: > First of all, my domain registrar won't even let me put unicode > characters in an A record without automatically converting them to > punycode for me. Small detail: it is an issue with the DNS *hoster*. Which may be the same as the registrar or not (if you host your own authoritative name servers). > My server (running jetforce) also works as-is. But not Gemserv (I have to figure out why).
On Wed, Dec 09, 2020 at 10:25:40PM -0500, Sean Conner <sean at conman.org> wrote a message of 53 lines which said: > And if it's so simple, why not do it? But I get it, you'd rather wait > until a yeah/nay decision is made. It is reasonable to discuss it first because we need a *standard* way of doing it. Clients and servers must agree or there will be no interoperability. Also, I suspect the problem is partially a social one: programs are written by programmers. Most programmers are familiar with english and with the latin script. Therefore, the issue does not seem pressing for most of them.
Hi > > This internationalization stuff is complex and makes me want to > > throw up hands in the air, scream a bit, and go back to the > > simplicity of ASCII. > > ASCII is not simple (think of case-insensitivity) and then only for > people whose latin is the first script they learned. I am struggling to take that statement seriously, and not just because it breaks set theory :-) Case conversion in ascii is xor 0x20 - that doesn't even require a branch/comparison and can compile down to a single assembly instruction. This versus *many* tens or even hundreds of thousands of lines of puny/unicode/etc logic. But lets assume upper/lowercase characters in ascii are confusing. That would be an argument to restrict a simple system such as gemini urls to a subset of ascii which excludes uppercase characters. Which I could support, and which is effectively what dns ends up doing - as do the majority of http urls. "Lowest common denominator for maximum interoperability" is a good maxim. If ascii case conversion is confusing, then this isn't an excuse to grow this confusion by many orders of magnitude. That makes the problem a lot worse. "Oops, I've burnt my toast - I know, lets solve that by burning down the house" regards marc
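For reference, the XOR trick mentioned above toggles case only when the input is already known to be one of the 52 ASCII letters; applied blindly it also changes other characters (for example '@' becomes '`'). A small Python sketch, with a hypothetical function name, that guards against that:

```python
def flip_ascii_case(ch: str) -> str:
    """Toggle the case of one ASCII letter; return anything else unchanged."""
    if ch.isascii() and ch.isalpha():
        # Upper- and lower-case ASCII letters differ only in bit 0x20.
        return chr(ord(ch) ^ 0x20)
    return ch


print(flip_ascii_case("A"), flip_ascii_case("z"), flip_ascii_case("@"))
```

The guard is the branch that the bare XOR avoids; it can be dropped only where the caller guarantees letters.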
Petite Abeille <petite.abeille at gmail.com> writes:

>> On Dec 10, 2020, at 05:16, Michael Lazar <lazar.michael22 at gmail.com> wrote:
>>
>> Next, my naive python test client just kind of works as-is [0][1]. It will
>> convert unicode DNS names to punycode under the hood before doing the lookup.
>
> Perhaps of interest:
>
> How do I know when to do a UTF8 or punycode DNS query?
> https://stackoverflow.com/questions/16837513/how-do-i-know-when-to-do-a-utf8-or-punycode-dns-query

You can unconditionally just run the punycode encoder over domain names, though; an all-ASCII domain name will be unchanged. The stackoverflow question deals with non-internet domain names in Active Directory, which we don't have to support.

--
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |
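A per-label conversion along these lines could be sketched in Python as follows. The function name is hypothetical and normalization is deliberately left out. Note one wrinkle: Python's raw "punycode" codec appends a trailing "-" to all-ASCII input, so this sketch skips ASCII labels explicitly, whereas the higher-level "idna" codec handles that case itself.

```python
def host_to_ascii(host: str) -> str:
    """Convert each non-ASCII label of a host name to its "xn--" form.

    Illustrative only: no normalization, no length checks, and no
    handling of alternative label separators.
    """
    out = []
    for label in host.split("."):
        if label.isascii():
            out.append(label)  # nothing to convert
        else:
            out.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(out)


print(host_to_ascii("caf\u00e9.mozz.us"))
print(host_to_ascii("gemini.conman.org"))
```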
> On Dec 10, 2020, at 15:20, Jason McBrayer <jmcbray at carcosa.net> wrote: > > non-internet domain names in Active Directory, which > we don't have to support. Hmmm... so... no .local queries ala Cheshire? https://tools.ietf.org/html/rfc6762#appendix-F Perhaps worthwhile quoting in full: Appendix F. Use of UTF-8 After many years of debate, as a result of the perceived need to accommodate certain DNS implementations that apparently couldn't handle any character that's not a letter, digit, or hyphen (and apparently never would be updated to remedy this limitation), the Unicast DNS community settled on an extremely baroque encoding called "Punycode". Punycode is a remarkably ingenious encoding solution, but it is complicated, hard to understand, and hard to implement, using sophisticated techniques including insertion unsort coding, generalized variable-length integers, and bias adaptation. The resulting encoding is remarkably compact given the constraints, but it's still not as good as simple straightforward UTF-8, and it's hard even to predict whether a given input string will encode to a Punycode string that fits within DNS's 63-byte limit, except by simply trying the encoding and seeing whether it fits. Indeed, the encoded size depends not only on the input characters, but on the order they appear, so the same set of characters may or may not encode to a legal Punycode string that fits within DNS's 63-byte limit, depending on the order the characters appear. This is extremely hard to present in a user interface that explains to users why one name is allowed, but another name containing the exact same characters is not. Neither Punycode nor any other of the "ASCII- Compatible Encodings" proposed for Unicast DNS may be used in Multicast DNS messages. Any text being represented internally in some other representation must be converted to canonical precomposed UTF-8 before being placed in any Multicast DNS message.
On Thu, Dec 10, 2020 at 10:15:42AM +0100, Stephane Bortzmeyer wrote: Hi folks! > On Wed, Dec 09, 2020 at 10:25:40PM -0500, > Sean Conner <sean at conman.org> wrote > a message of 53 lines which said: > > > And if it's so simple, why not do it? But I get it, you'd rather wait > > until a yeah/nay decision is made. > > It is reasonable to discuss it first because we need a *standard* way > of doing it. Clients and servers must agree or there will be no > interoperability. I have read most of the messages in this thread, i would just say that one of the problems with WWW is that browser are getting not manageable by a single user or a hobbyist programmer. This issue lead to centralization (see chromium and the company behind) as most of us can easily see. Adding more complexity and more responsibility to software authors will shrink diversity in the gemini software landscape. Of course i talk here as a client author here but in the niche language i chose (common lisp) i was forced to write my URL parsing procedure, adding i18n domains will require a lot of work because the third party library (the only one) Jason McBrayer wrote some message above does not implement punycode->unicode conversion (if i checked the right library, thanks to the author anyway, better than nothing!). Probably many of you just thinking "who cares about CL?", and probably this is the mindset that lead to the mess it is now the web. Internationalized hostname has advantages but how this adding complexity impact software author? Is this complexity needed? I have no answer, just would want to express some of my concerns. Bye! C. PS: i am not a native English speaker (as you can see :-) )
> On Dec 10, 2020, at 16:54, cage <cage-dev at twistfold.it> wrote:
>
> Internationalized hostname has advantages but how this adding
> complexity impact software author? Is this complexity needed?

Ah, yes, le charme discret du régionalisme.

It all boils down to the unmeasurable joy of Unicode ?

> I have no answer, just would want to express some of my concerns.

Archibald 'Harry' Tuttle had about 3 lines in Terry Gilliam's Brazil:

* Well, that's a pipe of a different color.
* Listen, this whole system of yours could be on fire and I couldn't even turn on the kitchen tap without filling out a twenty-seven B stroke six... bloody paperwork.
* Listen, kid, we're all in it together.

This sums it up in terms of retrofitting Unicode into ASCII:

https://www.youtube.com/watch?v=VRfoIyx8KfU

Perhaps Unicode should be abandoned altogether, and we all move back to the original simplicity of ASCII.
On Thu, Dec 10, 2020 at 05:27:05PM +0100, Petite Abeille wrote: > > > > On Dec 10, 2020, at 16:54, cage <cage-dev at twistfold.it> wrote: > > > > Internationalized hostname has advantages but how this adding > > complexity impact software author? Is this complexity needed? > > Ah, yes, le charme discret du régionalisme. > > It all boils down to the unmeasurable joy of Unicode ? > > > I have no answer, just would want to express some of my concerns. > > Archibald 'Harry' Tuttle had about 3 lines in Terry Gilliam's Brazil: One of my favourite movies! I love the "retrò style" terminal shown in the office! :) Bye! C.
On Thu, Dec 10, 2020 at 10:55 AM cage <cage-dev at twistfold.it> wrote: > I have read most of the messages in this thread, i would just say that > one of the problems with WWW is that browser are getting not > manageable by a single user or a hobbyist programmer. > This is not strictly a Web issue: it is a DNS issue and affects all protocols, including FTP, email, Gopher, etc. > Probably many of you just thinking "who cares about CL?" > As a Schemer, I definitely do care about it. > Internationalized hostname has advantages but how this adding > complexity impact software author? Is this complexity needed? > It's a balancing act between the needs of software authors and the needs of content authors. If Gemini succeeds, the latter will be much more common. I think that internationalized link lines (which will have to become part of the definition of text/gemini) are very important to authors. Whether clients accept IRIs in the address bar (or equivalent) is up to the client author. And I very strongly feel that changing the Gemini *protocol* to pass IRIs serves nobody and shouldn't even be considered. > PS: i am not a native English speaker (as you can see :-) ) > Good. As the saying is, if you want to know if there is antisemitism in a particular place, ask the Jews who live there. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org I now introduce Professor Smullyan, who will prove to you that either he doesn't exist or you don't exist, but you won't know which. --Melvin Fitting
On Thu, Dec 10, 2020 at 11:27 AM Petite Abeille <petite.abeille at gmail.com> wrote: > It all boils down to the unmeasurable joy of Unicode ? > As someone who is intimately familiar with i18n in the pre-Unicode era, I can say that things were 100 times worse then. Unicode is flawed because it had to compromise with existing encodings, which is why we need normalization. But without that compromise (which permits 1-1 convertibility from almost all encodings to and from Unicode), it would never have been so widely adopted. > Perhaps Unicode should be abandoned altogether, and we all move back to to > the original simplicity of ASCII. Perhaps we should abandon all modern languages and just use Latin on the Internet. John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org Gules six bars argent on a canton azure 50 mullets argent six five six five six five six five and six --blazoning the U.S. flag <http://web.meson.org/blazonserver>
It was thus said that the Great Stephane Bortzmeyer once stated: > On Wed, Dec 09, 2020 at 10:25:40PM -0500, > Sean Conner <sean at conman.org> wrote > a message of 53 lines which said: > > > And if it's so simple, why not do it? But I get it, you'd rather wait > > until a yeah/nay decision is made. > > It is reasonable to discuss it first because we need a *standard* way > of doing it. Clients and servers must agree or there will be no > interoperability. Okay, here's an IRI: gemini://café.mozz.us/files/?????.txt Please specify what a client and server MUST do to properly handle this. -spc
Sean Conner <sean at conman.org> writes:

> Okay, Here's a IRI:
>
> gemini://café.mozz.us/files/?????.txt
>
> Please specify what a client and server MUST do to properly handle this.

Well, if I'm following all of these conversations correctly to date, I believe the procedure looks like this:

1. Punycode the hostname.
2. Percent-encode reserved characters and non-US-ASCII characters in the path, query, and fragment components.
3. Make a DNS query with the punycoded hostname.
4. Send the punycode + percent-encoded URI as the request to the Gemini server.
5. The server parses the URI into scheme, host, port, path, query, and fragment components and then percent-decodes the path, query, and fragment strings.
6. The parsed and decoded URI information can then either be used to perform a file retrieval, generate a directory listing, or run a CGI script, ultimately sending back a valid Gemini response to the client. Redirect responses should make sure to percent-encode the path, query, and fragment components of the redirected URI.

My Gemini server (Space Age) handles steps 5 and 6 as described here (as I suspect most Gemini servers do). Clients should already be performing step 2 as per the Gemini spec. I suspect the missing piece of the puzzle here is *just* having client authors implement steps 1, 3, and 4 (for some definition of "just"). I don't think these client changes would require any changes to the current Gemini spec.

There is also the open question of whether servers should convert punycoded hostnames back into unicode hostnames for the purposes of virtual hosting (either via SNI or post-handshake). Since at least one poster has indicated that the widespread unevenness in DNS support for unicode has led to the need to store A records in their punycoded form, this suggests to me that virtual hosting may be performed most universally by just directly matching the received punycoded domain names. Of course, YMMV.
Happy hacking, Gary -- GPG Key ID: 7BC158ED Use `gpg --search-keys lambdatronic' to find me Protect yourself from surveillance: https://emailselfdefense.fsf.org ======================================================================= () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments Why is HTML email a security nightmare? See https://useplaintext.email/ Please avoid sending me MS-Office attachments. See http://www.gnu.org/philosophy/no-word-attachments.html
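Editorial sketch of client steps 1, 2 and 4 from the list above, in Python. It assumes the standard library's "idna" codec (IDNA 2003) is good enough for the hostname, ignores IPv6 literals, and keeps the fragment, as this message proposes (later replies in the thread argue it should be dropped). The names `iri_to_uri` and `gemini_request` are made up for illustration and do not come from any Gemini client.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: delimiters that are legal inside a path or
# query, plus "%" so that escapes already present are not encoded twice.
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Turn an absolute IRI into an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)
    if parts.hostname is None:
        raise ValueError("a gemini URI needs a host")
    # Step 1: punycode the hostname, label by label.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Step 2: percent-encode spaces and non-ASCII characters as UTF-8.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


def gemini_request(iri: str) -> bytes:
    """Step 4: the request line is the URI followed by CRLF."""
    return iri_to_uri(iri).encode("ascii") + b"\r\n"
```

Step 3 is left out on purpose: a client would normally hand the host to the system resolver rather than build DNS queries itself.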
On Thu, Dec 10, 2020 at 8:12 PM Gary Johnson <lambdatronic at disroot.org> wrote:

> 1. Punycode the hostname.

If there is one. You can look for "//" on the left and the next "/" on the right, so you don't need full parsing.

> 2. Percent-encode reserved characters and non-US-ASCII characters in the path, query, and fragment components.

You don't want to escape the ASCII reserved characters, because they should already be escaped. Changing the path /foo/bar.gmi to %2Ffoo%2Fbar.gmi would be Evil and Wrong. If you really want that path, you have to encode it yourself. In addition, you can safely %-encode the whole IRI reference without parsing it, since Punycode names are always safe.

2.5. If the IRI is a relative reference, resolve it against the URI of the text/gemini file that contains it.

> 3. Make a DNS query with the punycoded hostname.
>
> 4. Send the punycode + percent-encoded URI as the request to the Gemini server.

Note that fragments must not be sent, so if there is one, chop it off.

> 5. The server parses the URI into scheme, host, port, path, query, and fragment components and then percent-decodes the path, query, and fragment strings.

Consequently, the server will not get a fragment string. There would be no need for fragment strings if they were understood on the server side; they'd just be part of the path. Whether it %-decodes or not is up to the server. If it's serving a conventional file system, then it needs to document whether it does such decoding. If it isn't, it can do whatever it wants to with the paths.

> 6. The parsed and decoded URI information can then either be used to perform a file retrieval, generate a directory listing, or run a CGI script, ultimately sending back a valid Gemini response to the client. Redirect responses should make sure to percent-encode the path, query, and fragment components of the redirected URI.

Except not the fragment.

> Since at least one poster has indicated that the widespread unevenness in DNS support for unicode has lead to the need to store A records in their punycoded form,

Indeed, I don't think that any registrar using the standard DNS root will even register non-punycoded names. MS Active Directory DNS servers are another story.

John Cowan http://vrici.lojban.org/~cowan cowan at ccil.org
This great college [Trinity], of this ancient university [Cambridge], has seen some strange sights. It has seen Wordsworth drunk and Porson sober. And here am I, a better poet than Porson, and a better scholar than Wordsworth, somewhere betwixt and between. --A.E. Housman
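Editorial sketch of step 2.5 and the fragment remark in this message, in Python. One assumption needs flagging: `urllib.parse.urljoin` only resolves relative references for schemes it knows, so the sketch registers "gemini" in the module's `uses_relative` and `uses_netloc` lists, which is a common workaround rather than a documented interface. `resolve_link` is a hypothetical name.

```python
import urllib.parse
from urllib.parse import urldefrag, urljoin

# Workaround (assumption): teach urljoin that gemini URIs are hierarchical.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


def resolve_link(base_uri: str, reference: str) -> str:
    """Resolve a link from a text/gemini page and drop any fragment."""
    absolute = urljoin(base_uri, reference)
    # The fragment stays on the client side and is not sent in the request.
    return urldefrag(absolute).url
```

Whether fragments may be sent at all is disputed later in the thread; this follows the position taken in this message.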
> 4. Send the punycode + percent-encoded URI as the request to the Gemini > server. This probably makes sense since IRIs aren't being used. I originally advocated for sending Unicode to the server, for the domain only, but that's just a weird mix isn't it. Good server software should be taking the admin's hostname input (from config) and punycoding it though, so that the admin can enter the Unicode domain name and not have to worry. Obviously this is outside of the spec, but I think it's a good thing to implement. makeworld
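Editorial sketch of the server-side idea in this message, in Python: the admin writes Unicode hostnames in the configuration and the server converts them to Punycode once, at start-up, so that virtual hosting can match the hostname received in the request directly. The configuration shape (a hostname to document-root mapping) and both function names are assumptions for illustration only.

```python
def build_vhost_table(configured: dict[str, str]) -> dict[str, str]:
    """Map each configured hostname, in Punycode form, to its document root."""
    table = {}
    for name, root in configured.items():
        # Lower-case first: the idna codec leaves pure-ASCII labels as typed.
        a_label = name.lower().encode("idna").decode("ascii")
        table[a_label] = root
    return table


def lookup(table: dict[str, str], request_host: str):
    """Match the host from the request URI; returns None for unknown hosts."""
    return table.get(request_host.lower())
```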
Before percent-encoding/punycoding, the URI needs to be NFC normalized. As a matter of course, I'd say that servers should normalize the path before doing fs lookups/proxying it as well.
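Editorial sketch in Python of why the normalization step matters: the same visible path can be stored as precomposed or decomposed code points, and the two percent-encode differently unless they are normalized to NFC first. `normalize_and_encode` is a hypothetical helper name.

```python
import unicodedata
from urllib.parse import quote


def normalize_and_encode(path: str) -> str:
    """NFC-normalize a path, then percent-encode it as UTF-8."""
    return quote(unicodedata.normalize("NFC", path), safe="/%")
```

Without the normalization, "e" followed by U+0301 (combining acute accent) encodes as `e%CC%81`, while the precomposed U+00E9 encodes as `%C3%A9`, so a server comparing raw bytes would treat them as two different files.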
On Thu, Dec 10, 2020 at 04:54:52PM +0100, cage <cage-dev at twistfold.it> wrote a message of 44 lines which said:

> Adding more complexity and more responsibility to software authors will shrink diversity in the gemini software landscape.

The idea is that it will not be done by the programmer of the Gemini client (Unicode is complicated) but mostly by the libraries she or he uses. It is the same with TLS: TLS is very complicated but most people do not program it by themselves (and rightly so).

> Internationalized hostname has advantages but how this adding complexity impact software author? Is this complexity needed?

Since I write Gemini clients, I have sympathy for this point of view. However, let me quote RFC 8890:

=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc8890.txt

4.5. Deprioritizing Internal Needs

There are several needs that are very visible to us as specification authors but should explicitly not be prioritized over the needs of end users. These include convenience for document editors, IETF process matters, and "architectural purity" for its own sake.

=> https://www.w3.org/TR/html-design-principles/#priority-of-constituencies See also this statement by W3C
On Thu, Dec 10, 2020 at 04:09:19PM -0500, John Cowan <cowan at ccil.org> wrote a message of 76 lines which said:

> Unicode is flawed because it had to compromise with existing encodings,

And also because human scripts (and languages) are a mess and Unicode chose, a long time ago, to deal with it instead of whining "why can't they just speak English?"

> Perhaps we should abandon all modern languages and just use Latin on the Internet.

=> https://en.wikipedia.org/wiki/Lojban No, Lojban
On Thu, Dec 10, 2020 at 08:12:04PM -0500, Gary Johnson <lambdatronic at disroot.org> wrote a message of 69 lines which said: > 1. Punycode the hostname. Not always, for the reasons explained in RFC 6055. To summarize: the application does not always know which name resolution system will be used. gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6055.txt > 3. Make a DNS query with the punycoded hostname. Most applications don't do DNS queries, both because DNS is complicated and because there are other name resolution systems. They call a system routine (getaddrinfo() in C) to do the resolution. > There is also the open question of whether servers should convert > punycoded hostnames back into unicode hostnames for the purposes of > virtual hosting (either via SNI or post-handshake). Since at least one > poster has indicated that the widespread unevenness in DNS support for > unicode has lead to the need to store A records in their punycoded form, > this suggests to me that virtual hosting may be performed most > universally by just directly matching the received punycoded domain > names. This is what Apache and Nginx do in the Web world (which does not mean they are right).
On Thu, Dec 10, 2020 at 09:45:55PM -0500, John Cowan <cowan at ccil.org> wrote a message of 177 lines which said: > Indeed, I don't think that any registrar using the standard DNS root > will even register non-punycoded names. Counter-example about a similar case: the registry of .ws accepts names with emojis, which are forbidden by the standard (because they are symbols, not letters). So, anything can happen. => https://www.worldsite.ws/idn/emoji.dhtml?sponsor=index.dhtml And they boast about it (Also, not all name registration goes through a registrar: when I edit bortzmeyer.org, I can add what I want without any intermediary.)
On Fri, Dec 11, 2020 at 09:57:02AM +0100, Stephane Bortzmeyer wrote: Hi! > On Thu, Dec 10, 2020 at 04:54:52PM +0100, > cage <cage-dev at twistfold.it> wrote > a message of 44 lines which said: > > > Adding more complexity and more responsibility to software authors > > will shrink diversity in the gemini software landscape. > > The idea is that it will not be done by the programmer (Unicode is > complicated) of the Gemini client but mostly by the libraries she or > he uses. If such library does exists, otherwise more and more works is needed, and this will exclude the programmers that have no time or skills (to me very likely the former) to do the all the work. > It is the same with TLS: TLS is very complicated but most > people do not program it by themselves (and rightly so). This is matter of balance, the advantages of TLS are worth the complexity added, IDN? I am not sure. > > Internationalized hostname has advantages but how this adding > > complexity impact software author? Is this complexity needed? > > Since I write Gemini clients, I have sympathy for this point of > view. However, let me quote RFC 8890 OK the point is valid to me, for TLS, but not for IDN. Anyway i have the impression i am in a minority here, and i think i should start to do a minimal wrapping of libidn at this point. :-) Bye! C.
On Thu, Dec 10, 2020 at 03:02:44PM -0500, John Cowan wrote: Hi! > On Thu, Dec 10, 2020 at 10:55 AM cage <cage-dev at twistfold.it> wrote: > > > > I have read most of the messages in this thread, i would just say that > > one of the problems with WWW is that browser are getting not > > manageable by a single user or a hobbyist programmer. > > > > This is not strictly a Web issue: it is a DNS issue and affects all > protocols, including FTP, email, Gopher, etc. Correct, i meant that this is a client issue deriving from debatable choices. > > Probably many of you just thinking "who cares about CL?" > > > > As a Schemer, I definitely do care about it. [OT] Nice! Do yo have a preferred dialect? I like Guile a lot but i fear i end missing CLOS (GOOPS is not the same, unfortunately) and condition system. > > > Internationalized hostname has advantages but how this adding > > complexity impact software author? Is this complexity needed? > > > > It's a balancing act between the needs of software authors and the needs of > content authors. If Gemini succeeds, the latter will be much more common. > I think that internationalized link lines (which will have to become part > of the definition of text/gemini) are very important to authors. Whether > clients accept IRIs in the address bar (or equivalent) is up to the client > author. And I very strongly feel that changing the Gemini *protocol* to > pass IRIs serves nobody and shouldn't even be considered. I agree! My only concerns is i have the impression that client that will not supports IRI will be second class citizen in the gemini space, and they will die slowly. So i think that IRI will be a de facto standard. :/ Bye! C.
colecmac at protonmail.com writes: > Good server software should be taking the admin's hostname input (from config) > and punycoding it though, so that the admin can enter the Unicode domain name > and not have to worry. Obviously this is outside of the spec, but I think > it's a good thing to implement. To implement, *and* to document: if not in the spec, then in a 'best practices for implementers' document. -- Jason McBrayer | "Strange is the night where black stars rise, jmcbray at carcosa.net | and strange moons circle through the skies, | but stranger still is lost Carcosa." | Robert W. Chambers, The King in Yellow
On Fri, Dec 11, 2020 at 4:13 AM Stephane Bortzmeyer <stephane at sources.org> wrote: > Counter-example about a similar case: the registry of .ws accepts > names with emojis, which are forbidden by the standard (because they > are symbols, not letters). So, anything can happen. However, those names are still punycoded. Most registries will not accept punycoded emojis, but .ws does.
John Cowan <cowan at ccil.org> writes:

>> 2. Percent-encode reserved characters and non-US-ASCII characters in the path, query, and fragment components.
>
> You don't want to escape the ASCII reserved characters, because they should already be escaped. Changing the path /foo/bar.gmi to %2Ffoo%2Fbar.gmi would be Evil and Wrong. If you really want that path, you have to encode it yourself.

Yes, that is quite right. I suppose we are using a different interpretation of the phrase "reserved characters" here. For clarity, I meant characters such as those in the string " ?#", which are either forbidden (when unencoded) within the path, query, and fragment components or are used to delimit them.

> 2.5. If the IRI is a relative reference, resolve it against the URI of the text/gemini file that contains it.

Yep.

>> 4. Send the punycode + percent-encoded URI as the request to the Gemini server.
>
> Note that fragments must not be sent, so if there is one, chop it off.

I'm not sure that is the case here. To quote the Gemini spec:

========================================================================
1.2 Gemini URI scheme

Resources hosted via Gemini are identified using URIs with the scheme "gemini". This scheme is syntactically compatible with the generic URI syntax defined in RFC 3986, but does not support all components of the generic syntax. In particular, the authority component is allowed and required, but its userinfo subcomponent is NOT allowed. The host subcomponent is required. The port subcomponent is optional, with a default value of 1965. The path, query and fragment components are allowed and have no special meanings beyond those defined by the generic syntax. Spaces in gemini URIs should be encoded as %20, not +.
========================================================================

Please note the text about fragment components being allowed. I'm not currently aware of any good uses for them in Gemini, but the spec supports them, so I've included that support in my server.

>> 5. The server parses the URI into scheme, host, port, path, query, and fragment components and then percent-decodes the path, query, and fragment strings.
>
> Consequently, the server will not get a fragment string. There would be no need for fragment strings if they were understood on the server side; they'd just be part of the path.

See above.

>> 6. The parsed and decoded URI information can then either be used to perform a file retrieval, generate a directory listing, or run a CGI script, ultimately sending back a valid Gemini response to the client. Redirect responses should make sure to percent-encode the path, query, and fragment components of the redirected URI.
>
> Except not the fragment.

Again, see above. Yada yada...spec compliance...yada yada.

Happy hacking, Gary
On Friday, December 11, 2020 4:58 AM, cage <cage-dev at twistfold.it> wrote: > I agree! My only concerns is i have the impression that client that > will not supports IRI will be second class citizen in the gemini > space, and they will die slowly. > > So i think that IRI will be a de facto standard. :/ I do not think this will happen, and if it starts to happen I will fight against it. I don't think the idea of "de facto standards" fits within the Gemini ethos at all. It's not supposed to be extensible and clients aren't supposed to go off and do random things while others have to decide what to do and play catch-up. That is how the Web grew and became more complex, and it's why we have only a few browsers today. The ecosystem benefits when we all just stick to the standard, with the perhaps obvious exception of demos and toys. Stay united, Gemini! Cheers, makeworld
It was thus said that the Great Stephane Bortzmeyer once stated: > On Thu, Dec 10, 2020 at 08:12:04PM -0500, > Gary Johnson <lambdatronic at disroot.org> wrote > a message of 69 lines which said: > > > 1. Punycode the hostname. > > Not always, for the reasons explained in RFC 6055. To summarize: the > application does not always know which name resolution system will be used. Yes, this is why I hate the incessant talking. If you bothered to try it on a few systems, you may have encountered *not* encoding with punycode
It was thus said that the Great colecmac at protonmail.com once stated: > On Friday, December 11, 2020 4:58 AM, cage <cage-dev at twistfold.it> wrote: > > > I agree! My only concerns is i have the impression that client that > > will not supports IRI will be second class citizen in the gemini > > space, and they will die slowly. > > > > So i think that IRI will be a de facto standard. :/ > > I do not think this will happen, and if it starts to happen I will > fight against it. There's a reason why UTF-8 was selected as the default character set for text/gemini, and one of those is to allow other people than English speakers a means of expressing themselves [1]. I don't think it's entirely unreasonable to expect such a person to use Unicode for both domain name and filenames [2]. Yes, tooling could be made to handle "canonicalizing" links [3] but why not look into allowing IRIs? Without an attempt at it, it would be difficult to know what would work, what doesn't and where the difficulties, if any, lie. *That's* why I'm so insistent on coding up "proof-of-concepts". Just decreeing "this is how it shall be done" rarely works out well [4]. And decreeing "this shall NOT be done" could put off non-technical, non-English speaking people. [5] > I don't think the idea of "de facto standards" fits within the Gemini > ethos at all. It's not supposed to be extensible and clients aren't > supposed go off and do random things while others have to decide what to > do and play catch-up. That is how the Web grew and became more complex, > and it's why we have only a few browsers today. And this is working out if the specification should be amended to allow IRIs, and if not, at least have a justification. > The ecosystem benefits when we all just stick to the standard, with the > perhaps obvious exception of demos and toys. One more point of reference. The Gopher RFC (RFC-1436) states the use of ISO-8859-1 for a character set. Is it wrong, then, for gopher servers to serve up UTF-8 documents even though it's not standard? Yes, gopher is not Gemini, but UTF-8 does seem to be a modern "de facto standard" in gopherspace. -spc [1] For example, gemini://blekksprut.net/ [2] Otherwise, punycode wouldn't exist. [3] Conversion from IRI to URI, with Unicode normalization, prior to publication. [4] Such as X.200---lovingly developed and standardized but no one used it. Or Xanadu. Over 60 years of design work and still not working. [5] And I'm saying this being a thoroughly American mutt that speaks only English.
It was thus said that the Great cage once stated: > On Fri, Dec 11, 2020 at 09:57:02AM +0100, Stephane Bortzmeyer wrote: > > > Since I write Gemini clients, I have sympathy for this point of > > view. However, let me quote RFC 8890 > > OK the point is valid to me, for TLS, but not for IDN. Anyway i have > the impression i am in a minority here, and i think i should start to > do a minimal wrapping of libidn at this point. :-) Here's the code I wrote to wrap libidn: https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c It's geared for Lua, but the code itself is in C, but it should be pretty easy to see what is going on. -spc
On Fri, Dec 11, 2020 at 06:49:17PM -0500, Sean Conner wrote: [...] > > Here's the code I wrote to wrap libidn: > > https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c > > It's geared for Lua, but the code itself is in C, but it should be pretty > easy to see what is going on. Thank you Sean! I am, in fact starting to wrap libidn (actually libidn2) so far seems that i got a working unicode->ascii function. Of course i will be happy to share the results and maybe, if some people are going to be interested (and if i succeed! :)) i could extract a library from this code (it is integrated in the client at moment). Bye! C.
On Fri, Dec 11, 2020 at 08:49:45PM +0000, colecmac at protonmail.com wrote: [...] > > > > So i think that IRI will be a de facto standard. :/ > > > I do not think this will happen, and if it starts to happen I will > fight against it. I don't think the idea of "de facto standards" fits within > the Gemini ethos at all. It's not supposed to be extensible and clients > aren't supposed go off and do random things while others have to decide > what to do and play catch-up. That is how the Web grew and became more complex, > and it's why we have only a few browsers today. I am with you with this but i think the only way to let not developers do their own way is to clarify issues in the specs as much as possible. > The ecosystem benefits when we all just stick to the standard, with the perhaps > obvious exception of demos and toys. Totally agree of course! I am trying to do so! :) > Stay united, Gemini! :) ? Bye! C.
> On Dec 12, 2020, at 00:43, Sean Conner <sean at conman.org> wrote: > > There's a reason why UTF-8 was selected as the default character set for > text/gemini, and one of those is to allow other people than English speakers > a means of expressing themselves [1]. I don't think it's entirely > unreasonable to expect such a person to use Unicode for both domain name and > filenames [2] Yes, agree. People should be able to express themselves in the most idiomatic -and frictionless- way they see fit. It's a moral imperative -and duty- for Gemini to make it so. The year is 2020, no more easy and lazy excuses. This is not a technical choice, but a moral one. Timely article in The Economist: Accent discrimination betrays a small mind https://www.economist.com/books-and-arts/2020/12/12/accent-discrimination-betrays-a-small-mind History will judge you: be on the right side. Do the right thing, Solderpunk.
On Fri, 11 Dec 2020 18:14:47 -0500 Sean Conner <sean at conman.org> wrote: > I know, because I tried on a few systems I have access to, > and they all failed to look up "café.mozz.us" (yes, via getaddrinfo() even). > They all worked when I looked up "xn--caf-dma.mozz.us". I don't think Stephane means different systems in the sense of different computers, but different systems as in different name resolution systems. On my computer, for example, there are at least three sources for names: DNS, mDNS and /etc/hosts. getaddrinfo() can resolve using any of these systems via the name service switch, but IDN only concerns DNS names. café.mozz.us in your tests probably always resolved via DNS, but if you had a café.local mDNS name or "café" in your hosts file you might not be able to use IDN. I don't know about the hosts file, but mDNS for example uses UTF-8 encoded names directly. -- Philip
On Mon, Dec 14, 2020 at 11:46:49AM +0100, Philip Linde <linde.philip at gmail.com> wrote a message of 45 lines which said: > but different systems as in different name resolution systems. Yes. > café.mozz.us in your tests probably always resolved via DNS, but if > you had a café.local mDNS name or "café" in your hosts file you > might not be able to use IDN. I don't know about the hosts file, but > mDNS for example uses UTF-8 encoded names directly. I just tested with a Debian box and it seems getaddrinfo (both from a Python program and from a C one, ping) requires the name in /etc/hosts to be present in Punycode form (A-label). The good news is that the Python program does not have to do punycoding itself; it is handled automatically by the standard library.
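Editorial sketch of the Python behaviour mentioned in this message. The standard library ships an "idna" codec (IDNA 2003) that converts between the Unicode form and the Punycode A-label, and `socket.getaddrinfo` applies it when given a non-ASCII `str` host. The helper names are hypothetical, and `resolve` needs a working resolver, so it is shown but not exercised here.

```python
import socket


def to_a_label(hostname: str) -> str:
    """Unicode hostname -> Punycode form (what ends up in DNS)."""
    return hostname.encode("idna").decode("ascii")


def to_u_label(hostname: str) -> str:
    """Punycode hostname -> Unicode form (for display to the user)."""
    return hostname.encode("ascii").decode("idna")


def resolve(hostname: str, port: int = 1965):
    """Let the system resolver pick the name service (DNS, hosts file, ...)."""
    return socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
```

With this, `to_a_label` turns the hostname from the thread into "xn--caf-dma.mozz.us", the form reported to resolve successfully.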
On Sat, Dec 12, 2020 at 12:10:48PM +0100, cage wrote: > On Fri, Dec 11, 2020 at 06:49:17PM -0500, Sean Conner wrote: > > [...] > > > > > Here's the code I wrote to wrap libidn: > > > > https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c > > > > It's geared for Lua, but the code itself is in C, but it should be pretty > > easy to see what is going on. > > Thank you Sean! I am, in fact starting to wrap libidn (actually > libidn2) so far seems that i got a working unicode->ascii function. Of > course i will be happy to share the results and maybe, if some people > are going to be interested (and if i succeed! :)) i could extract a > library from this code (it is integrated in the client at moment). FWIW i managed to switch from URI to IRI in my client, fortunately i was able to reuse most of the URL parser. For Punycode i wrapped a C library. Not sure if i did everything right but have met no regression so far. If other lisper (CL) are interested we could arrange a library from this code. Bye! C.
> On Dec 14, 2020, at 16:09, cage <cage-dev at twistfold.it> wrote: > > FWIW i managed to switch from URI to IRI in my client, Bravo! :) "Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." ? Alan Kay
On Mon, Dec 14, 2020 at 04:37:19PM +0100, Petite Abeille wrote: > > > > On Dec 14, 2020, at 16:09, cage <cage-dev at twistfold.it> wrote: > > > > FWIW i managed to switch from URI to IRI in my client, > > Bravo! :) Thank you! :) Honestly most of the hard work has been done by libidn and two excellent lisp libraries; sometimes people complains that CL is full of half-baked libraries but there are some high quality too, FFI and parser generator are two that excels, in my opinion. Also the author of the latter actually helped me spotting a bug in the code. :) Bye! C.
It was thus said that the Great Philip Linde once stated: > On Fri, 11 Dec 2020 18:14:47 -0500 > Sean Conner <sean at conman.org> wrote: > > > I know, because I tried on a few systems I have access to, > > and they all failed to look up "café.mozz.us" (yes, via getaddrinfo() even). > > They all worked when I looked up "xn--caf-dma.mozz.us". > > I don't think Stephane means different systems in the sense of > different computers, but different systems as in different name > resolution systems. Yes, I understand that, but on the systems I used, all FAILED to resolve "café.mozz.us". Does that mean the client just gives up and says "domain not found" because local configuration doesn't work with UTF-8 domain names? That, to me, sounds like what Stephane is advocating for when they say "no conversion to punycode". It's wonderful that Stephane's language du jour will apparently handle it for the user, but are the rest of us out of luck? [1] THIS is what I'm asking about. -spc [1] And a response of "here's a nickel, get yourself a real computer language" is NOT a valid response.
Hello, I didn't see any email mentioning this so I thought I'd share the link here. Lagrange[1] has gone ahead with IDN support, and the details are found in this post[2] by skyjake. The relevant points are as follows. > * The full URL is NFC normalized before sending it to a server. > * Domain names with non-ASCII characters are encoded to Punycode before > doing a DNS lookup. The Punycode version of the domain name is sent to > the server in the request URL, and also used for verifying the server > certificate. This is what I plan on doing in Amfora as well. I will defer to Solderpunk's judgement, which is coming[3], but until then that's my plan. The only difference is that I was planning on allowing both punycoded domains and IDNs in certs, to simplify things for sysadmins. But if Lagrange isn't allowing it, then maybe I shouldn't... this is quickly approaching "de facto standard" territory. For now I will err on the permissive side in that case, allowing both, but this is something I'd like to hear from Solderpunk on. gemget will do the same, as it uses the go-gemini[4] library as well. 1: https://gmi.skyjake.fi/lagrange/ 2: gemini://skyjake.fi/gemlog/2020-12_idns-in-lagrange.gmi 3: gemini://gemini.circumlunar.space/~solderpunk/pikkulog/2020-12.gmi 4: https://github.com/makeworld-the-better-one/go-gemini/issues/10 Cheers, makeworld
---