💾 Archived View for gemi.dev › gemini-mailing-list › 000532.gmi captured on 2023-11-04 at 12:54:32. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Some reading on IRIs and IDNs

Jason McBrayer <jmcbray (a) carcosa.net>

Hi, all. The discussion on IRIs and IDNs is a little intense, and I
thought I would take a step back and do some reading on it. I'm not
monolingual, but I am ISO-8859-1-lingual, if that makes sense, so some
of the issues are new to me.

So, there's an overview of all the issues involved here:
https://www.w3.org/International/articles/idn-and-iri/. This article
(from 2008) goes over the things you need to do to implement support for
IRIs, without going too much into the technical details. It makes things
look pretty straightforward and cut-and-dried, but...

In terms of actual standardization, things are kind of a mess. See this
page: https://www.w3.org/International/wiki/IRIStatus. This page brings
up the real issues with the standard.

It seems like the effort to standardize IRIs in the same framework as
URLs, URNs, and URIs fell apart in 2014. The effort was picked up by the
HTML5 WHATWG, which has their own "living standard" called URL:
http://url.spec.whatwg.org/. The URL standard focuses somewhat on
parsing/processing/serializing international URLs, which is useful to
us, but it is also *extremely* WWW-centric. It doesn't really take into
account non-HTTP(S) URLs, especially ones that are not very web-like,
like mailto or schemes where the authority field is not a domain name. 
Much of the spec focuses on things like how a web browser should
represent URLs in the address bar and in text.

This *probably* contributes to the lack of IRI-parsing libraries for
various languages: there's no standard for them to implement!

Given all that... maybe we should just consider our use cases and see
what the minimum we have to do is?

As I see it, the main requirement is that authors want to be able to use
non-ASCII characters in both the domain part and the path part of the
links in their documents, and have that work with no problems. IMO this
is a *reasonable expectation* for a retrofuturistic protocol like
Gemini.

Now, what does that require of client authors and server authors?

What is the *absolute minimum* we can require of client and server
authors and have things work?

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Jason McBrayer once stated:
>   
> Now, what does that require of client authors and server authors?
> 
> What is the *absolute minimum* we can require of client and server
> authors and have things work?

  As I've stated, I've created an IRI parser per RFC-3987 [1] and it was a
very minimal change to my original URL parser per RFC-3986 [2].  Basically,
it allows any UTF-8 character past codepoint 128 to be used as-is in the IRI. 
Languages that have URL parsers may or may not support UTF-8 data.  So IRI
parsing may or may not be an issue (aside from Unicode normalization) on a
per-language basis.

  I've also started down the punycode rabbit hole.  As Stephane has stated,
DNS *can* support UTF-8, but such support isn't wide, nor is it a standard. 
Punycode was developed to encode UTF-8 with ASCII in a most Byzantine way. 
It does have an RFC (RFC-3492) and said RFC does contain code for encoding
and decoding punycode (but it's in C, and the API is ... not what I would
define but it can be worked with).  IDN support, from my experience over the
past two days, is *harder* than IRI, although the concern was mostly the
other way.  I haven't actually *gotten* to the part of converting a domain
name to punycode but in general, to convert a domain name:

	for each label [3]:
		if label has non-ASCII characters
			convert to punycode, prepend "xn--" to result

so a domain name like "??.english.s?d?r.???" is converted thusly:

	?? 	-> 99zt52a -> xn--99zt52a
	english	-> (no conversion required)
	s?d?r 	-> sdr-rlad -> xn--sdr-rlad

	??? 	-> wgbh1c -> xn--wgbh1c

to become "xn--99zt52a.english.xn--sdr-rlad.xn--wgbh1c" (and that last
segment is giving my editor fits because it's right-to-left).  The example
is extreme but it's just there to serve as an example of how to go about it.
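
A minimal Python sketch of the per-label rule above. The function name `domain_to_ascii` is made up for illustration, and it relies on the raw `punycode` codec from the Python standard library. It deliberately skips the normalization (nameprep) step, so it is not a full IDNA implementation.

```python
def domain_to_ascii(domain):
    """Punycode only the labels that contain non-ASCII characters."""
    converted = []
    for label in domain.split("."):
        if label.isascii():
            # No conversion required for plain ASCII labels.
            converted.append(label)
        else:
            # Raw punycode, then the "xn--" prefix that marks an IDN label.
            encoded = label.encode("punycode").decode("ascii")
            converted.append("xn--" + encoded)
    return ".".join(converted)


print(domain_to_ascii("caf\u00e9.mozz.us"))  # xn--caf-dma.mozz.us
```

The ASCII check matters because the raw codec would otherwise append a stray "-" to labels that need no conversion.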

  So given my experiences so far, I would say the easiest way to deal with
all this is to make it a client issue.  Hold off on IDN support for now (see
below for some more questions about it), but UTF-8 in the path and query
should be allowed in text/gemini, but encoded before making a request.  A
client, given a link like:

=> gemini://gemini.bortzmeyer.org:8965/café?foo=bar Order from the Café

should be able to parse it with the UTF-8 characters, but convert it to:

	gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar

before making the request.  At the very least, tools could be developed to
encode links in text/gemini before publishing them if no one wants the spec
to change at all.  
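
As a rough sketch of that client-side step, assuming Python's `urllib.parse` and a made-up helper name `encode_link`: the path and query are percent-encoded as UTF-8 and the host is left untouched, since IDN handling is being held off here. The `safe` sets are an assumption chosen to avoid double-encoding escapes that are already present.

```python
from urllib.parse import quote, urlsplit


def encode_link(link):
    """Percent-encode non-ASCII characters in the path and query of a link."""
    parts = urlsplit(link)
    # "%" stays in the safe set so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    uri = f"{parts.scheme}://{parts.netloc}{path}"
    if query:
        uri += "?" + query
    return uri


print(encode_link("gemini://gemini.bortzmeyer.org:8965/caf\u00e9?foo=bar"))
# gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar
```

A publishing tool could run the same function over every link line in a text/gemini file before upload.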

I feel that would be the easiest, least breaking thing to do now.  Making
IDN (punycode) mandatory might require a bit more discussion as I'm not sure
of the language support.  I'm not even sure what name should be in a
certificate for an IDN---the full UTF-8 version, or the punycode version, or
both?  What's currently done in HTTP land about this? (answering this will
at least point in a direction, even if we don't want to go that direction).

  -spc

[1]	https://github.com/spc476/LPeg-Parsers/blob/master/iri.lua

[2]	https://github.com/spc476/LPeg-Parsers/blob/master/url.lua

[3]	The domain name "gemini.conman.org" has three labels, "gemini",
	"conman" and "org".  The term "label" is DNS lingo.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Tue, Dec 08, 2020 at 09:49:50PM -0500,
 Jason McBrayer <jmcbray at carcosa.net> wrote 
 a message of 48 lines which said:

[Thanks for the detailed analysis of the issue]

> this is a *reasonable expectation* for a retrofuturistic protocol
> like Gemini.

I love "retrofuturistic" and I hope it will be used in the official
specification.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Wed, Dec 09, 2020 at 12:26:51AM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 73 lines which said:

> DNS *can* support UTF-8, but such support isn't wide, nor is it a
> standard.

Wrong. RFC 2181, which clarifies that "any binary string whatever can
be used as the label of any resource record" is part of the Standards
Track. The reasons why few people use UTF-8 in domain names are:




  may react badly to non-ASCII domain names,


  names (the case-insensitivity rule) but it does not extend to
  Unicode.

> It does have an RFC (RFC-3492) and said RFC does contain code for
> encoding and decoding punycode (but it's in C, and the API is
> ... not what I would define but it can be worked with).

There is an implementation of Punycode in every standard library,
whatever your language.

> so a domain name like "??.english.s?d?r.???" is converted thusly:

In Python (but it is as simple in any other language):

>>> import codecs
>>> print(codecs.encode("??.English.s?d?r.???", encoding="idna"))
b'xn--99zt52a.English.xn--sdr-rlad.xn--wgbh1c'

(Note that the encodings.idna library of Python standard library is
limited to IDN v1.)

So, almost nothing to do for the programmer. I don't agree with your
assessment that IDN is harder than IRI.

> I'm not even sure what name should be in a certificate for an
> IDN---the full UTF-8 version, or the punycode version, or both?
> What's currently done in HTTP land about this? (answering this will
> at least point in a direction, even if we don't want to go that
> direction).

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc8399.txt

But it depends on the CA. It seems Let's Encrypt does not want to
handle UTF-8 and requires Punycode.

> [3]	The domain name "gemini.conman.org" has three labels, "gemini",
> 	"conman" and "org".  The term "label" is DNS lingo.

Let's be picky, there are four, there is also the root :-)

Link to individual message.

Jason McBrayer <jmcbray (a) carcosa.net>

Stephane Bortzmeyer <stephane at sources.org> writes:

> On Wed, Dec 09, 2020 at 12:26:51AM -0500,
>  Sean Conner <sean at conman.org> wrote 
>  a message of 73 lines which said:
>
>> DNS *can* support UTF-8, but such support isn't wide, nor is it a
>> standard.
>
> Wrong. RFC 2181, which clarifies that "any binary string whatever can
> be used as the label of any resource record" is part of the Standards
> Track. The reasons why few people use UTF-8 in domain names are:

How widespread is support in client resolver libraries and in servers,
though? It's one thing to say, "yes, the standard is to support
non-ASCII names", and another entirely to say "just sending non-ASCII
names to your DNS server will work".

> There is an implementation of Punycode in every standard library,
> whatever your language.

Not strictly true. There isn't one in Common Lisp's standard library,
for example. There is one in Quicklisp, the widely used package
repository though, so that's okay for me.

I kind of feel like we should just bite the bullet and admit that a
fully-compliant client needs to punycode domains when looking them up,
and encode URLs when sending them to the server, and that
fully-compliant servers need to decode URLs when resolving them.
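
For the server half of that proposal, a small sketch in Python, assuming `urllib.parse` and a hypothetical helper `requested_path`. It only shows the decoding step. A real server would also have to guard against things like an encoded "/" or ".." before touching the filesystem.

```python
from urllib.parse import unquote, urlsplit


def requested_path(request_line):
    """Return the percent-decoded path from a Gemini request line."""
    url = request_line.rstrip("\r\n")
    # errors="strict" rejects escapes that do not decode to valid UTF-8.
    return unquote(urlsplit(url).path, errors="strict")


print(requested_path("gemini://gemini.bortzmeyer.org:8965/caf%C3%A9?foo=bar\r\n"))
# /café
```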

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Wed, Dec 09, 2020 at 10:08:25AM -0500, Jason McBrayer wrote:

[punycode library]

> There is one in Quicklisp,

Glad to hear that! :)

Bye!
C.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 9, 2020, at 09:38, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> There is an implementation of Punycode in every standard library,
> whatever your language.

There is also GNU Libidn (idn):

# idn --quiet räksmörgås.se blåbærgrød.no
xn--rksmrgs-5wao1o.se
xn--blbrgrd-fxak7p.no

https://www.gnu.org/software/libidn/manual/html_node/Invoking-idn.html

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Wed, Dec 09, 2020 at 12:26:51AM -0500,
>  Sean Conner <sean at conman.org> wrote 
>  a message of 73 lines which said:
> 
> > It does have an RFC (RFC-3492) and said RFC does contain code for
> > encoding and decoding punycode (but it's in C, and the API is
> > ... not what I would define but it can be worked with).
> 
> There is an implementation of Punycode in every standard library,
> whatever your language.
> 
> > so a domain name like "??.english.s?d?r.???" is converted thusly:
> 
> In Python (but it is as simple in any other language):
> 
> >>> print(codecs.encode("??.English.s?d?r.???", encoding="idna"))
> b'xn--99zt52a.English.xn--sdr-rlad.xn--wgbh1c'
> 
> (Note that the encodings.idna library of Python standard library is
> limited to IDN v1.)
> 
> So, almost nothing to do for the programmer. I don't agree with your
> assessment that IDN is harder than IRI.

  I'm sorry, but the two languages I work in do *not* have an implementation
of punycode in their standard library.  I *was* able to find code for C
(from the RFC, which at least I know will work per the RFC) and could not
find one for Lua.  There's a reason why I'm having to muck with this.  The
API I have for C is *not* set up to handle domain names (breaking out the
labels, prepending or removing the "xn--", etc.).

  It's wonderful that the language you use comes with punycode support in
its standard library.  Not all languages have that.  I'm looking at the list
of clients [1] and there's one client written in a language I haven't heard
of before (Vala).  Other languages used are Nim, scheme and Tcl.  I would be
surprised if Vala or Nim have a punycode implementation.

  -spc (But hey, write your own client that does everything you want and show
	us all how easy it is)

[1]	gemini://gemini.circumlunar.space/software/

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

> I'm sorry, but the two languages I work in do not have an implementation
> of punycode in their standard library.

Isn't this somewhat irrelevant in this case? It's unfortunate they don't
have an implementation, but as it stands right now, most Gemini clients
will not handle Unicode domain names at all. Punycoding domains will
solve that issue, and languages that don't have it in their stdlib
can either use a third-party library, or if that's not possible then
those languages will just ignore punycoding entirely. Obviously that's
not great for those languages, but it doesn't make sense to me to not
have punycoding at all because of that.

At the end of the day, Gemini clients must be allowed to support Unicode
domains. Perhaps the term "SHOULD" as defined by RFC2119 should be used
in the spec in this case.


makeworld

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 9, 2020, at 23:17, Sean Conner <sean at conman.org> wrote:
> 
> could not find one for Lua

Would this suit?

https://github.com/haste/lua-idn/blob/master/idn.lua

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Petite Abeille once stated:
> 
> > On Dec 9, 2020, at 23:17, Sean Conner <sean at conman.org> wrote:
> > 
> > could not find one for Lua
> 
> Would this suit?
> 
> https://github.com/haste/lua-idn/blob/master/idn.lua

  Huh ... turns out I should have searched for 'IDN' instead of 'punycode'. 
The code is for Lua 5.2---it will take a bit of work to get it working for
Lua 5.3 but that still leaves normalization issues.  I'm working with the
GNU libidn right now because it can do normalization, otherwise, I can get
two different IDNs for the same (visually) domain:

	résumé	resume-jxde
	résumé	rsum-bpad
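
The two results above can be reproduced with a short Python sketch. It assumes the first spelling uses "e" followed by a combining acute accent (U+0301) and the second uses the precomposed U+00E9. The helper name `label_to_ascii` is hypothetical, and NFC is used only for illustration. IDNA's own nameprep step specifies NFKC.

```python
import unicodedata


def label_to_ascii(label):
    """Normalize a label to NFC, then punycode it if it is not pure ASCII."""
    nfc = unicodedata.normalize("NFC", label)
    if nfc.isascii():
        return nfc
    return "xn--" + nfc.encode("punycode").decode("ascii")


decomposed = "re\u0301sume\u0301"    # e + combining acute accent
precomposed = "r\u00e9sum\u00e9"     # single code point U+00E9

# Without normalization the two spellings encode differently.
print(decomposed.encode("punycode"))   # b'resume-jxde'
print(precomposed.encode("punycode"))  # b'rsum-bpad'

# After normalization both map to the same label.
print(label_to_ascii(decomposed), label_to_ascii(precomposed))
```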

  -spc

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 00:23, Sean Conner <sean at conman.org> wrote:
> 
> 	r?sum?	resume-jxde
> 	r?sum?	rsum-bpad

Talking of which:

Lua Parser for Punycode/IDN Homograph Attack
https://community.rsa.com/community/products/netwitness/blog/2017/04/24/lua-parser-for-punycodeidn-homograph-attack

Link to individual message.

William Orr <will (a) worrbase.com>

The lua wiki has various libs for normalization: http://lua-users.org/wiki/LuaUnicode

I also mentioned libicu earlier, which is a very commonly used lib for 
unicode handling, and the wiki even mentions a lua library that provides 
bindings to libicu (although the bindings may be out of date).

https://github.com/unicode-org/icu

Hope that helps

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 01:01, William Orr <will at worrbase.com> wrote:
> 
> The lua wiki has various libs for normalization: http://lua-users.org/wiki/LuaUnicode

Indeed, ustring sports some normalization routines:

https://github.com/wikimedia/mediawiki-extensions-Scribunto/tree/master/includes/engines/LuaCommon/lualib/ustring

While on the topic, do people validate all UTF-8 coming their way? Ala 
iconv -f UTF-8 -t UTF-8?
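
A rough Python equivalent of that iconv round trip, as a sketch only. The function name `is_valid_utf8` is made up here. It relies on the strict decoder in the standard library, which rejects malformed sequences.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"caf\xc3\xa9"))  # True: well-formed UTF-8
print(is_valid_utf8(b"caf\xe9"))      # False: a lone ISO-8859-1 byte
```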

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great colecmac at protonmail.com once stated:
> > I'm sorry, but the two languages I work in do not have an implementation
> > of punycode in their standard library.
> 
> Isn't this somewhat irrelevant in this case? 

  It was more a comment about this quote:

> There is an implementation of Punycode in every standard library,
> whatever your language.

  There is *not* an implementation of punycode in every standard library,
whatever your language.  In a lot of currently in-use languages?  Probably,
but not *all*.  

> It's unfortunate they don't
> have an implementation, but as it stands right now, most Gemini clients
> will not handle Unicode domain names at all. 

  And I'm not seeing anyone else trying to update clients to do this, even
if in an exploratory nature.  What?  Are they just waiting for a decree?

> Punycoding domains will
> solve that issue, and languages that don't have it in their stdlib
> can either use a third-party library, or if that's not possible then
> those languages will just ignore punycoding entirely. 

  Poking around the GNU libidn documentation, I found Appendix B [1]
worrisome because the IDN rabbit hole just got deeper with U+2024 (ONE DOT
LEADER) and U+2485 (DIGIT FIVE FULL STOP).  Should I worry about it?  I
don't know.  This internationalization stuff is complex and makes me want to
throw up hands in the air, scream a bit, and go back to the simplicity of
ASCII.

  In the end, I'll probably just do Unicode normalization, then punycode and
call it a day.
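
To make the label-separator worry concrete, here is a small Python sketch using the standard `unicodedata` module. It assumes NFKC is the normalization applied, and it uses U+248C, the code point whose Unicode name is DIGIT FIVE FULL STOP. Both characters fold to text containing an ordinary ASCII dot, so a single visible character can introduce a new label boundary.

```python
import unicodedata

for ch in ("\u2024", "\u248c"):
    folded = unicodedata.normalize("NFKC", ch)
    # Show the code point, its Unicode name, and what NFKC turns it into.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {folded!r}")

# U+2024 ONE DOT LEADER -> '.'
# U+248C DIGIT FIVE FULL STOP -> '5.'
```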

> Obviously that's
> not great for those languages, but it doesn't make sense to me to not
> have punycoding at all because of that.
> 
> At the end of the day, Gemini clients must be allowed to support Unicode
> domains. Perhaps the term "SHOULD" as defined by RFC2119 should be used
> in the spec in this case.

  So when are you going to update gemget and Amfora to support punycode? 
I've heard it's easy to do.  Or are you waiting for a spec change first?

  -spc (Seriously, I feel like I'm the only one *doing* anything here)

[1]	https://www.gnu.org/software/libidn/manual/html_node/On-Label-Separators.html

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

> > It's unfortunate they don't
> > have an implementation, but as it stands right now, most Gemini clients
> > will not handle Unicode domain names at all.
>
> And I'm not seeing anyone else trying to update clients to do this, even
> if in an exploratory nature. What? Are they just waiting for a decree?
>
> [snip]
>
> So when are you going to update gemget and Amfora to support punycode?
> I've heard it's easy to do. Or are you waiting for a spec change first?
>
> -spc (Seriously, I feel like I'm the only one doing anything here)

Yes, I am waiting for Solderpunk. On the subject of IDNs, it seems obvious
to me that punycode will be accepted as the thing to do for DNS, and that
Unicode should be sent to the server, but some questions around certs
and normalization still remain. I've outlined them here[1].

I suppose you're correct about being the only one doing anything, but I
don't feel like it makes sense to do anything yet. The solution is simple
code-wise (for IDNs), and so I don't feel the need to experiment, and I'd
rather implement this once, in-line with the spec, rather than multiple
times if Solderpunk says something different.

I guess this is just two different approaches to handling issues with a spec.
Gemini is intended to be very strict and not extensible, and is driven by
its spec rather than what people end up doing in the wild, like on the Web.

I don't think you're doing something wrong or bad by publicly experimenting,
but I'd rather not make things more uncertain by implementing something
in non-toy/demo clients before it's official.

1: https://github.com/makeworld-the-better-one/go-gemini/issues/10

Cheers,
makeworld

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great colecmac at protonmail.com once stated:
> > > It's unfortunate they don't
> > > have an implementation, but as it stands right now, most Gemini clients
> > > will not handle Unicode domain names at all.
> >
> > And I'm not seeing anyone else trying to update clients to do this, even
> > if in an exploratory nature. What? Are they just waiting for a decree?
> >
> > [snip]
> >
> > So when are you going to update gemget and Amfora to support punycode?
> > I've heard it's easy to do. Or are you waiting for a spec change first?
> >
> > -spc (Seriously, I feel like I'm the only one doing anything here)
> 
> Yes, I am waiting for Solderpunk. On the subject of IDNs, it seems obvious
> to me that punycode will be accepted as the thing to do for DNS,

  And why would Solderpunk choose this if no one has bothered to even look
into the possible issues with respect to coding?  

> and that
> Unicode should be sent to the server, 

and that would be a breaking change on the protocol.  Just having the client
accept IRIs and send URIs wouldn't change the protocol.  But aside from me,
NO ONE bothered to even test this out!

> but some questions around certs
> and normalization still remain. I've outlined them here[1].
> 
> I suppose you're correct about being the only one doing anything, but I
> don't feel like it makes sense to do anything yet. The solution is simple
> code-wise (for IDNs), and so I don't feel the need to experiment, 

  And if it's so simple, why not do it? But I get it, you'd rather wait
until a yeah/nay decision is made.  I mean, who wouldn't like

	single digit response codes
	no client certificates
	a link line of [text|url]
	no virtual hosting
	a request format ala gopher (including TABs!)
	no redirection
	no indication of whether pages are actually gone vs not found
	no MIME parameters

  That's pretty much what the Gemini spec *would have been* had some people

Bikeshedding with talk is easy, bikeshedding with an implementation is
tedious.

  -spc

Link to individual message.

Michael Lazar <lazar.michael22 (a) gmail.com>

I've been following along with my own software in the background.

First of all, my domain registrar won't even let me put unicode characters
in an A record without automatically converting them to punycode for me.

café.mozz.us -> xn--caf-dma.mozz.us

Next, my naive python test client just kind of works as-is [0][1]. It will
convert unicode DNS names to punycode under the hood before doing the lookup.
Any unicode in the URL (IRI?) is left alone because.. why would a client ever
muck around with the URL that the user gives them? That sounds like a bad idea
to me.

My server (running jetforce) also works as-is. All I had to do was add an entry
for "café.mozz.us" as a recognized hostname, and there you go.

 ```
jetforce-client gemini://café.mozz.us
Welcome to AV-98!
Enjoy your patrol through Geminispace...
?? WELCOME TO MOZZ.US ??
 ```

Requesting unicode path names also works with no changes on my part

 ```
jetforce-client gemini://café.mozz.us/files/𝒻𝒶𝓃𝒸𝓎.txt
20 text/plain
This is a test file with unicode characters in the name.?
 ```

As do quoted path names (the server will unquote the URL before it
attempts to load the file)

 ```
jetforce-client gemini://café.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt
20 text/plain
This is a test file with unicode characters in the name.
 ```

Does this mean my server is already compliant? What else should I try?

- Michael

[0] https://github.com/michael-lazar/jetforce/blob/master/jetforce_client.py
[1] It's nice to finally get a win for python after fighting with TLS
for so long

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Michael Lazar once stated:
> I've been following along with my own software in the background.

  Thank you.  Without an implementation it is difficult to see where the
landmines are.  So, with that said ...

> First of all, my domain registrar won't even let me put unicode characters
> in an A record without automatically converting them to punycode for me.
> 
> café.mozz.us -> xn--caf-dma.mozz.us

  Okay.

> Next, my naive python test client just kind of works as-is [0][1]. It will
> convert unicode DNS names to punycode under the hood before doing the lookup.
> Any unicode in the URL (IRI?) is left alone because.. why would a client ever
> muck around with the URL that the user gives them? That sounds like a bad idea
> to me.

  That's debatable.  The percent encoding doesn't change the meaning, just
the "envelope" so-to-speak.  

> My server (running jetforce) also works as-is. All I had to do was add an entry
> for "café.mozz.us" as a recognized hostname, and there you go.

  Okay, about that.  I modified my own stupid-simple client to support IRIs
and to convert the hostname via punycode (finally!).  The code changes in
the client weren't that large (once I got the punycode module written, it
was one line to switch from URI parsing to IRI parsing, one line to add the
punycode module, and one line modified to punycode the host when making a
connection) but I'm encountering an issue.  If I use:

	gemini://café.mozz.us/files/𝒻𝒶𝓃𝒸𝓎.txt

(and send that as the request) It works, and I get the file.  But when I go
to:

	gemini://xn--caf-dma.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

(and send that as the request) I get an error 53 (no proxy allowed).  When I
go to:

	gemini://café.mozz.us/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

(and send that as the request) it works as well.  I would expect the second
example to work along with the first and third examples. They all reference
the same resource in the same server.

  Another issue that I've thought of, the length of each request---the first
is 53 bytes, the second is 99 bytes and the third is 93 bytes.  This *could*
be an issue with respect to the overall limit of 1024 bytes for a
request.
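
The byte counts can be checked with a few lines of Python. This sketch assumes the host is "café.mozz.us" (precomposed é) and that the five-character file name is the one that the percent-encoded form decodes to. The lengths exclude the trailing CRLF.

```python
from urllib.parse import unquote

ESCAPED = "%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E"

requests = [
    # Raw UTF-8 host and path.
    "gemini://caf\u00e9.mozz.us/files/" + unquote(ESCAPED) + ".txt",
    # Punycoded host, percent-encoded path.
    "gemini://xn--caf-dma.mozz.us/files/" + ESCAPED + ".txt",
    # Raw UTF-8 host, percent-encoded path.
    "gemini://caf\u00e9.mozz.us/files/" + ESCAPED + ".txt",
]

sizes = [len(r.encode("utf-8")) for r in requests]
print(sizes)  # [53, 99, 93]
```

Each four-byte character grows to twelve bytes when percent-encoded, which is where most of the difference comes from.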

  As far as servers go, GLV-1.12556 still uses the URL parser, and would
choke on an IRI being given as a request (since it expects non-ASCII
characters to be encoded per RFC-3986).  That would be an easy fix for me
(just switch to the IRI parser) but allowing IRIs would be an actual change
to the protocol.  I'm just saying.

> Does this mean my server is already compliant? What else should I try?

  Perhaps allow "xn--caf-dma.mozz.us" as a hostname?
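
One way a server could treat both spellings as the same host is to reduce every hostname to its ASCII form before comparing. The sketch below uses Python's built-in `idna` codec (IDNA 2003 only). The names `canonical_host`, `KNOWN_HOSTS` and `is_known_host` are invented for illustration and are not jetforce's actual configuration or API.

```python
def canonical_host(host):
    """Reduce a hostname to its lowercase ASCII (punycode) form."""
    return host.encode("idna").decode("ascii").lower()


# Hypothetical configuration: hostnames this server answers for.
KNOWN_HOSTS = {canonical_host("caf\u00e9.mozz.us")}


def is_known_host(host):
    return canonical_host(host) in KNOWN_HOSTS


print(is_known_host("caf\u00e9.mozz.us"))        # True
print(is_known_host("xn--caf-dma.mozz.us"))   # True
print(is_known_host("example.org"))           # False
```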

  -spc

> [0] https://github.com/michael-lazar/jetforce/blob/master/jetforce_client.py
> [1] It's nice to finally get a win for python after fighting with TLS
> for so long

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 02:16, Sean Conner <sean at conman.org> wrote:
> 
>  -spc (Seriously, I feel like I'm the only one *doing* anything here)

:))

To be is to do. -Socrates
To do is to be. -Plato
Do-be-do-be-do. -Sinatra

https://quoteinvestigator.com/2015/01/26/doing/

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 05:16, Michael Lazar <lazar.michael22 at gmail.com> wrote:
> 
> Next, my naive python test client just kind of works as-is [0][1]. It will
> convert unicode DNS names to punycode under the hood before doing the lookup.

Perhaps of interest:

How do I know when to do a UTF8 or punycode DNS query?
https://stackoverflow.com/questions/16837513/how-do-i-know-when-to-do-a-utf8-or-punycode-dns-query

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Petite Abeille once stated:
> 
> 
> > On Dec 10, 2020, at 02:16, Sean Conner <sean at conman.org> wrote:
> > 
> >  -spc (Seriously, I feel like I'm the only one *doing* anything here)
> 
> :))
> 
> To be is to do. -Socrates
> To do is to be. -Plato
> Do-be-do-be-do. -Sinatra

  You forgot:

  Do-do-do-do. -Serling

  -spc (There is a fifth dimension, beyond that which is known to man ...)

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 09:22, Sean Conner <sean at conman.org> wrote:
> 
>  Do-do-do-do. -Serling

"It may be said with a degree of assurance that not everything that meets 
the eye is as it appears." 
- Rod Serling, The Twilight Zone

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Wed, Dec 09, 2020 at 08:16:34PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 51 lines which said:

> This internationalization stuff is complex and makes me want to
> throw up hands in the air, scream a bit, and go back to the
> simplicity of ASCII.

ASCII is not simple (think of case-insensitivity), and then only for
people for whom Latin is the first script they learned.

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Wed, Dec 09, 2020 at 11:16:30PM -0500,
 Michael Lazar <lazar.michael22 at gmail.com> wrote 
 a message of 55 lines which said:

> First of all, my domain registrar won't even let me put unicode
> characters in an A record without automatically converting them to
> punycode for me.

Small detail: it is an issue with the DNS *hoster*. Which may be the
same as the registrar or not (if you host your own authoritative name
servers).

> My server (running jetforce) also works as-is.

But not Gemserv (I have to figure out why).

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Wed, Dec 09, 2020 at 10:25:40PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 53 lines which said:

>   And if it's so simple, why not do it? But I get it, you'd rather wait
> until a yeah/nay decision is made.

It is reasonable to discuss it first because we need a *standard* way
of doing it. Clients and servers must agree or there will be no
interoperability.

Also, I suspect the problem is partially a social one:  programs are
written by programmers. Most programmers are familiar with English and
with the Latin script. Therefore, the issue does not seem pressing for
most of them.

Link to individual message.

marc <marcx2 (a) welz.org.za>

Hi

> > This internationalization stuff is complex and makes me want to
> > throw up hands in the air, scream a bit, and go back to the
> > simplicity of ASCII.
> 
> ASCII is not simple (think of case-insensitivity) and then only for
> people whose latin is the first script they learned.

I am struggling to take that statement seriously,
and not just because it breaks set theory :-)

Case conversion in ascii is xor 0x20 - that doesn't
even require a branch/comparison and can compile down
to a single assembly instruction.
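
The xor trick, sketched in Python with a hypothetical helper name. Note that it flips case rather than folding to one case, and it is only meaningful for the 52 ASCII letters. Applied to other characters it silently produces a different character.

```python
def flip_ascii_case(ch):
    """Toggle the case of one ASCII letter by flipping bit 0x20."""
    return chr(ord(ch) ^ 0x20)


print(flip_ascii_case("a"))  # A
print(flip_ascii_case("Z"))  # z
print(flip_ascii_case("@"))  # ` (not a letter, so the result is unrelated)
```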

This versus *many* tens or even hundreds of thousands of
lines of puny/unicode/etc logic.

But let's assume upper/lowercase characters in ascii
are confusing. That would be an argument to restrict
a simple system such as gemini urls to a subset of ascii
which excludes uppercase characters. Which I could support,
and which is effectively what dns ends up doing - as
do the majority of http urls. "Lowest common denominator
for maximum interoperability" is a good maxim.

If ascii case conversion is confusing, then this isn't
an excuse to grow this confusion by many orders of
magnitude. That makes the problem a lot worse.

  "Oops, I've burnt my toast - I know, let's solve that
   by burning down the house"

regards

marc

Link to individual message.

Jason McBrayer <jmcbray (a) carcosa.net>

Petite Abeille <petite.abeille at gmail.com> writes:

>> On Dec 10, 2020, at 05:16, Michael Lazar <lazar.michael22 at gmail.com> wrote:
>> 
>> Next, my naive python test client just kind of works as-is [0][1]. It will
>> convert unicode DNS names to punycode under the hood before doing the lookup.
>
> Perhaps of interest:
>
> How do I know when to do a UTF8 or punycode DNS query?
> https://stackoverflow.com/questions/16837513/how-do-i-know-when-to-do-a-utf8-or-punycode-dns-query

You can unconditionally just run the punycode encoder over domain names,
though - an all-ASCII domain name will be unchanged. The stackoverflow
question deals with non-internet domain names in Active Directory, which
we don't have to support.
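
A quick Python check of that claim, with one caveat that is easy to trip over: it holds for the IDNA ToASCII operation (the `idna` codec in the standard library, which implements IDNA 2003), not for the raw `punycode` codec, which appends a delimiter even to pure ASCII input.

```python
# The idna codec leaves an all-ASCII name unchanged...
print("gemini.conman.org".encode("idna"))   # b'gemini.conman.org'

# ...and converts only the labels that need it.
print("caf\u00e9.mozz.us".encode("idna"))      # b'xn--caf-dma.mozz.us'

# The raw punycode codec is not safe to apply blindly to ASCII labels.
print("conman".encode("punycode"))          # b'conman-'
```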

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 15:20, Jason McBrayer <jmcbray at carcosa.net> wrote:
> 
> non-internet domain names in Active Directory, which
> we don't have to support.

Hmmm... so... no .local queries ala Cheshire?

https://tools.ietf.org/html/rfc6762#appendix-F

Perhaps worthwhile quoting in full:

Appendix F.  Use of UTF-8

   After many years of debate, as a result of the perceived need to
   accommodate certain DNS implementations that apparently couldn't
   handle any character that's not a letter, digit, or hyphen (and
   apparently never would be updated to remedy this limitation), the
   Unicast DNS community settled on an extremely baroque encoding called
   "Punycode".  Punycode is a remarkably ingenious encoding
   solution, but it is complicated, hard to understand, and hard to
   implement, using sophisticated techniques including insertion unsort
   coding, generalized variable-length integers, and bias adaptation.
   The resulting encoding is remarkably compact given the constraints,
   but it's still not as good as simple straightforward UTF-8, and it's
   hard even to predict whether a given input string will encode to a
   Punycode string that fits within DNS's 63-byte limit, except by
   simply trying the encoding and seeing whether it fits.  Indeed, the
   encoded size depends not only on the input characters, but on the
   order they appear, so the same set of characters may or may not
   encode to a legal Punycode string that fits within DNS's 63-byte
   limit, depending on the order the characters appear.  This is
   extremely hard to present in a user interface that explains to users
   why one name is allowed, but another name containing the exact same
   characters is not.  Neither Punycode nor any other of the "ASCII-
   Compatible Encodings" proposed for Unicast DNS may be used
   in Multicast DNS messages.  Any text being represented internally in
   some other representation must be converted to canonical precomposed
   UTF-8 before being placed in any Multicast DNS message.
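
The "try the encoding and see whether it fits" check described in the appendix can be sketched like this. It is a simplified illustration in Python using the built-in "punycode" codec; the names `ace_label` and `fits_dns_label` are hypothetical, and the sketch skips the normalization and case-mapping that a real IDNA implementation performs before encoding.

```python
DNS_LABEL_LIMIT = 63  # maximum length of one DNS label, in bytes


def ace_label(label: str) -> str:
    """Encode a single label; non-ASCII labels get the "xn--" prefix."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def fits_dns_label(label: str) -> bool:
    """There is no shortcut: encode first, then measure the result."""
    return len(ace_label(label)) <= DNS_LABEL_LIMIT


print(ace_label("caf\u00e9"))       # xn--caf-dma
print(fits_dns_label("caf\u00e9"))  # True
```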

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Thu, Dec 10, 2020 at 10:15:42AM +0100, Stephane Bortzmeyer wrote:

Hi folks!

> On Wed, Dec 09, 2020 at 10:25:40PM -0500,
>  Sean Conner <sean at conman.org> wrote
>  a message of 53 lines which said:
>
> >   And if it's so simple, why not do it? But I get it, you'd rather wait
> > until a yeah/nay decision is made.
>
> It is reasonable to discuss it first because we need a *standard* way
> of doing it. Clients and servers must agree or there will be no
> interoperability.

I have read most of the messages in this thread, i would just say that
one  of  the  problems  with  WWW is  that  browser  are  getting  not
manageable by a single user or a hobbyist programmer.

This  issue  lead to  centralization  (see  chromium and  the  company
behind) as most of us can easily see.

Adding  more complexity  and more  responsibility to  software authors
will shrink diversity in the gemini software landscape.

Of  course i  talk here  as  a client  author  here but  in the  niche
language i  chose (common lisp) i  was forced to write  my URL parsing
procedure, adding i18n domains will require  a lot of work because the
third party library  (the only one) Jason McBrayer  wrote some message
above does  not implement  punycode->unicode conversion (if  i checked
the right library, thanks to the author anyway, better than nothing!).

Probably many of you just thinking "who cares about CL?", and probably
this is the mindset that lead to the mess it is now the web.

Internationalized  hostname  has  advantages  but  how  this  adding
complexity impact software author? Is this complexity needed?

I have no answer, just would want to express some of my concerns.

Bye!
C.

PS: i am not a native English speaker (as you can see :-) )

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 10, 2020, at 16:54, cage <cage-dev at twistfold.it> wrote:
> 
> Internationalized  hostname  has  advantages  but  how  this  adding
> complexity impact software author? Is this complexity needed?

Ah, yes, le charme discret du régionalisme.

It all boils down to the unmeasurable joy of Unicode ?

> I have no answer, just would want to express some of my concerns.

Archibald 'Harry' Tuttle had about 3 lines in Terry Gilliam's Brazil:

- Well, that's a pipe of a different color.
- Listen, this whole system of yours could be on fire and I couldn't even
  turn on the kitchen tap without filling out a twenty-seven B stroke six...
  bloody paperwork.
- Listen, kid, we're all in it together.

This sums it up in terms of retrofitting Unicode into ASCII:

https://www.youtube.com/watch?v=VRfoIyx8KfU

Perhaps Unicode should be abandoned altogether, and we all move back to
the original simplicity of ASCII.

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Thu, Dec 10, 2020 at 05:27:05PM +0100, Petite Abeille wrote:
>
>
> > On Dec 10, 2020, at 16:54, cage <cage-dev at twistfold.it> wrote:
> >
> > Internationalized  hostname  has  advantages  but  how  this  adding
> > complexity impact software author? Is this complexity needed?
>
> Ah, yes, le charme discret du régionalisme.
>
> It all boils down to the unmeasurable joy of Unicode ?
>
> > I have no answer, just would want to express some of my concerns.
>
> Archibald 'Harry' Tuttle had about 3 lines in Terry Gilliam's Brazil:

One of my favourite movie! I  love the "retrò style" terminal shown in
the office! :)

Bye!
C.

Link to individual message.

John Cowan <cowan (a) ccil.org>

On Thu, Dec 10, 2020 at 10:55 AM cage <cage-dev at twistfold.it> wrote:


> I have read most of the messages in this thread, i would just say that
> one  of  the  problems  with  WWW is  that  browser  are  getting  not
> manageable by a single user or a hobbyist programmer.
>

This is not strictly a Web issue: it is a DNS issue and affects all
protocols, including FTP, email, Gopher, etc.

> Probably many of you just thinking "who cares about CL?"
>

As a Schemer, I definitely do care about it.


> Internationalized  hostname  has  advantages  but  how  this  adding
> complexity impact software author? Is this complexity needed?
>

It's a balancing act between the needs of software authors and the needs of
content authors.  If Gemini succeeds, the latter will be much more common.
I think that internationalized link lines (which will have to become part
of the definition of text/gemini) are very important to authors.  Whether
clients accept IRIs in the address bar (or equivalent) is up to the client
author.  And I very strongly feel that changing the Gemini *protocol* to
pass IRIs serves nobody and shouldn't even be considered.


> PS: i am not a native English speaker (as you can see :-) )
>

Good.  As the saying is, if you want to know if there is antisemitism in a
particular place, ask the Jews who live there.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
I now introduce Professor Smullyan, who will prove to you that either
he doesn't exist or you don't exist, but you won't know which.
                               --Melvin Fitting

Link to individual message.

John Cowan <cowan (a) ccil.org>

On Thu, Dec 10, 2020 at 11:27 AM Petite Abeille <petite.abeille at gmail.com>
wrote:


> It all boils down to the unmeasurable joy of Unicode ?
>

As someone who is intimately familiar with i18n in the pre-Unicode era, I
can say that things were 100 times worse then.  Unicode is flawed because
it had to compromise with existing encodings, which is why we need
normalization.  But without that compromise (which permits 1-1
convertibility from almost all encodings to and from Unicode), it would
never have been so widely adopted.


> Perhaps Unicode should be abandoned altogether, and we all move back to
> the original simplicity of ASCII.


Perhaps we should abandon all modern languages and just use Latin on the
Internet.

John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
Gules six bars argent on a canton azure 50 mullets argent
six five six five six five six five and six
   --blazoning the U.S. flag <http://web.meson.org/blazonserver>

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Wed, Dec 09, 2020 at 10:25:40PM -0500,
>  Sean Conner <sean at conman.org> wrote 
>  a message of 53 lines which said:
> 
> >   And if it's so simple, why not do it? But I get it, you'd rather wait
> > until a yeah/nay decision is made.
> 
> It is reasonable to discuss it first because we need a *standard* way
> of doing it. Clients and servers must agree or there will be no
> interoperability.

  Okay,  Here's a IRI:

	gemini://café.mozz.us/files/?????.txt

  Please specify what a client and server MUST do to properly handle this.

  -spc

Link to individual message.

Gary Johnson <lambdatronic (a) disroot.org>

Sean Conner <sean at conman.org> writes:

>   Okay,  Here's a IRI:
>
> 	gemini://café.mozz.us/files/?????.txt
>
>   Please specify what a client and server MUST do to properly handle this.

Well, if I'm following all of these conversations correctly to date, I
believe the procedure looks like this:

1. Punycode the hostname.

2. Percent-encode reserved characters and non-US-ASCII characters in the
   path, query, and fragment components.

3. Make a DNS query with the punycoded hostname.

4. Send the punycode + percent-encoded URI as the request to the Gemini
   server.

5. The server parses the URI into scheme, host, port, path, query, and
   fragment components and then percent-decodes the path, query, and
   fragment strings.

6. The parsed and decoded URI information can then either be used to
   perform a file retrieval, generate a directory listing, or run a CGI
   script, ultimately sending back a valid Gemini response to the
   client. Redirect responses should make sure to percent-encode the
   path, query, and fragment components of the redirected URI.

My Gemini server (Space Age) handles steps 5 and 6 as described here (as
I suspect most Gemini servers do). Clients should already be performing
step 2 as per the Gemini spec.

I suspect the missing piece of the puzzle here is *just* having client
authors implement steps 1, 3, and 4 (for some definition of "just"). I
don't think these client changes would require any changes to the
current Gemini spec.
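
The client-side part of the procedure above (steps 1, 2 and 4) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not code from any existing Gemini client. The function name `iri_to_uri`, the file name "été.txt" and the choice of safe characters are assumptions; IPv6 literal hosts and IRIs without a host are not handled, and whether the fragment should be sent at all is a separate question that this sketch does not settle.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI delimiters plus "%", so that sequences
# which are already percent-encoded are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    parts = urlsplit(iri)
    # Step 1: punycode the hostname (Python's codec follows IDNA 2003).
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Step 2: percent-encode spaces and non-ASCII characters as UTF-8.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("gemini://caf\u00e9.mozz.us/files/\u00e9t\u00e9.txt")
print(uri)  # gemini://xn--caf-dma.mozz.us/files/%C3%A9t%C3%A9.txt

# Step 4: a Gemini request is the URI followed by CRLF.
request = (uri + "\r\n").encode("ascii")
```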

There is also the open question of whether servers should convert
punycoded hostnames back into unicode hostnames for the purposes of
virtual hosting (either via SNI or post-handshake). Since at least one
poster has indicated that the widespread unevenness in DNS support for
unicode has led to the need to store A records in their punycoded form,
this suggests to me that virtual hosting may be performed most
universally by just directly matching the received punycoded domain
names.

Of course, YMMV.

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

Link to individual message.

John Cowan <cowan (a) ccil.org>

On Thu, Dec 10, 2020 at 8:12 PM Gary Johnson <lambdatronic at disroot.org>
wrote:

> 1. Punycode the hostname.
>

If there is one.  You can look for "//" on the left and the next "/" on the
right, so you don't need full parsing.

> 2. Percent-encode reserved characters and non-US-ASCII characters in the
>    path, query, and fragment components.
>

You don't want to escape the ASCII reserved characters, because they should
already be escaped.  Changing the path /foo/bar.gmi to %2Ffoo%2Fbar.gmi
would be Evil and Wrong.  If you really want that path, you have to encode
it yourself.
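
To illustrate the difference, here is a small Python sketch using `urllib.parse.quote`. The `safe` argument controls which reserved characters are left alone; escaping the path delimiter "/" yields %2F and changes what the path means.

```python
from urllib.parse import quote

path = "/foo/bar.gmi"

# Delimiters kept: the path still has two segments.
print(quote(path, safe="/"))  # /foo/bar.gmi

# Delimiters escaped: this is now one opaque segment, a different path.
print(quote(path, safe=""))   # %2Ffoo%2Fbar.gmi
```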

In addition, you can safely %-encode the whole IRI reference without
parsing it, since Punycode names are always safe.

2.5. If the IRI is a relative reference, resolve it against the URI of the
text/gemini file that contains it.
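
A sketch of step 2.5 in Python, under one assumption worth flagging: `urllib.parse.urljoin` only resolves relative references for schemes listed in its `uses_relative` and `uses_netloc` registries, and "gemini" is not among them by default, so the sketch registers it first. The name `resolve_link` and the example.org URIs are hypothetical.

```python
import urllib.parse

# Register the scheme; otherwise urljoin returns the link unchanged.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")


def resolve_link(base_uri: str, link: str) -> str:
    """Resolve a link line's target against the containing document's URI."""
    return urllib.parse.urljoin(base_uri, link)


base = "gemini://example.org/docs/index.gmi"
print(resolve_link(base, "page.gmi"))  # gemini://example.org/docs/page.gmi
print(resolve_link(base, "/top.gmi"))  # gemini://example.org/top.gmi
```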

> 3. Make a DNS query with the punycoded hostname.
>
> 4. Send the punycode + percent-encoded URI as the request to the Gemini
>    server.
>

Note that fragments must not be sent, so if there is one, chop it off.


> 5. The server parses the URI into scheme, host, port, path, query, and
>    fragment components and then percent-decodes the path, query, and
>    fragment strings.
>

Consequently, the server will not get a fragment string.  There would be no
need for fragment strings if they were understood on the server side;
they'd just be part of the path.

Whether it %-decodes or not is up to the server.  If it's serving a
conventional file system, then it needs to document whether it does such
decoding.  If it isn't, it can do whatever it wants to with the paths.


> 6. The parsed and decoded URI information can then either be used to
>    perform a file retrieval, generate a directory listing, or run a CGI
>    script, ultimately sending back a valid Gemini response to the
>    client. Redirect responses should make sure to percent-encode the
>    path, query, and fragment components of the redirected URI.
>

Except not the fragment.

> Since at least one
> poster has indicated that the widespread unevenness in DNS support for
> unicode has led to the need to store A records in their punycoded form,
>

Indeed, I don't think that any registrar using the standard DNS root will
even register non-punycoded names.  MS Active Directory DNS servers are
another story.



John Cowan          http://vrici.lojban.org/~cowan        cowan at ccil.org
This great college [Trinity], of this ancient university [Cambridge],
has seen some strange sights. It has seen Wordsworth drunk and Porson
sober. And here am I, a better poet than Porson, and a better scholar
than Wordsworth, somewhere betwixt and between.  --A.E. Housman

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

> 4.  Send the punycode + percent-encoded URI as the request to the Gemini
>     server.

This probably makes sense since IRIs aren't being used. I originally
advocated for sending Unicode to the server, for the domain only, but that's
just a weird mix isn't it.

Good server software should be taking the admin's hostname input (from config)
and punycoding it though, so that the admin can enter the Unicode domain name
and not have to worry. Obviously this is outside of the spec, but I think
it's a good thing to implement.
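
One way a server might do this is sketched below in Python. It is an illustration only: the config layout, the document-root paths and the names `build_vhost_table` and `find_root` are all hypothetical. The idea is to convert each configured hostname to its punycoded form once at startup, then match incoming hostnames against that table directly.

```python
from typing import Dict, Optional


def to_ace(hostname: str) -> str:
    """Punycode a hostname and lower-case it for comparison."""
    return hostname.encode("idna").decode("ascii").lower()


def build_vhost_table(configured: Dict[str, str]) -> Dict[str, str]:
    """Map the punycoded form of each configured name to its root."""
    return {to_ace(name): root for name, root in configured.items()}


def find_root(table: Dict[str, str], requested_host: str) -> Optional[str]:
    """Look up the hostname received from the client (already punycoded)."""
    return table.get(requested_host.lower())


# The admin writes the Unicode name; the client sends the punycoded one.
table = build_vhost_table({"caf\u00e9.mozz.us": "/srv/gemini/cafe"})
print(find_root(table, "xn--caf-dma.mozz.us"))  # /srv/gemini/cafe
```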


makeworld

Link to individual message.

William Orr <will (a) worrbase.com>

Before percent-encoding/punycoding, the URI needs to be NFC normalized.

As a matter of course, I'd say that servers should normalize the path 
before doing fs lookups/proxying it as well.
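
A short Python sketch of why normalization matters on the server side. The same visible name can arrive as a precomposed "é" (U+00E9) or as "e" followed by a combining acute accent (U+0301); without NFC the two requests would map to different file names. The function name `normalized_path` and the example path are hypothetical.

```python
import unicodedata
from urllib.parse import unquote


def normalized_path(raw_path: str) -> str:
    """Percent-decode a request path, then apply NFC normalization."""
    return unicodedata.normalize("NFC", unquote(raw_path))


precomposed = normalized_path("/caf%C3%A9.gmi")   # U+00E9
decomposed = normalized_path("/cafe%CC%81.gmi")   # "e" + U+0301
print(precomposed == decomposed)  # True
```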

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 04:54:52PM +0100,
 cage <cage-dev at twistfold.it> wrote 
 a message of 44 lines which said:

> Adding more complexity and more responsibility to software authors
> will shrink diversity in the gemini software landscape.

The idea is that it will not be done by the programmer (Unicode is
complicated) of the Gemini client but mostly by the libraries she or
he uses. It is the same with TLS: TLS is very complicated but most
people do not program it by themselves (and rightly so).

> Internationalized  hostname  has  advantages  but  how  this  adding
> complexity impact software author? Is this complexity needed?

Since I write Gemini clients, I have sympathy for this point of
view. However, let me quote RFC 8890

=> gemini://gemini.bortzmeyer.org/rfc-mirror/rfc8890.txt

4.5.  Deprioritizing Internal Needs

   There are several needs that are very visible to us as specification
   authors but should explicitly not be prioritized over the needs of
   end users.

   These include convenience for document editors, IETF process matters,
   and "architectural purity" for its own sake.


=> https://www.w3.org/TR/html-design-principles/#priority-of-constituencies See also this statement by W3C

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 04:09:19PM -0500,
 John Cowan <cowan at ccil.org> wrote 
 a message of 76 lines which said:

> Unicode is flawed because it had to compromise with existing
> encodings,

And also because human scripts (and languages) are a mess and Unicode
chose, a long time ago, to deal with it instead of whining "why
can't they just speak English?"

> Perhaps we should abandon all modern languages and just use Latin on the
> Internet.

=> https://en.wikipedia.org/wiki/Lojban No, Lojban

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 08:12:04PM -0500,
 Gary Johnson <lambdatronic at disroot.org> wrote 
 a message of 69 lines which said:

> 1. Punycode the hostname.

Not always, for the reasons explained in RFC 6055. To summarize: the
application does not always know which name resolution system will be used.

gemini://gemini.bortzmeyer.org/rfc-mirror/rfc6055.txt

> 3. Make a DNS query with the punycoded hostname.

Most applications don't do DNS queries, both because DNS is
complicated and because there are other name resolution systems. They
call a system routine (getaddrinfo() in C) to do the resolution.

> There is also the open question of whether servers should convert
> punycoded hostnames back into unicode hostnames for the purposes of
> virtual hosting (either via SNI or post-handshake). Since at least one
> poster has indicated that the widespread unevenness in DNS support for
> unicode has led to the need to store A records in their punycoded form,
> this suggests to me that virtual hosting may be performed most
> universally by just directly matching the received punycoded domain
> names.

This is what Apache and Nginx do in the Web world (which does not mean
they are right).

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Thu, Dec 10, 2020 at 09:45:55PM -0500,
 John Cowan <cowan at ccil.org> wrote 
 a message of 177 lines which said:

> Indeed, I don't think that any registrar using the standard DNS root
> will even register non-punycoded names.

Counter-example about a similar case: the registry of .ws accepts
names with emojis, which are forbidden by the standard (because they
are symbols, not letters). So, anything can happen.

=> https://www.worldsite.ws/idn/emoji.dhtml?sponsor=index.dhtml And they boast about it

(Also, not all name registration goes through a registrar: when I edit
bortzmeyer.org, I can add what I want without any intermediary.)

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Fri, Dec 11, 2020 at 09:57:02AM +0100, Stephane Bortzmeyer wrote:

Hi!

> On Thu, Dec 10, 2020 at 04:54:52PM +0100,
>  cage <cage-dev at twistfold.it> wrote
>  a message of 44 lines which said:
>
> > Adding more complexity and more responsibility to software authors
> > will shrink diversity in the gemini software landscape.
>
> The idea is that it will not be done by the programmer (Unicode is
> complicated) of the Gemini client but mostly by the libraries she or
> he uses.

If such library does exists, otherwise  more and more works is needed,
and this will exclude the programmers  that have no time or skills (to
me very likely the former) to do the all the work.

> It is the same with TLS: TLS is very complicated but most
> people do not program it by themselves (and rightly so).

This  is matter  of  balance,  the advantages  of  TLS  are worth  the
complexity added, IDN? I am not sure.

> > Internationalized  hostname  has  advantages  but  how  this  adding
> > complexity impact software author? Is this complexity needed?
>
> Since I write Gemini clients, I have sympathy for this point of
> view. However, let me quote RFC 8890

OK the point is  valid to me, for TLS, but not for  IDN. Anyway i have
the impression i am in a minority  here, and i think i should start to
do a minimal wrapping of libidn at this point. :-)

Bye!
C.

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Thu, Dec 10, 2020 at 03:02:44PM -0500, John Cowan wrote:

Hi!

> On Thu, Dec 10, 2020 at 10:55 AM cage <cage-dev at twistfold.it> wrote:
>
>
> > I have read most of the messages in this thread, i would just say that
> > one  of  the  problems  with  WWW is  that  browser  are  getting  not
> > manageable by a single user or a hobbyist programmer.
> >
>
> This is not strictly a Web issue: it is a DNS issue and affects all
> protocols, including FTP, email, Gopher, etc.

Correct, i meant  that this is a client issue  deriving from debatable
choices.

> > Probably many of you just thinking "who cares about CL?"
> >
>
> As a Schemer, I definitely do care about it.

[OT] Nice! Do  yo have a preferred  dialect? I like Guile a  lot but i
fear i  end missing CLOS  (GOOPS is  not the same,  unfortunately) and
condition system.

>
> > Internationalized  hostname  has  advantages  but  how  this  adding
> > complexity impact software author? Is this complexity needed?
> >
>
> It's a balancing act between the needs of software authors and the needs of
> content authors.  If Gemini succeeds, the latter will be much more common.
> I think that internationalized link lines (which will have to become part
> of the definition of text/gemini) are very important to authors.  Whether
> clients accept IRIs in the address bar (or equivalent) is up to the client
> author.  And I very strongly feel that changing the Gemini *protocol* to
> pass IRIs serves nobody and shouldn't even be considered.

I agree!  My only concerns is  i have the impression  that client that
will  not supports  IRI will  be second  class citizen  in the  gemini
space, and they will die slowly.

So i think that IRI will be a de facto standard. :/

Bye!
C.

Link to individual message.

Jason McBrayer <jmcbray (a) carcosa.net>

colecmac at protonmail.com writes:

> Good server software should be taking the admin's hostname input (from config)
> and punycoding it though, so that the admin can enter the Unicode domain name
> and not have to worry. Obviously this is outside of the spec, but I think
> it's a good thing to implement.

To implement, *and* to document; if not in the spec, then in a 'best
practices for implementers' document.

-- 
Jason McBrayer      | "Strange is the night where black stars rise,
jmcbray at carcosa.net | and strange moons circle through the skies,
                    | but stranger still is lost Carcosa."
                    | -- Robert W. Chambers, The King in Yellow

Link to individual message.

A. E. Spencer-Reed <easrng (a) gmail.com>

On Fri, Dec 11, 2020 at 4:13 AM Stephane Bortzmeyer
<stephane at sources.org> wrote:
> Counter-example about a similar case: the registry of .ws accepts
> names with emojis, which are forbidden by the standard (because they
> are symbols, not letters). So, anything can happen.
However, those names are still punycoded. Most registries will not
accept punycoded emojis, but .ws does.

Link to individual message.

Gary Johnson <lambdatronic (a) disroot.org>

John Cowan <cowan at ccil.org> writes:

>> 2. Percent-encode reserved characters and non-US-ASCII characters in the
>>    path, query, and fragment components.

> You don't want to escape the ASCII reserved characters, because they should
> already be escaped.  Changing the path /foo/bar.gmi to %2Ffoo%2Fbar.gmi
> would be Evil and Wrong.  If you really want that path, you have to encode
> it yourself.

Yes, that is quite right. I suppose we are using a different
interpretation of the phrase "reserved characters" here. For clarity, I
meant characters such as those in the string " ?#", which are either
forbidden (when unencoded) within the path, query, and fragment
components or are used to delimit them.

> 2.5. If the IRI is a relative reference, resolve it against the URI of the
> text/gemini file that contains it.

Yep.

>> 4. Send the punycode + percent-encoded URI as the request to the Gemini
>>    server.
>
> Note that fragments must not be sent, so if there is one, chop it off.

I'm not sure that is the case here. To quote the Gemini spec:

========================================================================
1.2 Gemini URI scheme

Resources hosted via Gemini are identified using URIs with the scheme
"gemini". This scheme is syntactically compatible with the generic URI
syntax defined in RFC 3986, but does not support all components of the
generic syntax. In particular, the authority component is allowed and
required, but its userinfo subcomponent is NOT allowed. The host
subcomponent is required. The port subcomponent is optional, with a
default value of 1965. The path, query and fragment components are
allowed and have no special meanings beyond those defined by the generic
syntax. Spaces in gemini URIs should be encoded as %20, not +.
========================================================================

Please note the text about fragment components being allowed. I'm not
currently aware of any good uses for them in Gemini, but the spec
supports them, so I've included that support in my server.

>> 5. The server parses the URI into scheme, host, port, path, query, and
>>    fragment components and then percent-decodes the path, query, and
>>    fragment strings.
>
> Consequently, the server will not get a fragment string.  There would be no
> need for fragment strings if they were understood on the server side;
> they'd just be part of the path.

See above.

>>  6. The parsed and decoded URI information can then either be used to
>>     perform a file retrieval, generate a directory listing, or run a
>>     CGI script, ultimately sending back a valid Gemini response to
>>     the client. Redirect responses should make sure to percent-encode
>>     the path, query, and fragment components of the redirected URI.
>>
>
> Except not the fragment.

Again, see above.

Yada yada...spec compliance...yada yada.

Happy hacking,
  Gary

-- 
GPG Key ID: 7BC158ED
Use `gpg --search-keys lambdatronic' to find me
Protect yourself from surveillance: https://emailselfdefense.fsf.org
=======================================================================
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Why is HTML email a security nightmare? See https://useplaintext.email/

Please avoid sending me MS-Office attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

On Friday, December 11, 2020 4:58 AM, cage <cage-dev at twistfold.it> wrote:

> I agree! My only concerns is i have the impression that client that
> will not supports IRI will be second class citizen in the gemini
> space, and they will die slowly.
>
> So i think that IRI will be a de facto standard. :/


I do not think this will happen, and if it starts to happen I will
fight against it. I don't think the idea of "de facto standards" fits within
the Gemini ethos at all. It's not supposed to be extensible and clients
aren't supposed go off and do random things while others have to decide
what to do and play catch-up. That is how the Web grew and became more complex,
and it's why we have only a few browsers today.

The ecosystem benefits when we all just stick to the standard, with the perhaps
obvious exception of demos and toys.

Stay united, Gemini!

Cheers,
makeworld

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Stephane Bortzmeyer once stated:
> On Thu, Dec 10, 2020 at 08:12:04PM -0500,
>  Gary Johnson <lambdatronic at disroot.org> wrote 
>  a message of 69 lines which said:
> 
> > 1. Punycode the hostname.
> 
> Not always, for the reasons explained in RFC 6055. To summarize: the
> application does not always know which name resolution system will be used.

  Yes, this is why I hate the incessant talking.  If you bothered to try it
on a few systems, you may have encountered *not* encoding with punycode

and they all failed to look up "café.mozz.us" (yes, via getaddrinfo() even).
They all worked when I looked up "xn--caf-dma.mozz.us".
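
The lookup that worked can be sketched in Python as below: punycode the name first, then hand it to the system resolver through `socket.getaddrinfo`. This is an illustration, not the test that was actually run; the function name `resolve` is hypothetical and 1965 is the default Gemini port.

```python
import socket


def resolve(hostname: str, port: int = 1965):
    """Punycode the hostname, then ask the system resolver for addresses."""
    ace = hostname.encode("idna").decode("ascii")
    infos = socket.getaddrinfo(ace, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4] for info in infos]


# A numeric address needs no DNS query, so this runs offline.
print(resolve("127.0.0.1"))
```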

  So now that I've injected some ugly reality in to a nice theoretical
discussion, what's next?

  -spc

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great colecmac at protonmail.com once stated:
> On Friday, December 11, 2020 4:58 AM, cage <cage-dev at twistfold.it> wrote:
> 
> > I agree! My only concerns is i have the impression that client that
> > will not supports IRI will be second class citizen in the gemini
> > space, and they will die slowly.
> >
> > So i think that IRI will be a de facto standard. :/
> 
> I do not think this will happen, and if it starts to happen I will
> fight against it. 

  There's a reason why UTF-8 was selected as the default character set for
text/gemini, and one of those is to allow other people than English speakers
a means of expressing themselves [1].  I don't think it's entirely
unreasonable to expect such a person to use Unicode for both domain name and
filenames [2].  Yes, tooling could be made to handle "canonicalizing" links
[3] but why not look into allowing IRIs?  Without an attempt at it, it would
be difficult to know what would work, what doesn't and where the difficulties,
if any, lie.  *That's* why I'm so insistent on coding up
"proof-of-concepts".  Just decreeing "this is how it shall be done" rarely
works out well [4].  And decreeing "this shall NOT be done" could put off
non-technical, non-English speaking people. [5]

> I don't think the idea of "de facto standards" fits within the Gemini
> ethos at all. It's not supposed to be extensible and clients aren't
> supposed go off and do random things while others have to decide what to
> do and play catch-up. That is how the Web grew and became more complex,
> and it's why we have only a few browsers today.

  And this is working out if the specification should be amended to allow
IRIs, and if not, at least have a justification.

> The ecosystem benefits when we all just stick to the standard, with the
> perhaps obvious exception of demos and toys.

  One more point of reference.  The Gopher RFC (RFC-1436) states the use of
ISO-8859-1 for a character set.  Is it wrong, then, for gopher servers to
serve up UTF-8 documents even though it's not standard?  Yes, gopher is not
Gemini, but UTF-8 does seem to be a modern "de facto standard" in
gopherspace.

  -spc

[1]	For example, gemini://blekksprut.net/

[2]	Otherwise, punycode wouldn't exist.

[3]	Conversion from IRI to URI, with Unicode normalization,  prior to
	publication.

[4]	Such as X.200---lovingly developed and standardized but no one used
	it.  Or Xanadu.  Over 60 years of design work and still not working.

[5]	And I'm saying this being a thoroughly American mutt that speaks only
	English.

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great cage once stated:
> On Fri, Dec 11, 2020 at 09:57:02AM +0100, Stephane Bortzmeyer wrote:
> 
> > Since I write Gemini clients, I have sympathy for this point of
> > view. However, let me quote RFC 8890
> 
> OK the point is  valid to me, for TLS, but not for  IDN. Anyway i have
> the impression i am in a minority  here, and i think i should start to
> do a minimal wrapping of libidn at this point. :-)

  Here's the code I wrote to wrap libidn:

	https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c

It's geared for Lua, but the code itself is in C, and it should be pretty
easy to see what is going on.

  -spc

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Fri, Dec 11, 2020 at 06:49:17PM -0500, Sean Conner wrote:

[...]

>
>   Here's the code I wrote to wrap libidn:
>
> 	https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c
>
> It's geared for Lua, but the code itself is in C, but it should be pretty
> easy to see what is going on.

Thank you Sean! I am, in fact, starting to wrap libidn (actually
libidn2), and so far it seems that i have a working unicode->ascii
function. Of course i will be happy to share the results and maybe, if
some people are interested (and if i succeed! :)), i could extract a
library from this code (it is integrated in the client at the moment).

Bye!
C.

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Fri, Dec 11, 2020 at 08:49:45PM +0000, colecmac at protonmail.com wrote:

[...]

> >
> > So i think that IRI will be a de facto standard. :/
>
>
> I do not think this will happen, and if it starts to happen I will
> fight against it. I don't think the idea of "de facto standards" fits within
> the Gemini ethos at all. It's not supposed to be extensible and clients
> aren't supposed to go off and do random things while others have to decide
> what to do and play catch-up. That is how the Web grew and became more complex,
> and it's why we have only a few browsers today.

I am with you on this, but i think the only way to keep developers from
going their own way is to clarify issues in the specs as much as possible.

> The ecosystem benefits when we all just stick to the standard, with the perhaps
> obvious exception of demos and toys.

Totally agree of course! I am trying to do so! :)

> Stay united, Gemini!

:) ?

Bye!
C.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 12, 2020, at 00:43, Sean Conner <sean at conman.org> wrote:
> 
> There's a reason why UTF-8 was selected as the default character set for
> text/gemini, and one of those is to allow other people than English speakers
> a means of expressing themselves [1].  I don't think it's entirely
> unreasonable to expect such a person to use Unicode for both domain name and
> filenames [2]

Yes, agreed. People should be able to express themselves in the most
idiomatic -and frictionless- way they see fit.

It's a moral imperative -and duty- for Gemini to make it so.

The year is 2020, no more easy and lazy excuses. This is not a technical 
choice, but a moral one. 

Timely article in The Economist:

Accent discrimination betrays a small mind
https://www.economist.com/books-and-arts/2020/12/12/accent-discrimination-betrays-a-small-mind

History will judge you: be on the right side.

Do the right thing Solderpunk.

Link to individual message.

Philip Linde <linde.philip (a) gmail.com>

On Fri, 11 Dec 2020 18:14:47 -0500
Sean Conner <sean at conman.org> wrote:

> I know, because I tried on a few systems I have access to,
> and they all failed to look up "café.mozz.us" (yes, via getaddrinfo() even).
> They all worked when I looked up "xn--caf-dma.mozz.us".

I don't think Stephane means different systems in the sense of
different computers, but different systems as in different name
resolution systems. On my computer, for example, there are at least
three sources for names: DNS, mDNS and /etc/hosts. getaddrinfo() can
resolve using any of these systems via the name service switch, but IDN
only concerns DNS names.

café.mozz.us in your tests probably always resolved via DNS, but if you
had a café.local mDNS name or "café" in your hosts file you might not
be able to use IDN. I don't know about the hosts file, but mDNS for
example uses UTF-8 encoded names directly.
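
To make the distinction concrete, here is a minimal Python sketch of a
client choosing the form of a name according to the resolution system it
expects to be used. The helper name lookup_name and the rule "names ending
in .local go to mDNS, everything else goes to DNS" are assumptions made for
illustration only; a real name service switch configuration can route names
differently, and the hosts file case is not covered.

```python
def lookup_name(hostname: str) -> str:
    """Return the form of `hostname` to hand to the resolver.

    Hypothetical helper: assumes *.local is resolved via mDNS and
    everything else via DNS.
    """
    labels = hostname.rstrip(".").split(".")
    if labels[-1].lower() == "local":
        # mDNS carries UTF-8 names directly, so leave the name untouched.
        return hostname
    # DNS names are converted to their ASCII (Punycode, "xn--") form.
    # Python's built-in "idna" codec implements IDNA 2003.
    return hostname.encode("idna").decode("ascii")


if __name__ == "__main__":
    for name in ("caf\u00e9.mozz.us", "caf\u00e9.local"):
        print(name, "->", lookup_name(name))
```

With this rule "café.mozz.us" becomes "xn--caf-dma.mozz.us", while
"café.local" is passed through unchanged.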

-- 
Philip

Link to individual message.

Stephane Bortzmeyer <stephane (a) sources.org>

On Mon, Dec 14, 2020 at 11:46:49AM +0100,
 Philip Linde <linde.philip at gmail.com> wrote 
 a message of 45 lines which said:

> but different systems as in different name resolution systems.

Yes.

> café.mozz.us in your tests probably always resolved via DNS, but if
> you had a café.local mDNS name or "café" in your hosts file you
> might not be able to use IDN. I don't know about the hosts file, but
> mDNS for example uses UTF-8 encoded names directly.

I just tested with a Debian box and it seems getaddrinfo (both from a
Python program and from a C one, ping), requires the name in
/etc/hosts to be present in Punycode form (A-label).

The good news is that the Python program does not have to do
punycoding itself, it is handled automatically by the standard
library.
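
For readers who want to see the conversion itself, here is a small sketch
using Python's built-in "idna" codec. The function name to_a_label is made
up for this example. Note that this codec implements the older IDNA 2003
rules, so it may differ from libidn2 (IDNA 2008) for some names.

```python
def to_a_label(hostname: str) -> str:
    """Return the ASCII (Punycode, A-label) form of a hostname.

    Illustrative helper; relies only on the standard "idna" codec,
    which applies nameprep and Punycode label by label.
    """
    return hostname.encode("idna").decode("ascii")


if __name__ == "__main__":
    # "caf\u00e9" is "café" written with an escape to avoid encoding issues.
    print(to_a_label("caf\u00e9.mozz.us"))  # xn--caf-dma.mozz.us
```

Plain ASCII names pass through unchanged, so the same call can be applied
to every hostname before a lookup.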

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Sat, Dec 12, 2020 at 12:10:48PM +0100, cage wrote:
> On Fri, Dec 11, 2020 at 06:49:17PM -0500, Sean Conner wrote:
>
> [...]
>
> >
> >   Here's the code I wrote to wrap libidn:
> >
> > 	https://github.com/spc476/lua-conmanorg/blob/master/src/idn.c
> >
> > It's geared for Lua, but the code itself is in C, but it should be pretty
> > easy to see what is going on.
>
> Thank  you Sean!   I am,  in fact  starting to  wrap libidn  (actually
> libidn2) so far seems that i got a working unicode->ascii function. Of
> course i will be happy to share  the results and maybe, if some people
> are going to be  interested (and if i succeed!  :))  i could extract a
> library from this code (it is integrated in the client at moment).

FWIW i managed to switch from URI to IRI in my client; fortunately i
was able to reuse most of the URL parser. For Punycode i wrapped a C
library.

Not sure if i did everything right, but i have met no regressions so far.

If other lispers (CL) are interested, we could arrange a library from
this code.

Bye!
C.

Link to individual message.

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 14, 2020, at 16:09, cage <cage-dev at twistfold.it> wrote:
> 
> FWIW i managed to  switch from URI to IRI in  my client,

Bravo! :)

"Don't worry about what anybody else is going to do. The best way to 
predict the future is to invent it."
-- Alan Kay

Link to individual message.

cage <cage-dev (a) twistfold.it>

On Mon, Dec 14, 2020 at 04:37:19PM +0100, Petite Abeille wrote:
>
>
> > On Dec 14, 2020, at 16:09, cage <cage-dev at twistfold.it> wrote:
> >
> > FWIW i managed to  switch from URI to IRI in  my client,
>
> Bravo! :)

Thank you! :)

Honestly, most of the hard work has been done by libidn and two
excellent lisp libraries; sometimes people complain that CL is full of
half-baked libraries, but there are some high quality ones too. The FFI
and the parser generator are two that excel, in my opinion. Also, the
author of the latter actually helped me spot a bug in the code. :)

Bye!
C.

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Philip Linde once stated:
> On Fri, 11 Dec 2020 18:14:47 -0500
> Sean Conner <sean at conman.org> wrote:
> 
> > I know, because I tried on a few systems I have access to,
> > and they all failed to look up "café.mozz.us" (yes, via getaddrinfo() even).
> > They all worked when I looked up "xn--caf-dma.mozz.us".
> 
> I don't think Stephane means different systems in the sense of
> different computers, but different systems as in different name
> resolution systems. 

  Yes, I understand that, but on the systems I used, all FAILED to resolve
"café.mozz.us".  Does that mean the client just gives up and says "domain
not found" because local configuration doesn't work with UTF-8 domain
names?  That, to me, sounds like what Stephane is advocating for when they
say "no conversion to punycode".  It's wonderful that Stephane's language
du jour will apparently handle it for the user, but are the rest of us out
of luck? [1]  THIS is what I'm asking about.

  -spc

[1]	And a response of "here's a nickel, get yourself a real computer
	language" is NOT a valid response.

Link to individual message.

colecmac@protonmail.com <colecmac (a) protonmail.com>

Hello,

I didn't see any email mentioning this so I thought I'd share the
link here. Lagrange[1] has gone ahead with IDN support, and the details
are found in this post[2] by skyjake.

The relevant points are as follows.

> * The full URL is NFC normalized before sending it to a server.
> * Domain names with non-ASCII characters are encoded to Punycode before
>   doing a DNS lookup. The Punycode version of the domain name is sent to
>   the server in the request URL, and also used for verifying the server
>   certificate.

This is what I plan on doing in Amfora as well. I will defer to Solderpunk's
judgement, which is coming[3], but until then that's my plan. The only
difference is that I was planning on allowing both punycoded domains and IDNs
in certs, to simplify things for sysadmins. But if Lagrange isn't allowing
it, then maybe I shouldn't... this is quickly approaching "de facto standard"
territory.

For now I will err on the permissive side in that case, allowing both, but this
is something I'd like to hear from Solderpunk on.

gemget will do the same, as it uses the go-gemini[4] library as well.
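
As a rough illustration of the two quoted points, here is a minimal Python
sketch. It is not Lagrange's or Amfora's code: the function name
prepare_request_url is hypothetical, the "idna" codec follows IDNA 2003,
and certificate verification is left out entirely. Any userinfo part is
left alone because the host is only rewritten when it is non-ASCII.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit


def prepare_request_url(url: str) -> str:
    """NFC-normalize a URL and Punycode-encode a non-ASCII host.

    Hypothetical helper sketching the behaviour described above.
    """
    # Step 1: NFC-normalize the full URL.
    url = unicodedata.normalize("NFC", url)
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.isascii():
        # Nothing to encode (this also leaves IPv6 literals untouched).
        return url
    # Step 2: encode the domain name to its ASCII "xn--" form. The same
    # string would be used for the DNS lookup and the request line.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(prepare_request_url("gemini://caf\u00e9.mozz.us/"))
```

Only the host is converted here; the path and query stay in their
normalized Unicode form.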


1: https://gmi.skyjake.fi/lagrange/
2: gemini://skyjake.fi/gemlog/2020-12_idns-in-lagrange.gmi
3: gemini://gemini.circumlunar.space/~solderpunk/pikkulog/2020-12.gmi
4: https://github.com/makeworld-the-better-one/go-gemini/issues/10


Cheers,
makeworld

Link to individual message.
