Recently, after reading a post in French about international domain names, I was wondering how they would work with my Gemini wiki, Phoebe. Today, I decided to give it a try. My laptop is called “melanobombus” because it’s black and I like bumblebees, so I wanted to give it the alias “mélanobombus”.
The first thing to install was a punycode converter. I used “idn”. The punycode for “mélanobombus” is “xn--mlanobombus-bbb”. So this is how my “/etc/hosts” begins:
127.0.0.1 localhost
127.0.1.1 melanobombus
127.0.1.1 xn--mlanobombus-bbb
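For anyone without “idn” at hand, the same conversion can be sketched in a few lines of Python (used here purely for illustration; Phoebe itself is Perl). This assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules; for this particular name it gives the same result as above.

```python
# Convert a Unicode hostname to its ASCII (punycode) form and back.
# Assumption: Python's built-in "idna" codec (IDNA 2003) is good enough here.
hostname = "mélanobombus"

ascii_name = hostname.encode("idna").decode("ascii")
print(ascii_name)  # xn--mlanobombus-bbb

# The reverse direction recovers the original name.
round_trip = ascii_name.encode("ascii").decode("idna")
print(round_trip)  # mélanobombus
```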
Then I started Phoebe with that hostname:
phoebe --host=xn--mlanobombus-bbb
Since I trust that Firefox knows how to handle international domain names, I started by pointing it at “https://mélanobombus:1965/” – and it worked. 😁
Using my super simple command line client did not work. When I asked it to connect to “gemini://mélanobombus/” it broke with an ugly error message. When I asked Elpher to connect to the same address, it didn’t work either: timeout. Lagrange also reported a network failure.
OK. At least now I know that this is a client problem because Firefox does the right thing.
#Gemini #Phoebe
(Please contact me if you want to remove your comment.)
⁂
In Perl, I need to do the following to get the same punycode from a URL provided on the command line:
use Modern::Perl;
use Encode qw(decode_utf8);
use IRI;
use Net::IDN::Encode qw(domain_to_ascii);
my ($url) = @ARGV;
my $iri = IRI->new(value => $url);
say domain_to_ascii(decode_utf8 $iri->host);
But then the client still sends the original IRI to the server which then replies that it won’t proxy, unlike the result of using Firefox.
– 2020-12-10 23:25 UTC
---
Ah, it’s more complicated, of course. HTTP doesn’t actually send a URI! It sends something like this:
GET /some/path HTTP/1.1
Host: xn--mlanobombus-bbb
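To make the difference concrete, here is a small Python sketch that builds the bytes each protocol puts on the wire for the same resource. In HTTP the hostname travels in its own header, separate from the path; in Gemini the request is a single line containing the whole URL. The function names are made up for this illustration and nothing is actually sent.

```python
# Build the request bytes for HTTP and for Gemini, without sending anything.
# Host and path are the examples from this page; the helper names are made up.

def http_request(host: str, path: str) -> bytes:
    # HTTP/1.1: the path goes in the request line, the host in its own header.
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")

def gemini_request(host: str, path: str) -> bytes:
    # Gemini: the whole absolute URL on one line, terminated by CRLF.
    return f"gemini://{host}{path}\r\n".encode("utf-8")

print(http_request("xn--mlanobombus-bbb", "/some/path"))
print(gemini_request("xn--mlanobombus-bbb", "/some/path"))
```

So a browser can convert the host to punycode for the header without touching the path, whereas a Gemini client has to decide how to encode the entire URL it sends.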
Well, I’m working on a branch where my simple command line client and Phoebe work together, at least. I feel like I owe this to my last name. In the previous millennium, I started to write Schroeder instead of Schröder because internationalisation was a big problem. This was before I had ever heard of Unicode and UTF-8.
– 2020-12-11 11:23 UTC
---
Oh my invisible friend... the changes required aren’t trivial. Still on it!
– 2020-12-11 12:25 UTC
---
Wow, I’ve been looking at the mailing list. Sooo much discussion! Three threads:
– 2020-12-11 16:22 UTC
---
Talked about it a bit on the Gemini IRC channel until I got angry. 😒
– 2020-12-11 18:02 UTC
---
OK, so I abandoned the international domain name (IDN) branch where the client sends gemini://東京.jp/ to the server because I thought it was stupid that the client could send gemini://東京.jp/ but not gemini://東京.jp/日本語. So then I went back to RFC 3987 and read the introduction to section 3, “Relationship between IRIs and URIs”:
IRIs are meant to replace URIs in identifying resources for protocols, formats, and software components that use a UCS-based character repertoire. These protocols and components may never need to use URIs directly, especially when the resource identifier is used simply for identification purposes. However, when the resource identifier is used for resource retrieval, it is in many cases necessary to determine the associated URI, because currently most retrieval mechanisms are only defined for URIs. In this case, IRIs can serve as presentation elements for URI protocol elements. An example would be an address bar in a Web user agent.
This seems ideal. Clients are free to use IRIs to communicate with their users: letting them enter an IRI like gemini://東京.jp/日本語 into the address bar and showing them such IRIs. But if these clients communicate with a Gemini server, they need to use URIs. They need to request something like this:
gemini://xn--1lqs71d.jp:43343/page/%E6%97%A5%E6%9C%AC%E8%AA%9E
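The mapping from IRI to URI can be sketched like this in Python. It is a simplified illustration under a few assumptions: the host is converted with Python's built-in "idna" codec (IDNA 2003), everything else is percent-encoded as UTF-8, and userinfo is ignored. The function name iri_to_uri is my own and not part of Phoebe or any client.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI: punycode for the host, percent-encoding elsewhere."""
    parts = urlsplit(iri)
    # Assumes the IRI has a host; userinfo is ignored in this sketch.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # "%" stays in the safe set so already-encoded input is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(iri_to_uri("gemini://東京.jp:43343/page/日本語"))
# gemini://xn--1lqs71d.jp:43343/page/%E6%97%A5%E6%9C%AC%E8%AA%9E
```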
This reduces the problem for Phoebe to a much smaller set of problems.
How does a server administrator start Phoebe such that it serves an international domain name (IDN)? I added code that converts the host name provided from the current locale (which is what the administrator is using in their shell) to Unicode, and converts that to punycode, and uses that to look up the IP addresses using getaddrinfo(3).
When deciding whether to serve a URL, Phoebe checks for the punycode representation of the host names: xn--1lqs71d.jp. Anything else is considered to be a proxy request and is most likely denied.
The part handling the percent encoded paths already exists and already works.
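The first two points can be sketched as follows. This is not Phoebe's Perl code, just a Python illustration of the idea; the function names and the explicit locale_encoding parameter are made up for the example, and the conversion again assumes Python's built-in "idna" codec.

```python
import socket
from urllib.parse import urlsplit

def host_to_ascii(raw: bytes, locale_encoding: str) -> str:
    """Decode a hostname as typed in the shell, then convert it to punycode."""
    unicode_name = raw.decode(locale_encoding)
    return unicode_name.encode("idna").decode("ascii")

def resolve(ascii_host: str, port: int = 1965):
    """Look up the addresses for the punycode name via getaddrinfo."""
    return socket.getaddrinfo(ascii_host, port, type=socket.SOCK_STREAM)

def is_served(url: str, served_hosts: set) -> bool:
    """True only if the URL names one of our hosts in its punycode form."""
    return urlsplit(url).hostname in served_hosts

# The same name arrives as different bytes depending on the locale.
served = {host_to_ascii("東京.jp".encode("utf-8"), "utf-8")}
print(served)

print(is_served("gemini://xn--1lqs71d.jp/page/Test", served))  # True
print(is_served("gemini://東京.jp/page/Test", served))  # False: treated as a proxy request
```

Comparing only punycode names keeps the check a plain string comparison; a request that still carries the Unicode form of the host simply does not match.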
What remains is a usability problem, of course. When users write their gemtext, they need to use some sort of tool to do the conversions. It’s basically delegated to editor support. I guess I’m fine with that, for the moment. Allowing users to link to IRIs and transparently translating them to URIs as they get sent to the client would be an easy change to make. I’d still have to solve such problems as how to handle a space character. If the server sees “=> One Two Three” this is a relative link to “One”. Otherwise the user would have had to write “=> One%20Two Three” or “=> One%20Two%20Three”. Then again, perhaps I can just leave it as-is because I often copy and paste weird URIs from elsewhere.
In either case, I can definitely delay this. 😁
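The space problem is easy to see with a minimal link-line parser. This Python sketch is only an illustration (parse_link_line is a made-up name, not Phoebe's code): it treats the first whitespace-separated field after “=>” as the target and the rest as the label, which is why an unencoded space ends the target early.

```python
def parse_link_line(line: str):
    """Split a gemtext link line into (target, label); label may be None."""
    if not line.startswith("=>"):
        return None
    # Drop the "=>" marker, then split off the first field as the target.
    fields = line[2:].strip().split(maxsplit=1)
    if not fields:
        return None
    target = fields[0]
    label = fields[1] if len(fields) > 1 else None
    return target, label

print(parse_link_line("=> One Two Three"))      # ('One', 'Two Three')
print(parse_link_line("=> One%20Two Three"))    # ('One%20Two', 'Three')
print(parse_link_line("=> One%20Two%20Three"))  # ('One%20Two%20Three', None)
```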
So now all I need to do to get some closure is to add some sort of IRI handling to my simple command-line client.
My “/etc/hosts” has the punycode encoding of the new hostname:
127.0.0.1 localhost
127.0.1.1 melanobombus
127.0.1.1 xn--mlanobombus-bbb
I start Phoebe using a non-ASCII hostname and a non-ASCII pagename:
script/phoebe --host=mélanobombus --wiki_page=Schröder
Then I use the “gemini” client:
$ script/gemini --verbose gemini://mélanobombus/page/Schröder
Contacting xn--mlanobombus-bbb:1965
Requesting gemini://xn--mlanobombus-bbb:1965/page/Schr%C3%83%C2%B6der
20 text/gemini; charset=UTF-8
# Schröder
This page does not yet exist.
More:
=> gemini://xn--mlanobombus-bbb:1965/history/Schr%C3%83%C2%B6der History
=> gemini://xn--mlanobombus-bbb:1965/raw/Schr%C3%83%C2%B6der Raw text
=> gemini://xn--mlanobombus-bbb:1965/html/Schr%C3%83%C2%B6der HTML
Happy! 🥳🚀🚀🎉
And when I use Firefox, it works as well. 🙂
The log. Notice the host header.
[debug] HTTP headers: referer => 'https://xn--mlanobombus-bbb:1965/', dnt => '1', accept-encoding => 'gzip, deflate, br', upgrade-insecure-requests => '1', user-agent => 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0', accept-language => 'de-CH,de;q=0.8,en-US;q=0.5,en;q=0.3', accept => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', host => 'xn--mlanobombus-bbb:1965', connection => 'keep-alive'
[info] Looking at GET /page/Schr%C3%B6der HTTP/1.1
[info] Serving Schröder as HTML via HTTP
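The two logs percent-encode the page name differently: the command-line client requests Schr%C3%83%C2%B6der while Firefox requests Schr%C3%B6der. The following Python sketch shows how both forms can arise from the same name. The explanation is a guess and not something confirmed above: it assumes the longer form comes from UTF-8 bytes being read as Latin-1 and encoded a second time.

```python
from urllib.parse import quote

name = "Schröder"

# Percent-encoding the UTF-8 bytes once gives the form in the Firefox log.
once = quote(name)
print(once)  # Schr%C3%B6der

# Misreading those UTF-8 bytes as Latin-1 and encoding again gives the longer form.
misread = name.encode("utf-8").decode("latin-1")
twice = quote(misread)
print(twice)  # Schr%C3%83%C2%B6der
```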
– Alex
---
OK, I got something!
– 2020-12-14 18:49 UTC
---
URL Interop, by @bagder. “This document is an attempt to describe where and how RFC 3986 (86), RFC 3987 (87) and the WHATWG URL Specification (TWUS) differ. This might be useful input when trying to interop with URLs on the modern Internet.”
– Alex 2021-01-18 10:25 UTC