[spec] IRIs, IDNs, and all that international jazz

"Solderpunk" <solderpunk at posteo.net> writes:

Answering for Common Lisp, as best I know (I'm kind of a n00b). Detailed
Common Lisp spam below, skip if you are afraid of parentheses.

> 1. Parse and relativise URLs with non-ASCII characters (so, yes, okay,
>    technically not URLs at all, you know what I mean) in paths and/or
>    domains?

No URI handling in the standard library, but quicklisp has libraries for
it. I'm using quri, which I think is the most used, and it seems to be
fine.

CL-USER> (defparameter *my-iri* (quri:uri "gemini://räksmörgås.josefsson.org/🐇/🐰.gmi"))

CL-USER> *my-iri*
#<QURI.URI:URI gemini://räksmörgås.josefsson.org/🐇/🐰.gmi>

CL-USER> (quri:uri-domain *my-iri*)
"josefsson.org"

CL-USER> (quri:uri-authority *my-iri*)
"räksmörgås.josefsson.org"

CL-USER> (quri:uri-path *my-iri*)
"/🐇/🐰.gmi"

CL-USER> (quri:uri-query *my-iri*)
NIL

CL-USER> (setf (quri:uri-path *my-iri*) "🐇/🐰.gmi")
"🐇/🐰.gmi"

CL-USER> (quri:uri-path *my-iri*)
"🐇/🐰.gmi"

> 2. Transform back and forth between URIs and IRIs?

Using the idna package from quicklisp on the hostname:

CL-USER> (idna:to-ascii (quri:uri-authority *my-iri*))
"xn--rksmrgs-5wao1o.josefsson.org"

CL-USER> (idna:to-unicode (idna:to-ascii (quri:uri-authority *my-iri*)))
"räksmörgås.josefsson.org"

And URL-encoding on the path:

CL-USER> (quri:url-encode (quri:uri-path *my-iri*))
"%F0%9F%90%87%2F%F0%9F%90%B0.gmi"

And decoding the path:

CL-USER> (quri:url-decode (quri:url-encode (quri:uri-path *my-iri*)))
"🐇/🐰.gmi"

I will note, however, that (quri:url-decode "🐇/🐰.gmi") produces
garbage, which means on the server I can't use the library to fix up the
space in "?/?%20?.gmi" when getting the filename for the IRI, and
will have to write a unicode-safe function to handle decoding just
IRI reserved characters.
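
A minimal sketch of what such a function might look like, using only
standard Common Lisp. The name decode-ascii-escapes is hypothetical and
not part of quri. It assumes that only escapes for ASCII octets (below
#x80) need decoding; escaped multi-byte UTF-8 sequences are deliberately
left alone, so this is not a full percent-decoder.

```lisp
;; Hypothetical helper, not part of quri. Decodes only %XX escapes whose
;; value is an ASCII octet (below #x80) and copies every other character,
;; including raw non-ASCII ones, through unchanged.
(defun decode-ascii-escapes (string)
  (with-output-to-string (out)
    (let ((i 0)
          (n (length string)))
      (loop while (< i n)
            do (let* ((ch (char string i))
                      ;; HI and LO are the two hex digit values, or NIL
                      ;; if this is not a complete %XX escape.
                      (hi (and (char= ch #\%)
                               (< (+ i 2) n)
                               (digit-char-p (char string (+ i 1)) 16)))
                      (lo (and hi
                               (digit-char-p (char string (+ i 2)) 16))))
                 (cond ((and hi lo (< (+ (* 16 hi) lo) #x80))
                        ;; ASCII escape: emit the decoded character.
                        (write-char (code-char (+ (* 16 hi) lo)) out)
                        (incf i 3))
                       (t
                        ;; Anything else is copied as-is.
                        (write-char ch out)
                        (incf i))))))))
```

With this, (decode-ascii-escapes "a%20b.gmi") returns "a b.gmi". Note
that it would also turn %2F into a literal slash, so a real server
would probably want to split the path into segments before decoding.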

Putting these together into a function and handling edge-cases is
something I'll do if it turns out I have to.
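
For what it's worth, a rough and untested-in-anger sketch of how the
pieces might fit together. The names split-on-slash and
iri-to-uri-string are hypothetical. It assumes quri and idna are loaded,
encodes the path one segment at a time so the slashes survive, and
ignores ports, userinfo, queries and fragments entirely.

```lisp
;; Hypothetical helpers; assumes the quri and idna systems are loaded.

(defun split-on-slash (path)
  "Split PATH on #\\/ and return the list of segments (possibly empty strings)."
  (loop with start = 0
        for pos = (position #\/ path :start start)
        collect (subseq path start pos)
        while pos
        do (setf start (1+ pos))))

(defun iri-to-uri-string (iri-string)
  "Return an ASCII-only URI string for IRI-STRING.
Sketch only: ports, userinfo, queries and fragments are dropped."
  (let* ((iri (quri:uri iri-string))
         ;; Punycode the hostname.
         (host (idna:to-ascii (quri:uri-host iri)))
         ;; Percent-encode each path segment separately, then rejoin
         ;; with literal slashes.
         (segments (split-on-slash (or (quri:uri-path iri) "")))
         (path (format nil "~{~A~^/~}"
                       (mapcar #'quri:url-encode segments))))
    (format nil "~A://~A~A" (quri:uri-scheme iri) host path)))
```

Encoding per segment is a design choice here: calling quri:url-encode on
the whole path would also escape the slashes, as the %2F in the output
above shows.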

> 3. Do DNS lookups of IDNs without them being punycoded first?  You can
>    test this with räksmörgås.josefsson.org.

The CL standard library is actually so old it doesn't have
sockets/gethostbyname. But everyone uses usocket, which is in quicklisp:

CL-USER> (usocket:get-host-by-name (quri:uri-authority *my-iri*))
#(178 174 241 102)

So that works, without punycoding, at least in my environment (sbcl
2.0.1, Linux 5.8.18, Fedora 33). It might be worth someone trying sbcl
on a BSD to see if their resolver behaves differently.

> Getting good data on all three of these questions for a wide range
> of languages is necessary to make a well-informed decision here.

Personally, I would be most gratified if option 3 proved to be workable.

-- 
+-----------------------------------------------------------+
| Jason F. McBrayer                    jmcbray at carcosa.net  |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it.            -- Dogen        |
