[spec] IRIs, IDNs, and all that international jazz

On Sat Dec 26, 2020 at 1:32 AM CET, Omar Polo wrote:

> In the last two days I took the time to first write a proper URL
> parser[0], and then to extend it to support IRIs[1]. Turns out, once
> you have a URL parser (not hard to do at all), you almost have a
> complete IRI parser. As Sean wrote, you basically have to replace the
> unreserved rule to allow other utf8 characters and you're done. And
> even if you're uncomfortable doing this, the RFC lists the valid ranges,
> so adding a couple of checks isn't the end of the world (if you want to
> be 100% compliant, whatever that means).
>
> (And all of this comes from someone who has never, ever, implemented
> an IRI/URI parser before, who read RFC 3986 for the first time while
> writing the code, and who has successfully -- I believe -- implemented
> a full IRI parser in less than 500 lines of C, with comments and
> everything, without using anything other than the standard library.
> Heck, the parser doesn't even allocate memory.)

This is, more and more, how I'm conceptualising things.
Parsing/validating IRIs is not remotely difficult.  Algorithmically
it's an extremely minor change to parsing/validating URIs.  The
apparent pain exists only because the world has been very slow to
package code for this into major libraries/languages, probably because
HTTP's ASCII-only nature reduces demand.  If we adopt IRIs, I would
encourage Gemini software authors who find their language lacking tools
for this not to write custom code that lives only in their software,
but to try to get the functionality accepted upstream into standard
libraries or widely used third-party libraries.  This is generally
useful functionality that's in no way Gemini-specific, and having easy
support for it everywhere makes the world a better place regardless of
whether Gemini thrives or declines.
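
To give a rough sense of how small the change is, here is a minimal
Python sketch of the "unreserved" rule from RFC 3986 next to an
extended version that also accepts the "ucschar" ranges from RFC 3987.
The function and constant names are made up for illustration, and this
covers only that one character class, not a complete validator.

```python
import string

# RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# RFC 3987 "ucschar": the extra code point ranges an IRI may use.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 to 13, each minus its last two code points.
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)


def is_unreserved(ch):
    """URI rule: ASCII letters, digits and a few punctuation marks."""
    return ch in UNRESERVED


def is_iunreserved(ch):
    """IRI rule: the URI rule plus the ucschar ranges."""
    if is_unreserved(ch):
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)
```

Under this sketch, "é" and "日" fail is_unreserved() but pass
is_iunreserved(), while a space or "/" fails both.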

I don't really think the alleged difficulty of handling IRIs is a good
argument against accepting them.  I'm now more interested in
learning/thinking about normalisation issues, which have been relatively
under-discussed so far.  It's possible this is where the real trouble
lies.  Breaking a UTF-8 IRI up into (scheme, authority, path) is not a
substantial hurdle.
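
As an illustration only, the following Python sketch splits an IRI
using the regular expression from Appendix B of RFC 3986, which works
unchanged on non-ASCII input, and then shows one example of the kind of
normalisation question that remains.  The function name and the example
IRIs are hypothetical, and the choice of NFC here is an assumption made
for the demonstration, not a proposal for what Gemini should specify.

```python
import re
import unicodedata

# Component-splitting regex from RFC 3986, Appendix B.
IRI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_iri(iri):
    """Return (scheme, authority, path); absent parts are None."""
    m = IRI_SPLIT.match(iri)
    return m.group(2), m.group(4), m.group(5)


# Splitting is the easy part.
parts = split_iri("gemini://例え.jp/日本語/ページ.gmi")
print(parts)

# Two IRIs that render identically but differ in code points:
# precomposed U+00E9 versus "e" followed by combining U+0301.
composed = split_iri("gemini://example.org/caf\u00e9")[2]
decomposed = split_iri("gemini://example.org/cafe\u0301")[2]

print(composed == decomposed)  # raw comparison
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The raw comparison of the two paths fails, and they only compare equal
after both are brought to the same Unicode normalisation form.  Who is
responsible for doing that, and when, is the open question.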

Cheers,
Solderpunk
