Unicode vs. the World

🗣️ From: Sean Conner (sean (a) conman.org)
📅 Sent: 2020-12-16 23:41
📧 Message 18 of 34

It was thus said that the Great Björn Wärmedal once stated:
> 
> > but unreasonable for it to have to urlencode the path (a common encoding
> > for which libraries are ubiquitous)?
> 
> Because -- as I tried to point out -- there is no reasonably simple
> heuristic for determining whether a URL is already percent encoded or not.
> And percent encoding a URL that is already percent encoded exchanges all %
> characters with %25. Attempting to punycode a domain name that is already
> punycoded, however, changes nothing at all. No heuristics are needed, the
> client can just punycode everything.
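The asymmetry described in the quoted text can be checked with a short sketch. It uses Python's standard `urllib.parse.quote` and the built-in `idna` codec; the path and host name are made-up examples, not taken from the thread.

```python
from urllib.parse import quote

# Hypothetical path and host, chosen only for illustration.
path = "/caf\u00e9.txt"
host = "b\u00fccher.example"

# Percent-encoding is not idempotent: a second pass escapes every "%".
once = quote(path)    # "/caf%C3%A9.txt"
twice = quote(once)   # "/caf%25C3%25A9.txt"

# Punycode (IDNA ToASCII) leaves an already-encoded ASCII host unchanged.
puny_once = host.encode("idna")                         # b"xn--bcher-kva.example"
puny_twice = puny_once.decode("ascii").encode("idna")   # same bytes again

print(once, twice)
print(puny_once, puny_twice)
```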

  I can't say for certain what most clients do, but I'm under the impression
that some (the majority?) use some existing library to parse links.  The
specification states that relative links are allowed in text/gemini:

=> ../%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt 
Some 𝒻𝒶𝓃𝒸𝓎 stuff here

but a full URI needs to be sent to the server, so some processing of the
link is required (specifically, section 5.2 of RFC-3986).  And existing
libraries help here.  The library I'm currently using will parse the above
link into the following structure:

	{
	  path = "../𝒻𝒶𝓃𝒸𝓎.txt"
	}

  Note how the text has been translated and any percent encoding has been
decoded.  Next, the base URL of the page:

	gemini://example.com/files/others/

has previously been parsed (because it was needed to retrieve the page currently
being viewed):

	{
	  path = "/files/others/",
	  port = 1965.000000,
	  host = "example.com",
	  scheme = "gemini",
	}

  The two are then merged into a single reference:

	{
	  path = "/files/𝒻𝒶𝓃𝒸𝓎.txt",
	  port = 1965.000000,
	  host = "example.com",
	  scheme = "gemini",
	}
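  A minimal sketch of that merge step, assuming both paths are already percent-decoded and the reference carries only a path (no scheme, host or query). The function names `merge` and `remove_dot_segments` are mine, not the library's, and the dot-segment handling is a simplified take on sections 5.2.2 to 5.2.4 of RFC-3986:

```python
def remove_dot_segments(path: str) -> str:
    """Simplified RFC 3986 section 5.2.4, applied to a decoded path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            if len(out) > 1:      # never pop the leading "" that marks the root
                out.pop()
        elif seg != ".":
            out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")            # "/a/b/.." resolves to "/a/", not "/a"
    return "/".join(out)


def merge(base: dict, ref: dict) -> dict:
    """Resolve a path-only reference against an already parsed base."""
    base_path = base["path"] or "/"
    ref_path = ref.get("path", "")
    if not ref_path:
        path = base_path
    elif ref_path.startswith("/"):
        path = ref_path
    else:
        # Keep the base up to and including its last "/", then append.
        path = base_path[: base_path.rfind("/") + 1] + ref_path
    merged = dict(base)
    merged["path"] = remove_dot_segments(path)
    return merged


fancy = "\U0001D4BB\U0001D4B6\U0001D4C3\U0001D4B8\U0001D4CE"  # the five decoded characters
base = {"scheme": "gemini", "host": "example.com", "port": 1965,
        "path": "/files/others/"}
merged = merge(base, {"path": "../" + fancy + ".txt"})
print(merged["path"])
```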

  Then, to make a request, this new link is converted into a URI:

	gemini://example.com/files/%F0%9D%92%BB%F0%9D%92%B6%F0%9D%93%83%F0%9D%92%B8%F0%9D%93%8E.txt

  As you can see, that process has re-encoded the path, percent-encoding it.
I would expect that some (the majority?) of clients are doing something
similar to this---doing a conversion from percent-encoding, merging
references, then converting to percent-encoding (except for the host, which
needs to be converted to punycode).
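  The final conversion could look something like the sketch below. It assumes a parsed structure shaped like the tables above; `to_uri` and `DEFAULT_PORT` are hypothetical names, and the host is run through Python's built-in `idna` codec while the path goes through `urllib.parse.quote`:

```python
from urllib.parse import quote

DEFAULT_PORT = 1965  # Gemini's default port


def to_uri(parts: dict) -> str:
    """Turn a parsed, percent-decoded structure back into a request URI."""
    # Host: punycode (a no-op for hosts that are already ASCII).
    host = parts["host"].encode("idna").decode("ascii")
    port = parts.get("port", DEFAULT_PORT)
    authority = host if port == DEFAULT_PORT else f"{host}:{port}"
    # Path: percent-encode the UTF-8 bytes, leaving "/" alone.
    return f"{parts['scheme']}://{authority}{quote(parts['path'])}"


fancy = "\U0001D4BB\U0001D4B6\U0001D4C3\U0001D4B8\U0001D4CE"  # the five decoded characters
merged = {"scheme": "gemini", "host": "example.com", "port": 1965,
          "path": "/files/" + fancy + ".txt"}
print(to_uri(merged))
```

  Because the path is always decoded before it is encoded again, there is no need to guess whether a link was already percent-encoded. One caveat with this approach: a `%2F` in the original link comes back as a plain `/`, so the distinction between the two is lost.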

  It would be instructive to know how clients are handling this---do they
decode percent-encoded data, merge the base link to the relative link and
re-encode?  Or something different?

  -spc
