[spec] <URL> is a UTF-8 erratum

📧 Messages: 3
🗣️ Authors: 2
📅 First Message: 2020-12-26 01:34
📅 Last Message: 2020-12-26 15:25

1. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 01:34
📧 Message 1 of 3

The spec says, as of v0.14.3, November 29th 2020, under # 2 Gemini requests:

<URL> is a UTF-8 encoded absolute URL, including a scheme, of maximum length 1024 bytes. 

This is wrong.

What's wrong is URL. It should read IRI instead, to make it consistent 
with UTF-8. A URL cannot be UTF-8, while an IRI can. As UTF-8 precedes 
URL in the sentence, UTF-8 has to take precedence.

I suggest the following correction:

<URL> is a UTF-8 encoded absolute IRI, including a scheme, of maximum 
length 4,096* bytes. 


	* Note the increase in size, to compensate for the consistent use of UTF-8 encoding.


To avoid confusion further down the document (inconsistent use of URL vs. 
URI vs. IRI), I would suggest systematically referring to IRI for all of them.

Which gives us:

<IRI> is a UTF-8 encoded absolute IRI, including a scheme, of maximum 
length 4,096 bytes.
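
For illustration, here is a minimal Python sketch of how the byte limit interacts with the two forms under discussion. It compares the size of an identifier sent as raw UTF-8 (IRI style) with the size of the same identifier percent-encoded into ASCII (URL style). The helper names and the example.org address are hypothetical; the 1024-byte figure is the limit quoted from spec v0.14.3 above. The host is kept ASCII on purpose, since non-ASCII host names are handled by a different mechanism than percent-encoding.

```python
from urllib.parse import quote

# Limit quoted from spec v0.14.3 in the message above.
MAX_REQUEST_BYTES = 1024

# Characters left untouched when percent-encoding: URL delimiters plus "%"
# so that already-encoded sequences are not encoded a second time.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def utf8_length(iri: str) -> int:
    """Size in bytes of the identifier when sent as raw UTF-8."""
    return len(iri.encode("utf-8"))


def percent_encoded_length(iri: str) -> int:
    """Size in bytes after percent-encoding every non-ASCII character."""
    return len(quote(iri, safe=SAFE_CHARS).encode("ascii"))


def fits_limit(length_in_bytes: int, limit: int = MAX_REQUEST_BYTES) -> bool:
    """True if a request line of this size stays within the limit."""
    return length_in_bytes <= limit


example = "gemini://example.org/café"  # hypothetical address
print(utf8_length(example))             # 26: "é" takes 2 bytes in UTF-8
print(percent_encoded_length(example))  # 30: "é" becomes "%C3%A9", 6 bytes
```

In this sketch each non-ASCII byte costs three bytes once percent-encoded, so the same identifier is never longer as raw UTF-8 than in its percent-encoded form.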


2. Solderpunk (solderpunk (a) posteo.net)

📅 Sent: 2020-12-26 15:01
📧 Message 2 of 3

On Sat Dec 26, 2020 at 2:34 AM CET, Petite Abeille wrote:
> The spec says, as of v0.14.3, November 29th 2020, under # 2 Gemini
> requests:
>
> <URL> is a UTF-8 encoded absolute URL, including a scheme, of maximum
> length 1024 bytes.
>
> This is wrong.
>
> What's wrong is URL. It should read IRI instead, to make it consistent
> with UTF-8. A URL cannot be UTF-8, while an IRI can. As UTF-8 precedes
> URL in the sentence, UTF-8 has to take precedence.

I've already made it very clear in the main [spec] thread for this topic
that whether we adopt IRIs or not, the use of "UTF-8 encoded URL" should
be fixed, as it is likely to cause confusion.  Rest assured, when the
decision is made, this will be fixed accordingly.  There's no need to
split the thread over this.

It's not true that URLs cannot be UTF-8.  In fact, the opposite is true.
The characters valid in a URL are a subset of those which can be
represented in ASCII, and all byte strings which are valid ASCII are also
valid UTF-8, with equivalent decodings.  Hence, *every* URL is UTF-8.
So there. :p
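
The subset argument can be checked mechanically. The following is a minimal Python sketch, not part of the spec: it confirms that a string made only of ASCII characters yields the same bytes under both encodings and decodes identically, and that a string with a non-ASCII character has no ASCII encoding at all. The function names and the example.org addresses are hypothetical.

```python
def is_ascii_only(text: str) -> bool:
    """True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def ascii_and_utf8_agree(text: str) -> bool:
    """For ASCII-only text, check that the ASCII bytes equal the UTF-8
    bytes and that decoding them as UTF-8 gives back the original text."""
    ascii_bytes = text.encode("ascii")
    utf8_bytes = text.encode("utf-8")
    return ascii_bytes == utf8_bytes and ascii_bytes.decode("utf-8") == text


url = "gemini://example.org/docs/spec.gmi"  # hypothetical ASCII-only URL
iri = "gemini://example.org/café"           # hypothetical IRI with non-ASCII

print(is_ascii_only(url), ascii_and_utf8_agree(url))
print(is_ascii_only(iri))
```

The converse does not hold: arbitrary UTF-8 text is not necessarily ASCII, which is where the URL versus IRI distinction comes in.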

The idea that, as a general principle, when an existing sentence in the
spec is ambiguous or inconsistent, the problem should be resolved by
granting absolute priority to whichever term occurs first in the current
form of the sentence is, plainly, absurd.

It's probably not a hill I'll choose to die on when it comes time to
update the spec based on the IRI decision, but personally I reject the
modern notion that the URL/URN (or IRL/IRN) distinction is not important
and thus everything should be specified at maximum generality as a URI
(or IRI).  The difference matters and protocols/formats should choose
appropriately.  The practical question of how to handle a text/gemini
document becomes *considerably* murkier when link lines contain URNs (of
course, you and I know this quite well already).

Cheers,
Solderpunk


3. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-26 15:25
📧 Message 3 of 3



> On Dec 26, 2020, at 16:01, Solderpunk <solderpunk at posteo.net> wrote:
> 
> Hence, *every* URL is UTF-8. So there. :p

A subset thereof :D

> plainly, absurd.

Yes. And yet, we need a bit of formalism in the spec. 

>  (of course, you and I know this quite well already).

Indeed. The rabbit hole is deep.

