💾 Archived View for gemi.dev › gemini-mailing-list › 000583.gmi captured on 2023-11-04 at 12:57:28. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[tech] [eli5] URI = IRI = ASCII = UTF-8 = Unicode

Petite Abeille <petite.abeille (a) gmail.com>

"Uniform Resource Locators were defined in RFC 1738 in 1994 by Tim Berners-Lee"

~26 years ago. We must have a good understanding of what they are by now.


A URI can, in plain US-ASCII, represent every UTF-8 encoded code point, and therefore all of Unicode.
An IRI can, in plain UTF-8, represent all of Unicode directly, with no need for US-ASCII encoding.

They are wholly equivalent in terms of content and differ only in terms of encoding.

The only visible difference between URI and IRI is how Unicode is 
conveyed: ASCII encoded in URI, UTF-8 in IRI.

They are otherwise identical in all aspects.
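
A minimal sketch of that equivalence for a path component, assuming the percent-encoded octets are UTF-8 (the convention this message describes). The helper names are made up for illustration; `quote` and `unquote` come from Python's standard `urllib.parse`.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    # safe="/" keeps path separators readable; quote() uses UTF-8 by default.
    return quote(path, safe="/", encoding="utf-8")


def uri_path_to_iri_path(path: str) -> str:
    """Undo the percent-encoding, assuming the octets are UTF-8."""
    return unquote(path, encoding="utf-8", errors="strict")


iri_path = "/\U0001F479.gmi"  # U+1F479, a single non-ASCII character
uri_path = iri_path_to_uri_path(iri_path)

print(uri_path)                                    # /%F0%9F%91%B9.gmi
print(uri_path_to_iri_path(uri_path) == iri_path)  # True
```

The round trip only holds under the UTF-8 assumption; with `errors="strict"`, octets that are not valid UTF-8 raise an exception instead of being silently replaced.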


The following Gemini response is in vanilla US-ASCII:

20 text/gemini;charset=us-ascii;
=> gemini://xn--el8h/%F0%9F%91%B9.gmi

And yet, because of the nature of URIs, it contains two Unicode characters,
transmitted as US-ASCII: one Punycode-encoded in the hostname, one as percent-encoded UTF-8 in the path.

Here is the exact same response, but in UTF-8, due to IRI in the link:

20 text/gemini;charset=utf-8;
=> gemini://🎭/👹.gmi


Both responses represent exactly the same content. They are only encoded differently.

Both contain UTF-8, and therefore Unicode. Both are identical. 
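
As a rough sketch, the US-ASCII link can be derived from the UTF-8 one in Python. The function names are hypothetical, and this skips the normalization steps a real IDNA implementation performs (no nameprep or UTS 46 mapping, no port or userinfo handling); it only applies raw Punycode to the host labels and percent-encoding to the path.

```python
from urllib.parse import quote, urlsplit, urlunsplit

ACE_PREFIX = "xn--"  # prefix marking a Punycode-encoded hostname label


def host_to_ascii(host: str) -> str:
    """Punycode-encode each non-ASCII label (simplified, no normalization)."""
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append(ACE_PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def iri_to_uri(iri: str) -> str:
    """Rewrite host and path so the whole link is plain US-ASCII."""
    parts = urlsplit(iri)
    host = host_to_ascii(parts.netloc)
    path = quote(parts.path, safe="/")  # percent-encode UTF-8 octets
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


uri = iri_to_uri("gemini://\U0001F3AD/\U0001F479.gmi")
print(uri)            # gemini://xn--el8h/%F0%9F%91%B9.gmi
print(uri.isascii())  # True
```

Note that the two components use different ASCII encodings: Punycode for the hostname, percent-encoded UTF-8 for the path.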

HTH.

Michael Lazar <lazar.michael22 (a) gmail.com>

URIs use UTF-8 encoded octets only by popular convention and not by
any hard rule. You can stick any kind of binary data into a URI as
long as you percent-encode the non-ASCII bytes.

https://tools.ietf.org/html/rfc3986#section-2.5

For example, this file I just threw on my apache web server:

http://mozz.us/%80.txt

- Michael
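
A small sketch of why that path is legal as a URI yet is not UTF-8: percent-decoding yields raw octets, and a lone 0x80 is a UTF-8 continuation byte with no lead byte. `is_utf8` is a helper made up for this illustration.

```python
from urllib.parse import unquote_to_bytes


def is_utf8(octets: bytes) -> bool:
    """Return True if the octets form a valid UTF-8 sequence."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw = unquote_to_bytes("/%80.txt")  # percent-decoding gives bytes, not text
print(raw)           # b'/\x80.txt'
print(is_utf8(raw))  # False
```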

Petite Abeille <petite.abeille (a) gmail.com>



> On Dec 30, 2020, at 00:13, Michael Lazar <lazar.michael22 at gmail.com> wrote:
> 
> URIs use UTF-8 encoded octets only by popular convention and not by
> any hard rule. You can stick any kind of binary data into a URI as
> long as you percent-encode the non-ASCII bytes.

Yes, indeed. Any random binary will do, e.g. the query portion could 
contain any weird binary data one sees fit to put there.

Not so much in other parts of the URI, though: UTF-8 rules there.

Stephane Bortzmeyer <stephane (a) sources.org>

On Wed, Dec 30, 2020 at 12:25:31AM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 14 lines which said:

> > URIs use UTF-8 encoded octets only by popular convention and not by
> > any hard rule. You can stick any kind of binary data into a URI as
> > long as you percent-encode the non-ASCII bytes.
> 
> Yes, indeed. Any random binary will do, e.g. the query portion could
> contain any weird binary data one sees fit to put there.
> 
> Not so much in other parts of the URI, though: UTF-8 rules there.

This is not true. As Michael said, URIs are bytes, not characters. The
encoding is anyone's guess.

Two details:

* [...] are special rules for hostnames.

* the RFC has provisions for "a new URI scheme" which may apply to
us. We can decide here that URIs of scheme "gemini" MUST be entirely in
UTF-8.
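
If such a rule were adopted, a client or server could enforce it roughly as follows. This is only a sketch of the proposal above, not anything the Gemini specification is known to require; the function name is made up, and the hostname is left alone since hostnames follow their own rules.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def is_utf8_only_gemini_uri(uri: str) -> bool:
    """Check that every percent-decoded component of a gemini URI is UTF-8."""
    parts = urlsplit(uri)
    if parts.scheme != "gemini":
        return False
    # The host is skipped on purpose; only path, query and fragment are checked.
    for component in (parts.path, parts.query, parts.fragment):
        try:
            unquote_to_bytes(component).decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


print(is_utf8_only_gemini_uri("gemini://xn--el8h/%F0%9F%91%B9.gmi"))  # True
print(is_utf8_only_gemini_uri("gemini://example.org/%80.txt"))        # False
```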

Petite Abeille <petite.abeille (a) gmail.com>


> On Jan 3, 2021, at 14:55, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> This is not true. As Michael said, URIs are bytes, not characters. The
> encoding is anyone's guess.

And yet it moves.

https://en.wikipedia.org/wiki/And_yet_it_moves

And no, it's not "anyone's guess", it's de facto in UTF-8.

And that's that.

> * the RFC has provisions for "a new URI scheme" which may apply to
> us. We can decide here that URIs of scheme "gemini" MUST be entirely in
> UTF-8.

+1
