[tech] [eli5] US-ASCII is a subset of UTF-8. UTF-8 is a superset of US-ASCII.

1. Petite Abeille (petite.abeille (a) gmail.com)

📅 Sent: 2020-12-29 10:02
📧 Message 1 of 1

"UTF-8 was first officially presented at the USENIX conference in San Diego in 1993... "

Nearly 28 years ago. We must have a clear understanding of what it is by now.

When something can handle US-ASCII, it does *not* imply it can handle UTF-8.

When something can handle UTF-8, it *does* imply it can handle US-ASCII.

US-ASCII is a subset of UTF-8. UTF-8 is a superset of US-ASCII. 
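
A minimal sketch of the subset relation, using only Python's built-in codecs. The helper name `is_ascii` and the sample strings are illustrative assumptions, not anything from the Gemini specification.

```python
# Every US-ASCII byte sequence is also valid UTF-8 and decodes to the same text.
ascii_bytes = "gemini://example.org/".encode("ascii")
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# The reverse does not hold: UTF-8 text outside the ASCII range is not ASCII.
utf8_bytes = "café".encode("utf-8")


def is_ascii(data: bytes) -> bool:
    """Return True if the bytes decode as US-ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


handles_ascii_input = is_ascii(ascii_bytes)  # an ASCII-only consumer accepts this
handles_utf8_input = is_ascii(utf8_bytes)    # and rejects this
```

An ASCII-only consumer accepts the first input and rejects the second, while a UTF-8 consumer accepts both.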

UTF stands for Unicode Transformation Format.

It carries all the complexity of Unicode, plus some, namely validation.

Including Unicode normalization.
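
The validation part can be sketched as follows: not every byte sequence is well-formed UTF-8, so a receiver has to check before trusting the input. This relies on Python's strict UTF-8 decoder; the helper name `is_valid_utf8` and the sample inputs are assumptions chosen for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "plain ASCII": b"hello",
    "two-byte e-acute": b"\xc3\xa9",
    "truncated sequence": b"\xc3",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate": b"\xed\xa0\x80",
}

# Map each sample name to whether the decoder accepts it.
results = {name: is_valid_utf8(raw) for name, raw in samples.items()}
```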

The very same character can be represented by various code-point 
sequences in Unicode. 
And therefore in UTF-8 text. 
And therefore in Punycode. 
And therefore in URL encoding.
And therefore in URI encoding.
And therefore in IRI.
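
A small illustration of that chain, taking "é" as the example character. The choice of character is an assumption made for the sketch; the calls are Python standard library only (`urllib.parse.quote` and the built-in `punycode` codec).

```python
from urllib.parse import quote

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# Both render as the same character, yet they are different strings.
same_string = composed == decomposed

# The difference carries through every encoding layered on top.
utf8_composed = composed.encode("utf-8")      # b'\xc3\xa9'
utf8_decomposed = decomposed.encode("utf-8")  # b'e\xcc\x81'

url_composed = quote(composed)      # '%C3%A9'
url_decomposed = quote(decomposed)  # 'e%CC%81'

puny_differs = composed.encode("punycode") != decomposed.encode("punycode")
```

A byte-for-byte comparison at any of these layers treats the two forms as different resources.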

This is why, before any interactions can take place, both parties MUST 
normalize Unicode the same way. And then encode/decode it.
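
A sketch of "normalize, then encode", assuming both parties agree on NFC. The message does not say which normalization form to use, so NFC is an assumption here, as is the helper name `canonical_utf8`.

```python
import unicodedata


def canonical_utf8(text: str, form: str = "NFC") -> bytes:
    """Normalize text to the agreed form, then encode it as UTF-8.

    Hypothetical helper: the form must be the same on both sides.
    """
    return unicodedata.normalize(form, text).encode("utf-8")


composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# After normalization both spellings yield identical bytes.
agree_nfc = canonical_utf8(composed) == canonical_utf8(decomposed)

# If the two sides pick different forms, the bytes disagree again.
mismatch = canonical_utf8(composed, "NFC") != canonical_utf8(composed, "NFD")
```

The second comparison is the reason the form has to be agreed on in advance: normalizing is not enough if each side normalizes differently.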

There is no way around this requirement if we say that we support UTF-8, 
and therefore Unicode.

All this applies wherever UTF-8 can appear: in a request URL, in 
text/gemini, in a text/gemini link URL.

Making UTF-8 a MUST makes Unicode a MUST. This is the way.

If Gemini does not want to deal with the complexity of Unicode, Gemini 
MUST change the specification to read:

Clients MUST support US-ASCII, and SHOULD support UTF-8.

Gemini cannot have it both ways: it's either UTF-8, and therefore Unicode, 
or plain US-ASCII.

HTH.
