💾 Archived View for gemi.dev › gemini-mailing-list › 000583.gmi captured on 2023-11-04 at 12:57:28. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
"Uniform Resource Locators were defined in RFC 1738 in 1994 by Tim Berners-Lee" ~26 years ago. We must have a good understanding of what they are by now. An URI can -in plain US-ASCII- represent all UTF-8 code points. Therefore all of Unicode. An IRI can -in plain UTF-8- represent all of Unicode. No need for US-ASCII encoding. They are wholly equivalent in term of content, and only differs in terms of encoding. The only visible difference between URI and IRI is how Unicode is conveyed: ASCII encoded in URI, UTF-8 in IRI. They are otherwise identical in all aspects. The following Gemini response is in vanilla US-ASCII: 20 text/gemini;charset=us-ascii; => gemini://xn--el8h/%F0%9F%91%B9.gmi And yet, because of the nature of URI, it contains two Unicode characters; US-ASCII transmitted; UTF-8 encoded. Here is the exact same response, but in UTF-8, due to IRI in the link: 20 text/gemini;charset=utf-8; => gemini://?/?.gmi Both responses represent exactly the same content. They are only encoded differently. Both contain UTF-8, and therefore Unicode. Both are identical. HTH.
URIs use UTF-8 encoded octets only by popular convention and not by any hard rule. You can stick any kind of binary data into a URI as long as you percent-encode the non-ASCII bytes.

https://tools.ietf.org/html/rfc3986#section-2.5

For example, this file I just threw on my Apache web server:

http://mozz.us/%80.txt

- Michael
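A small Python sketch of that point, using only the standard library: the path `/%80.txt` percent-decodes to a byte sequence that is not valid UTF-8, yet it round-trips through percent-encoding without trouble. The variable names are illustrative only.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Percent-decode the path into raw octets, with no charset assumed.
raw = unquote_to_bytes("/%80.txt")

# 0x80 is a lone continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# The same octets can still be carried in a URI by percent-encoding them.
encoded_again = quote_from_bytes(raw)
```

Here `is_utf8` ends up `False`, while `encoded_again` reproduces the original path.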
> On Dec 30, 2020, at 00:13, Michael Lazar <lazar.michael22 at gmail.com> wrote:
>
> URIs use UTF-8 encoded octets only by popular convention and not by
> any hard rule. You can stick any kind of binary data into a URI as
> long as you percent-encode the non-ASCII bytes.

Yes, indeed. Any random binary will do, e.g. the query portion could contain any weird binary data one sees fit to put there.

Not so much in other parts of the URI though, UTF-8 rules there.
On Wed, Dec 30, 2020 at 12:25:31AM +0100, Petite Abeille <petite.abeille at gmail.com> wrote a message of 14 lines which said:

> > URIs use UTF-8 encoded octets only by popular convention and not by
> > any hard rule. You can stick any kind of binary data into a URI as
> > long as you percent-encode the non-ASCII bytes.
>
> Yes, indeed. Any random binary will do, e.g. the query portion could
> contain any weird binary data one sees fit to put there.
>
> Not so much in other parts of the URI though, UTF-8 rules there.

This is not true. As Michael said, URI are bytes, not characters. The encoding is anyone's guess.

Two details:
> On Jan 3, 2021, at 14:55, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> This is not true. As Michael said, URI are bytes, not characters. The
> encoding is anyone's guess.

And yet it moves.

https://en.wikipedia.org/wiki/And_yet_it_moves

And no, it's not "anyone's guess", it's de facto in UTF-8. And that's that.

> * the RFC has provisions for "a new URI scheme" which may apply to
> us. We can decide here that URI of scheme "gemini" MUST be entirely in
> UTF-8.

+1 ? ???
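If the "gemini" scheme did adopt such a rule, a client or server could check it mechanically. The following Python sketch is one possible reading of "entirely in UTF-8": every component, once percent-decoded, must be valid UTF-8. The function name `is_utf8_gemini_uri` and the example hosts are assumptions for illustration, not part of any Gemini specification.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def is_utf8_gemini_uri(uri: str) -> bool:
    """Hypothetical check: gemini URI whose percent-decoded octets are all UTF-8."""
    if not uri.isascii():
        # A URI (as opposed to an IRI) is ASCII-only on the wire.
        return False
    parts = urlsplit(uri)
    if parts.scheme != "gemini":
        return False
    for component in (parts.netloc, parts.path, parts.query, parts.fragment):
        try:
            # Decode percent-escapes to raw octets, then require strict UTF-8.
            unquote_to_bytes(component).decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
    return True


ok = is_utf8_gemini_uri("gemini://xn--el8h/%F0%9F%91%B9.gmi")
bad_path = is_utf8_gemini_uri("gemini://example.org/%80.txt")
bad_query = is_utf8_gemini_uri("gemini://example.org/?q=%FF")
```

Under this reading the link from the first message passes, while a path such as `/%80.txt` or a query carrying `%FF` would be rejected.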