Archived View for gemi.dev › gemini-mailing-list › 000580.gmi captured on 2024-08-31 at 17:38:13. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
This is not fully expressed in the specification, but practically "all" text/gemini documents are encoded in either UTF-8 or US-ASCII. Stephane Bortzmeyer compiled the following list from his crawler:

> Only for text/gemini:
>
> * Unspecified: 5997
> * utf-8: 4619
> * tcvn-5712: 2
> * cp437: 2
> * utf-16be: 1
> * utf-16: 1
> * windows-1252: 1
> * utf-32le: 1
> * utf-32be: 1
> * utf-16le: 1
> * ebcdicatde: 1
>
> But wait, all the exotic charsets are at <gemini://egsam.pitr.ca/>
> which is a test site for various funny stuff. So, it is safe to say
> that not one "real" gemtext resource uses something else than UTF-8.

While the impact is minimal, I suggest that the specification reflect the much simpler situation these statistics indicate, rather than keep itself open to the general problem of representing text/gemini in encodings that might not even encode the meta information characters in the same way and that, if IRIs are introduced, raise the problem of how IRIs should be represented in e.g. ISO-8859-1.

I understand the need for other document types to take other character encodings. For example, I have a collection of old text files in IBM437 encoding. For text/gemini, we pretty much have a blank slate, though, and I see no reason that it should extend to support arbitrary encodings when limiting to UTF-8 creates a much simpler situation for implementers and is already the unspoken standard.

There are display systems and platforms that fundamentally can't display UTF-8 directly. For example, in the PC text modes I am limited to IBM437. The problem of transcoding text/gemini should then lie with the client authors for those platforms, not with every other client author. ELinks for DOS, for example, will transcode UTF-8 (and various other encodings) to IBM437 and use a placeholder character where no equivalent exists.

-- 
Philip
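The transcoding Philip describes can be sketched roughly as follows. This is only an illustration, assuming Python's built-in `cp437` codec as a stand-in for IBM437; the function name `to_ibm437` is hypothetical and is not taken from ELinks or any Gemini client.

```python
def to_ibm437(utf8_bytes: bytes) -> bytes:
    """Transcode UTF-8 input to IBM437 (Python codec "cp437").

    Characters with no cp437 equivalent are replaced by the codec's
    placeholder, which is "?" when errors="replace" is used.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("cp437", errors="replace")


if __name__ == "__main__":
    # "é" exists in cp437; the CJK character does not and becomes "?".
    print(to_ibm437("café 日".encode("utf-8")))
```

With this split, only clients on platforms that cannot display UTF-8 carry the transcoding cost, which is the division of labour the message argues for.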
On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:

> I understand the need for other document types to take other character
> encodings. For example, I have a collection of old text files in IBM437
> encoding. For text/gemini, we pretty much have a blank slate, though,
> and I see no reason that it should extend to support arbitrary
> encodings when limiting to UTF-8 creates a much simpler situation for
> implementers and is already the unspoken standard.

The main reason I see for authorizing other encodings is to be future-proof.

When people designed old protocols they thought ASCII was here to stay.

So I think we should learn from the past and not set in stone that all files must use UTF-8. Maybe something else will arise and be better for some unsuspected reason, and people will want to use that.

Côme
> On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:
>
> > I understand the need for other document types to take other character
> > encodings. For example, I have a collection of old text files in IBM437
> > encoding. For text/gemini, we pretty much have a blank slate, though,
> > and I see no reason that it should extend to support arbitrary
> > encodings when limiting to UTF-8 creates a much simpler situation for
> > implementers and is already the unspoken standard.

The spec says that "Compliant clients MUST support UTF-8-encoded text/* responses. Clients MAY optionally support other encodings". So the argument that we should make things simpler for implementers does not really carry much weight here. It's 100% okay to write a client which (gracefully) refuses to handle any encoding other than UTF-8. People who want to serve text/gemini content with some other encoding can, but they have no right to complain when only a subset (potentially a very small one) of people can view said content.

This all seems fine to me. Nobody is required or expected to support anything difficult or unusual, but if some group of people all decide they want to do something difficult or unusual for some strange reason, and they're willing to do the work required, then nobody can tell them they're doing anything wrong.

Cheers,
Solderpunk
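A client that "gracefully refuses" anything but UTF-8 might look roughly like the sketch below. It assumes the response meta has the form `text/gemini; charset=...` and that a missing charset means UTF-8, as stated later in the thread; the names `parse_meta`, `decode_body` and `UnsupportedCharset` are hypothetical, not part of any existing client.

```python
class UnsupportedCharset(Exception):
    """Raised when a response declares a charset this client does not handle."""


def parse_meta(meta: str):
    """Split a meta string such as 'text/gemini; charset=utf-8' into
    (media_type, params), lower-casing the media type and parameter names."""
    parts = [p.strip() for p in meta.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip().strip('"')
    return media_type, params


def decode_body(meta: str, body: bytes) -> str:
    """Decode a text/* body, accepting only UTF-8 (the assumed default)."""
    _media_type, params = parse_meta(meta)
    charset = params.get("charset", "utf-8").lower()
    if charset != "utf-8":
        # Refuse gracefully: the caller can show a message instead of garbage.
        raise UnsupportedCharset(f"cannot display charset {charset!r}")
    return body.decode("utf-8")
```

A caller would catch `UnsupportedCharset` and tell the user why the page cannot be shown, rather than guessing at the bytes.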
On Mon, 28 Dec 2020 14:25:29 +0100 Côme Chilliet <come at chilliet.eu> wrote:

> The main reason I see for authorizing other encodings is to be future proof.
>
> When people designed old protocols they thought ASCII was here to stay.

On the other hand, when people adopted Unicode 5.0, they did so fully aware that there would likely be a Unicode 6.0, 7.0, 8.0 etc. Unicode is future-proof in the sense that it includes a process for updating itself. The comparison to ASCII does not take into account the entirely different approaches these standards take. Where ASCII is fixed and limited to a relatively tiny set of characters, Unicode is deliberately open to amendment.

> So, I think we should learn from the past and not set in stone that all files must use utf-8, maybe something else will arise and be better for some unsuspected reason, and people will want to use that.

In such an event, under the current spec, we can expect either an effective split of geminispace along the lines of which clients support which encodings, or widespread client updates. This is not a very different situation from a change to the fixed spec.

-- 
Philip
On Mon, 28 Dec 2020 14:30:38 +0100 "Solderpunk" <solderpunk at posteo.net> wrote:

> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses. Clients MAY optionally support other encodings". So, the
> argument that we should make things simpler for implementers does not
> really carry much weight here. It's 100% okay to write a client which
> (gracefully) refuses to handle any encoding other than UTF-8.

I am not so interested in what is okay or not in the abstract. As a client author, the ideal situation for me is that my client supports the entire per-specification geminispace. The specification currently makes this a much harder problem than it would be if text/gemini documents were limited to UTF-8. In fact, it's an open-ended problem that's subject to change (as new encodings are introduced) and interpretation (concerning what sequence of bytes represents e.g. "=>" in a particular encoding, or how to transliterate URI to ASCII or IRI to UTF-8). Thankfully, geminispace seems to have settled on UTF-8, which is why I think this is a good time to tie that end up.

> People
> who want to serve text/gemini content with some other encoding can, but
> they have no right to complain when only a subset (potentially a very
> small one) of people can view said content. This all seems fine to me.
> Nobody is required or expected to support anything difficult or unusual,
> but if some group of people all decide they want to do something
> difficult or unusual for some strange reason, and they're willing to do
> the work required, then nobody can tell them they're doing anything
> wrong.

Perhaps there is a great argument for allowing other encodings that makes this an acceptable outcome, but a hypothetical effective split of geminispace around which encodings are used and which clients support them doesn't sound desirable in itself.

We can turn the question around and instead ask what motivates the inclusion of each supported encoding (or arbitrary encodings in general, a simpler question).

-- 
Philip
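The point about what bytes represent "=>" can be made concrete with a small sketch. It assumes Python's built-in codecs, and `cp037` is used only as an available EBCDIC stand-in; it is not the "ebcdicatde" charset from the crawler list. The function name `link_prefix_bytes` is hypothetical.

```python
def link_prefix_bytes(encoding: str) -> bytes:
    """Return the byte sequence that encodes the gemtext link prefix "=>"
    in the given encoding."""
    return "=>".encode(encoding)


if __name__ == "__main__":
    # A parser that scans raw bytes for b"=>" only works for the
    # ASCII-compatible encodings in this list.
    for name in ("utf-8", "us-ascii", "cp437", "utf-16-le", "utf-32-be", "cp037"):
        print(f"{name:10} {link_prefix_bytes(name).hex(' ')}")
```

A client that wants to support such encodings therefore has to decode the whole document before looking for line markers, instead of matching bytes directly.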
Note also, regarding the current discussion about IRI, that if IRI is adopted and the community later adopts a different (UTF-8 incompatible) standard encoding, the way IRI is implemented has to change, or we lose its benefits entirely when we implement a galaxy-grade encoding in the year 2121 and still have to deal with UTF-8 IRI.

Even if IRI is decided against, by requiring URI at all, we're betting that this hypothetical future encoding will be ASCII-compatible, or we're giving client authors the additional burden of transcoding.

-- 
Philip
It was thus said that the Great Solderpunk once stated:

> > On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:
> >
> > > I understand the need for other document types to take other character
> > > encodings. For example, I have a collection of old text files in IBM437
> > > encoding. For text/gemini, we pretty much have a blank slate, though,
> > > and I see no reason that it should extend to support arbitrary
> > > encodings when limiting to UTF-8 creates a much simpler situation for
> > > implementers and is already the unspoken standard.
>
> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses. Clients MAY optionally support other encodings".

I would amend that to read "Compliant clients MUST support UTF-8 and US-ASCII encoded text/* responses." This is because US-ASCII is a proper subset of UTF-8, and any valid ASCII file is also a valid UTF-8 file. I'm thinking here of automated MIME detection (à la libmagic) that might return a MIME type of 'text/plain; charset=us-ascii' for a text file.

-spc
> On Dec 29, 2020, at 00:35, Sean Conner <sean at conman.org> wrote:
>
> I would amend that to read "Compliant clients MUST support UTF-8 and
> US-ASCII encoded text/* responses."

This is wholly redundant. UTF-8 is, by design, a /superset/ of US-ASCII.

But the reverse is not true, obviously. Therefore the endless confusion.

Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

My 2¢.
> On Dec 29, 2020, at 09:00, Petite Abeille <petite.abeille at gmail.com> wrote:
>
> Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

If this is too much to swallow, then we have to phrase it the other way around:

"Clients MUST support US-ASCII, and SHOULD support UTF-8"

The same applies to the request URL, and text/gemini links.
It was thus said that the Great Petite Abeille once stated:

> > On Dec 29, 2020, at 00:35, Sean Conner <sean at conman.org> wrote:
> >
> > I would amend that to read "Compliant clients MUST support UTF-8 and
> > US-ASCII encoded text/* responses."
>
> This is wholly redundant. UTF-8 is, by design, a /superset/ of US-ASCII.

No, it's not. Here's the original text:

> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses. Clients MAY optionally support other encodings".

Per this wording, any client that receives "text/plain; charset=us-ascii" is allowed to just drop it on the floor and do absolutely nothing with it. Some here might actually prefer that, but "text/plain; charset=us-ascii" is also "text/plain; charset=utf-8"; that is, a client *can* do something meaningful with it, unlike "text/plain; charset=CSISOLATIN3".

> Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

Why do you hate textfiles.com?

> My 2¢.

-spc (my $.02)
> On Dec 29, 2020, at 10:03, Sean Conner <sean at conman.org> wrote:
>
> Per this wording, any client that receives "text/plain; charset=us-ascii"
> is allowed to just drop it on the floor and do absolutely nothing with it.

Nonsense. A compliant client MUST support UTF-8. US-ASCII is a strict subset of UTF-8. Therefore a compliant client supports US-ASCII out-of-the-box. Nothing more, and nothing less.

> Why do you hate textfiles.com?

Haters gonna hate :P

Isn't it, like, 3am in your timezone? Go back to bed :)
On Tue, Dec 29, 2020 at 10:11 AM Petite Abeille <petite.abeille at gmail.com> wrote:

> > On Dec 29, 2020, at 10:03, Sean Conner <sean at conman.org> wrote:
> >
> > Per this wording, any client that receives "text/plain; charset=us-ascii"
> > is allowed to just drop it on the floor and do absolutely nothing with it.
>
> Nonsense. A compliant client MUST support UTF-8. US-ASCII is a strict subset of UTF-8. Therefore a compliant client supports US-ASCII out-of-the-box. Nothing more, and nothing less.

A car contains people. Therefore people are cars. Petite, you are confusing Is-A and Has-A relationships [1][2]. UTF-8 is a character encoding (separate from US-ASCII) that contains the ASCII character set. If the spec said "clients MUST support ONLY UTF-8" then any page specifying "charset=us-ascii" would have to result in an error.

[1] https://en.wikipedia.org/wiki/Is-a
[2] https://en.wikipedia.org/wiki/Has-a

Back to a more productive topic: the wording in the spec, "clients MUST support UTF-8 encoded responses", is ambiguous. It doesn't actually mean that the acceptable values for "charset" must include "utf-8", and it says nothing about which values of "charset" are acceptable. It says that clients must at the very least try to decode the response using a UTF-8 decoder. Responses encoded with US-ASCII and UTF-8 (and UTF-PETER, which is a random subset of UTF-8) will indeed work.

Looking at the latest stats on gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like UTF-8 (this includes unspecified charsets, which per spec default to UTF-8) is used by 81% of pages, and US-ASCII accounts for 17%.

Given this, I suggest the spec be rephrased such that it instead specifies the minimum acceptable values of "charset" (specifically us-ascii and utf-8).
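The distinction being argued here, between the declared "charset" label and the decoder actually used, can be sketched as follows. This is an illustration of the proposal only, assuming a client keeps an explicit set of accepted labels; the names `ACCEPTED_CHARSETS` and `decode_text` are hypothetical and the spec defines no such list.

```python
# Assumed minimum set of charset labels, following the proposal in the thread.
ACCEPTED_CHARSETS = {"utf-8", "us-ascii"}


def decode_text(body: bytes, charset: str = "utf-8") -> str:
    """Decode a text/* body if its declared charset label is accepted.

    Both accepted labels go through the same UTF-8 decoder, because every
    valid US-ASCII byte sequence is also valid UTF-8 with the same meaning.
    """
    label = charset.lower()
    if label not in ACCEPTED_CHARSETS:
        raise ValueError(f"unsupported charset label: {charset!r}")
    return body.decode("utf-8")


if __name__ == "__main__":
    sample = b"plain old text"
    # The subset property: decoding as ASCII or as UTF-8 gives the same text.
    print(sample.decode("ascii") == sample.decode("utf-8"))
    print(decode_text(sample, "US-ASCII"))
```

Under this reading, a client that checks labels strictly against "utf-8" alone would reject `charset=us-ascii` even though its decoder could handle the bytes, which is the gap the proposed rewording closes.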
It was thus said that the Great Peter Vernigorov once stated:

> Looking at latest stats on
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> UTF-8 (this includes unspecified charsets which per spec default to
> UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.
>
> Given this, I suggest the spec be rephrased such that it instead
> specifies minimum acceptable values of "charset" (specifically
> us-ascii and utf-8).

Agreed. And looking at the stats from GUS [1], text/plain is more popular than text/gemini (by over 2:1) and UTF-8 to US-ASCII is 54% to 46%.

-spc

[1] gemini://gus.guru/statistics
> On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
>
> Back to a more productive topic, the wording in the spec - "clients
> MUST support UTF-8 encoded responses" - is ambiguous and doesn't
> actually mean that acceptable value for "charset" must include
> "utf-8", and says nothing about what values of "charset" are
> acceptable.

Confused indeed. Are you making a distinction between UTF-8 the encoding and utf-8 the charset? Is there such a difference? What would that difference be? I feel out of my depth.

But ok, if that helps in terms of clarity of purpose, then more power to the spec for spelling it out.
> looking at the stats from GUS [1], text/plain is more popular than
> text/gemini (by over 2:1)

If I remember correctly, most of that plaintext is from an RFC mirror somewhere. I'm sure the actual stats for content written and intended for Gemini, not including mirrored content, differ significantly from GUS's report.

-- 
Alex // nytpu
alex at nytpu.com
GPG Key: https://www.nytpu.com/files/pubkey.asc
Key fingerprint: 43A5 890C EE85 EA1F 8C88 9492 ECCD C07B 337B 8F5B
https://useplaintext.email/
> On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
>
> Looking at latest stats on
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> UTF-8 (this includes unspecified charsets which per spec default to
> UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.

The actual numbers are as follows:

* Unspecified: 39628
* us-ascii: 9995
* utf-8: 7090

(56,713 total)

It's not clear whether this pertains to the 36,477 text/gemini documents only, or to the entire dataset (57,164 URLs vs. 56,713 encodings; 451 MIA).

Looking at the numbers, I guess it covers the entire dataset, as there are more 'Unspecified' than 'text/gemini' to start with.

I'm not sure what these numbers mean at all, but they are not describing text/gemini.

Not sure why we would draw any conclusion from them in regard to text/gemini.
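The arithmetic behind these figures can be checked with a few lines. The counts are the ones quoted in the message; treating the 57,164 URL count as the denominator for the earlier "81%" and "17%" is an inference made here, not something the thread states.

```python
# Charset counts as quoted in the message above.
counts = {"unspecified": 39628, "us-ascii": 9995, "utf-8": 7090}

total_encodings = sum(counts.values())       # charset entries tallied
total_urls = 57164                           # URL count quoted in the message
missing = total_urls - total_encodings       # entries with no charset tally

# Unspecified defaults to UTF-8 per the spec, so the two are added together.
utf8_share = (counts["unspecified"] + counts["utf-8"]) / total_urls
ascii_share = counts["us-ascii"] / total_urls

if __name__ == "__main__":
    print(total_encodings, missing)
    print(f"utf-8 (incl. unspecified): {utf8_share:.1%}")
    print(f"us-ascii: {ascii_share:.1%}")
```

Taken against the URL count and truncated to whole percent, the shares come out consistent with the 81% and 17% quoted earlier.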
On Wed, Dec 30, 2020 at 00:04 Petite Abeille <petite.abeille at gmail.com> wrote:

> > On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> >
> > Looking at latest stats on
> > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> > UTF-8 (this includes unspecified charsets which per spec default to
> > UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.
>
> The actual numbers are as follows:
>
> * Unspecified: 39628
> * us-ascii: 9995
> * utf-8: 7090
> (56,713 total)
>
> It's not clear if this pertain to the 36,477 text/gemini documents only,
> or the entire dataset (57,164 url vs. 56,713 encodings. 451 MIA).

Could you clarify which part is unclear to you here? 56,713 is, by design, a strict /superset/ of 36k :)

> Looking at the numbers I guess it covers the entire data set as there are
> more 'Unspecified' than 'text/gemini' to start with.
>
> I'm not sure what these numbers mean at all, but they are not describing
> text/gemini.
>
> Not sure why we would draw any conclusion from them in regards to
> text/gemini.

While it's true that the thread subject mentions text/gemini, the oft-quoted part of the spec is in section "3.3 Response bodies" and talks about any text/* responses. The only mention of charset in section 5 (which describes text/gemini) is a reference to 3.3.

Also, looking at the stats of either the entire dataset or only text/gemini shows the same picture: utf-8 and us-ascii account for ~99% of all charset values.
> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
>
> Could you clarify which part is unclear to you here? 56,713 is, by design, a strict /superset/ of 36k :)

Math is hard, let's go shopping. I'm sure it will all make sense at the very, very end.
> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
>
> , the oft quoted part of the spec is in section "3.3 Response bodies" and talks about any text/* responses.

Ohhhh... right you are, I was assuming text/gemini only. My bad.

This sounds like a major overreach. Shouldn't Gemini restrict itself to just text/gemini as far as the Gemini spec goes?

On what grounds would Gemini redefine what a text content type is?

A MIME content type of "text" is "text/plain; charset=us-ascii" by default.

I don't quite see the point of redefining how a major piece of MIME is defined. Sounds counterproductive.
> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
>
> "3.3 Response bodies" and talks about any text/* responses.

Actually, this is counterproductive, and wrong, technically speaking.

Consider the following response:

20 text/html
...

While HTML5 is UTF-8 by default, most vintage HTML is ISO-8859-1. And there is a lot of vintage to go around.

Defaulting to UTF-8 for all of text/* at large would break the interweb as we know it. Why take on such a burden?

Perhaps it is best to narrow the spec to only speak about text/gemini. Other text/* media types have their own idiosyncrasies. Best to leave them alone.

2¢
On Tue, Dec 29, 2020 at 04:37:08PM -0500, Sean Conner <sean at conman.org> wrote a message of 17 lines which said:

> > Looking at latest stats on
> > <gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi> it looks
> > like UTF-8 (this includes unspecified charsets which per spec
> > default to UTF-8) is used by 81% of pages, US-ASCII accounts for
> > 17%.
>
> And looking at the stats from GUS [1], text/plain is more popular
> than text/gemini (by over 2:1) and UTF-8 to US-ASCII is 54% to 46%.

This is because it includes a lot of text/plain. I've just modified the stats at <gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi> to have a special tally for text/gemini, and UTF-8 has a quasi-monopoly.
On Mon, Dec 28, 2020 at 02:16:27PM +0100, Philip Linde <linde.philip at gmail.com> wrote a message of 69 lines which said:

> While it is the case that impact is minimal, I suggest that the
> specification reflects the much simpler situation these statistics
> indicate rather than keep itself open to the general problem of
> representing text/gemini in encodings that might not even have the
> meta information characters encoded in the same way, and, if IRIs are
> introduced, creates the problem of how IRIs should be represented in
> e.g. ISO-8859-1.

Note also that saying "gemtexts MUST be in UTF-8" is not everything. We may (or may not) also want to mandate end-of-lines (they can be represented with CR, LF, CR-LF, LS or PS, the last two being purely Unicode, not present in ASCII) and normalization.

If we go that way, there is an existing standard for Unicode text, RFC 5198 <gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5198.txt>. It mandates CR-LF and normalization NFC.
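What mandating CR-LF line endings and NFC would mean for a producer of gemtext can be sketched roughly as below. This is a simplified illustration, not a full RFC 5198 implementation; the function name `normalize_gemtext` is hypothetical, and only the five line terminators named in the message are handled.

```python
import re
import unicodedata

# CR-LF must come first in the alternation so it is not treated as CR + LF.
# \u2028 is LINE SEPARATOR (LS), \u2029 is PARAGRAPH SEPARATOR (PS).
_EOL = re.compile("\r\n|\r|\n|\u2028|\u2029")


def normalize_gemtext(text: str) -> str:
    """Apply NFC normalization and rewrite every line ending as CR-LF."""
    text = unicodedata.normalize("NFC", text)
    return _EOL.sub("\r\n", text)


if __name__ == "__main__":
    # "e" followed by COMBINING ACUTE ACCENT composes to a single "é" under NFC.
    sample = "caf" + "e\u0301" + "\n=> gemini://example.org/ link\r"
    print(repr(normalize_gemtext(sample)))
```

Without a rule of this kind, two documents that look identical on screen can still differ byte for byte, which matters for anything that compares or hashes content.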
> On Jan 3, 2021, at 14:49, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> It mandates CR-LF and normalization NFC.

RFC5198. Yes. Normalization, normalization.
> On Jan 3, 2021, at 14:46, Stephane Bortzmeyer <stephane at sources.org> wrote:
>
> UTF-8 has a quasi-monopoly.

Not quite.

For text/gemini, your stats read:

* Unspecified: 42,322
* utf-8: 6,513
* us-ascii: 3

Unspecified rules. By far. Most likely plain ASCII in practice.

Could you run file --mime-type --mime-encoding on all these text/gemini?

$ openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | file --brief --mime-type --mime-encoding -
text/plain; charset=utf-8

Validating the encoding would be informative as well:

$ openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | iconv -f utf-8 -t utf-8 > /dev/null; echo $?
0

Ditto for guessing the actual language:

# echo $(openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null ) | polyglot detect | cut -d' ' -f1 | uniq
English

https://polyglot.readthedocs.io/en/latest/Detection.html
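The validation step done with iconv above has a close equivalent in Python, sketched here for a crawler that already holds the response body as bytes. The function names are hypothetical, and this only answers "do the bytes decode?"; it does not guess at other encodings the way `file` does.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (strict mode)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit US-ASCII range."""
    return data.isascii()


if __name__ == "__main__":
    for body in (b"# Hello", "# Héllo".encode("utf-8"), b"# H\xe9llo"):
        print(body, is_valid_utf8(body), is_plain_ascii(body))
```

Running both checks over bodies served with no charset parameter would show how many are plain ASCII in practice and how many rely on UTF-8 proper.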
On Sunday 3 January 2021, 17:02:54 CET, Petite Abeille wrote:

> > On Jan 3, 2021, at 14:46, Stephane Bortzmeyer <stephane at sources.org> wrote:
> > UTF-8 has a quasi-monopoly.
>
> Not quite.
>
> For text/gemini, your stats read:
>
> * Unspecified: 42,322
> * utf-8: 6,513
> * us-ascii: 3
>
> Unspecified rules. By far. Most likely plain ASCII in practice.

No, the specification says that the default is UTF-8, so unspecified is UTF-8.

I do not set the charset in my server headers, as it is redundant: I always send UTF-8.

> Ditto for guessing the actual language:
>
> # echo $(openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null ) | polyglot detect | cut -d' ' -f1 | uniq
> English
>
> https://polyglot.readthedocs.io/en/latest/Detection.html

Language is not the same, because the specification explicitly says that there is no default, so my server always sends the lang= header tag for text/gemini content.

Côme
> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
>
> No, the specification specifies that default is utf-8, so unspecified is utf-8.

Precisely my point. Thanks.
> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
>
> Language is not the same, because the specification explicitly says that there is no default, so my server always send the lang= header tag for text/gemini content.

No one said it was the same. But it would be interesting to know.
> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
>
> I do not set the charset in my server headers as it is redundant because I always send utf-8.

What's your server? We can validate that promptly.
On Sunday 3 January 2021, 17:13:10 CET, Petite Abeille wrote:

> > On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
> >
> > No, the specification specifies that default is utf-8, so unspecified is utf-8.
>
> Precisely my point. Thanks.

No, you said "Not quite." when Stephane said UTF-8 had a quasi-monopoly, and you corrected it to "Unspecified rules". Unspecified is UTF-8, by specification. So UTF-8 does have a quasi-monopoly.

> No one said it was the same. But it would be interesting to know.

You drew a distinction between unspecified and utf-8 for the encoding, and you provided tools to detect both encoding and language, which made it seem like you were treating both cases the same. In fact the encoding is always known (unspecified == utf-8), while the language is sometimes unknown. I also explicitly pointed out this difference to encourage people to specify the language in their Gemini headers, as a lot of Gemini pages currently have an unspecified language.

> What's your server? We can validate that promptly.

Please stop splitting your answers into several emails like this; it needlessly fills the mailing list and makes discussions harder to follow.

I do not need to validate that: I know that my server header does not contain a charset tag.

Côme
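The asymmetry Côme describes, a default for charset but none for lang, can be sketched as follows, together with the tally it implies for the text/gemini figures quoted earlier. The function names are hypothetical; the counts are the ones cited in the thread.

```python
def effective_params(params: dict):
    """Resolve the charset and lang of a text/gemini response.

    charset falls back to "utf-8" when absent; lang has no default,
    so an absent lang stays unknown (None).
    """
    charset = params.get("charset", "utf-8").lower()
    lang = params.get("lang")
    return charset, lang


def fold_unspecified(counts: dict) -> dict:
    """Merge the "unspecified" bucket (key None) into "utf-8"."""
    folded = {}
    for charset, n in counts.items():
        key = "utf-8" if charset is None else charset
        folded[key] = folded.get(key, 0) + n
    return folded


if __name__ == "__main__":
    # text/gemini charset counts as quoted in the thread.
    stats = {None: 42322, "utf-8": 6513, "us-ascii": 3}
    print(effective_params({}))
    print(fold_unspecified(stats))
```

Once unspecified is folded in, only the three us-ascii responses fall outside UTF-8, which is the "quasi-monopoly" reading of the same numbers.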