💾 Archived View for gemi.dev › gemini-mailing-list › 000580.gmi captured on 2024-08-31 at 17:38:13. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

[spec] Limit valid encodings of text/gemini to UTF-8

1. Philip Linde (linde.philip (a) gmail.com)

This is not fully expressed in the specification, but practically, "all"
text/gemini documents are either UTF-8 or US-ASCII encoded. Stephane
Bortzmeyer compiled the following list from his crawler:

> Only for text/gemini:
> 
> * Unspecified: 5997
> * utf-8: 4619
> * tcvn-5712: 2
> * cp437: 2
> * utf-16be: 1
> * utf-16: 1
> * windows-1252: 1
> * utf-32le: 1
> * utf-32be: 1
> * utf-16le: 1
> * ebcdicatde: 1
> 
> But wait, all the exotic charsets are at <gemini://egsam.pitr.ca/>
> which is a test site for various funny stuff. So, it is safe to say
> that not one "real" gemtext resource uses something else than UTF-8.

While it is the case that impact is minimal, I suggest that the
specification reflects the much simpler situation these statistics
indicate rather than keep itself open to the general problem of
representing text/gemini in encodings that might not even have the meta
information characters encoded in the same way, and (if IRIs are
introduced) creates the problem of how IRIs should be represented in
e.g. ISO-8859-1.

I understand the need for other document types to take other character
encodings. For example, I have a collection of old text files in IBM437
encoding. For text/gemini, we pretty much have a blank slate, though,
and I see no reason that it should extend to support arbitrary
encodings when limiting to UTF-8 creates a much simpler situation for
implementers and is already the unspoken standard.

There are display systems and platforms that fundamentally can't
display UTF-8 directly. For example, in the PC text modes I am limited
to IBM437. The problem of transcoding text/gemini should then lie with
the client authors for those platforms, not with every other client
author. ELinks for DOS will for example transcode UTF-8 (and various
other encodings) to IBM437 and use a placeholder character where no
equivalents exist.
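
A minimal sketch of that kind of client-side transcoding, assuming Python and its built-in cp437 codec as a stand-in for IBM437. The function name and the "?" placeholder are illustrative and not taken from ELinks:

```python
def to_ibm437(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 input and re-encode it for an IBM437 text-mode display."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" substitutes b"?" for every character that
    # IBM437 has no equivalent for.
    return text.encode("cp437", errors="replace")


if __name__ == "__main__":
    sample = "café 日本".encode("utf-8")
    print(to_ibm437(sample))
```

Only the client for the limited platform carries this step; every other client keeps decoding UTF-8 directly.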

-- 
Philip

2. Côme Chilliet (come (a) chilliet.eu)

On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:
> I understand the need for other document types to take other character
> encodings. For example, I have a collection of old text files in IBM437
> encoding. For text/gemini, we pretty much have a blank slate, though,
> and I see no reason that it should extend to support arbitrary
> encodings when limiting to UTF-8 creates a much simpler situation for
> implementers and is already the unspoken standard.

The main reason I see for authorizing other encodings is to be future proof.

When people designed old protocols they thought ASCII was here to stay.

So, I think we should learn from the past and not set in stone that all 
files must use utf-8, maybe something else will arise and be better for 
some unsuspected reason, and people will want to use that.

Côme

3. Solderpunk (solderpunk (a) posteo.net)

> On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:

> > I understand the need for other document types to take other character
> > encodings. For example, I have a collection of old text files in IBM437
> > encoding. For text/gemini, we pretty much have a blank slate, though,
> > and I see no reason that it should extend to support arbitrary
> > encodings when limiting to UTF-8 creates a much simpler situation for
> > implementers and is already the unspoken standard.

The spec says that "Compliant clients MUST support UTF-8-encoded text/*
responses.  Clients MAY optionally support other encodings".  So, the
argument that we should make things simpler for implementers does not
really carry much weight here.  It's 100% okay to write a client which
(gracefully) refuses to handle any encoding other than UTF-8.  People
who want to serve text/gemini content with some other encoding can, but
they have no right to complain when only a subset (potentially a very
small one) of people can view said content.  This all seems fine to me.
Nobody is required or expected to support anything difficult or unusual,
but if some group of people all decide they want to do something
difficult or unusual for some strange reason, and they're willing to do
the work required, then nobody can tell them they're doing anything
wrong.

Cheers,
Solderpunk

4. Philip Linde (linde.philip (a) gmail.com)

On Mon, 28 Dec 2020 14:25:29 +0100
Côme Chilliet <come at chilliet.eu> wrote:

> The main reason I see for authorizing other encodings is to be future proof.
> 
> When people designed old protocols they thought ASCII was here to stay.

On the other hand, when people adopted Unicode 5.0, they did so fully
aware that there would likely be a Unicode 6.0, 7.0, 8.0 etc. Unicode
is future proof in the sense that it includes a process for updating
itself. The comparison to ASCII in that sense does not consider the
entirely different approaches these standards take. Where ASCII is
fixed and limited to a relatively tiny set of characters, Unicode is
deliberately open to amendment.

> So, I think we should learn from the past and not set in stone that all
> files must use utf-8, maybe something else will arise and be better for
> some unsuspected reason, and people will want to use that.

In such an event we can under the current spec either expect an
effective split of geminispace around which clients support what
encodings, or widespread client updates. This is not a very different
situation from a change to the fixed spec.

-- 
Philip

5. Philip Linde (linde.philip (a) gmail.com)

On Mon, 28 Dec 2020 14:30:38 +0100
"Solderpunk" <solderpunk at posteo.net> wrote:

> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses.  Clients MAY optionally support other encodings".  So, the
> argument that we should make things simpler for implementers does not
> really carry much weight here.  It's 100% okay to write a client which
> (gracefully) refuses to handle any encoding other than UTF-8. 

I am not so interested in what is okay or not in the abstract. As a
client author, the ideal situation for me is that my client supports the
entire per-specification geminispace. The specification currently makes
this a much harder problem than it would be if text/gemini documents
were limited to UTF-8. In fact, it's an open-ended problem that's
subject to change (as new encodings are introduced) and interpretation
(concerning what sequence of bytes represents e.g. "=>" in a
particular encoding, or how to transliterate URI to ASCII or IRI to
UTF-8). Thankfully, geminispace seems to have settled on UTF-8, which is
why I think this is a good time to tie that end up.
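
To illustrate the point about "=>", here is a small sketch, assuming Python's standard codecs (cp037 stands in for an EBCDIC variant; the helper name is hypothetical). The same two characters become different byte sequences depending on the encoding, so a client cannot look for the link marker in raw bytes:

```python
LINK_PREFIX = "=>"

# Show the raw bytes of the link marker under a few encodings.
for codec in ("utf-8", "us-ascii", "utf-16-le", "cp037"):
    print(f"{codec:10} {LINK_PREFIX.encode(codec).hex(' ')}")


def starts_link_line(raw_line: bytes, codec: str) -> bool:
    """Check for the link marker after decoding, never on the raw bytes."""
    return raw_line.decode(codec).startswith(LINK_PREFIX)
```

With UTF-8 as the only encoding, the check collapses to a plain byte comparison against b"=>".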

> People
> who want to serve text/gemini content with some other encoding can, but
> they have no right to complain when only a subset (potentially a very
> small one) of people can view said content.  This all seems fine to me.
> Nobody is required or expected to support anything difficult or unusual,
> but if some group of people all decide they want to do something
> difficult or unusual for some strange reason, and they're willing to do
> the work required, then nobody can tell them they're doing anything
> wrong.

Perhaps there is a great argument for allowing other encodings that
makes this an acceptable outcome, but a hypothetical effective split of
geminispace around which encodings are used and which clients support
them doesn't sound desirable in itself.

We can turn the question around and instead ask what motivates the
inclusion of each supported encoding (or arbitrary encodings in
general, a simpler question).

-- 
Philip

6. Philip Linde (linde.philip (a) gmail.com)

Note also, regarding the current discussion about IRI, that if IRI is
adopted and the community later adopts a different (UTF-8
incompatible) standard encoding, the way IRI is implemented has to
change, or we lose its benefits entirely when we implement a
galaxy-grade encoding in the year 2121 and still have to deal with
UTF-8 IRI.

Even if IRI is decided against, by requiring URI at all, we're betting
that this hypothetical future encoding will be ASCII-compatible, or
we're giving client authors the additional burden of transcoding.
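
A short sketch of why the document encoding and the IRI mapping are tied together, assuming the usual IRI-to-URI conversion (percent-encoding the UTF-8 bytes of non-ASCII characters, as in RFC 3987). The function name and the example path are made up for illustration:

```python
from urllib.parse import quote


def iri_path_to_uri_path(path: str, encoding: str = "utf-8") -> str:
    """Percent-encode the non-ASCII characters of a path using the given encoding."""
    return quote(path, safe="/", encoding=encoding)


if __name__ == "__main__":
    print(iri_path_to_uri_path("/café"))                      # UTF-8 bytes
    print(iri_path_to_uri_path("/café", encoding="latin-1"))  # ISO-8859-1 bytes
```

The two results name different resources on the wire, which is the transcoding burden described above.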

-- 
Philip

7. Sean Conner (sean (a) conman.org)

It was thus said that the Great Solderpunk once stated:
> > On Monday 28 December 2020, 14:16:27 CET, Philip Linde wrote:
> 
> > > I understand the need for other document types to take other character
> > > encodings. For example, I have a collection of old text files in IBM437
> > > encoding. For text/gemini, we pretty much have a blank slate, though,
> > > and I see no reason that it should extend to support arbitrary
> > > encodings when limiting to UTF-8 creates a much simpler situation for
> > > implementers and is already the unspoken standard.
> 
> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses.  Clients MAY optionally support other encodings".  

  I would amend that to read "Compliant clients MUST support UTF-8 and
US-ASCII encoded text/* responses." This is because US-ASCII is a proper
subset of UTF-8, and any valid ASCII file is also a valid UTF-8 file.  I'm
thinking here of automated MIME detection (à la libmagic) that might return a
MIME type of 'text/plain; charset=us-ascii' for a text file.
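
The subset relationship can be checked directly. This is a small sketch in Python; the helper name and sample bodies are illustrative only:

```python
def decodes_identically(body: bytes) -> bool:
    """True if the bytes are valid US-ASCII and decode to the same text as UTF-8."""
    try:
        return body.decode("us-ascii") == body.decode("utf-8")
    except UnicodeDecodeError:
        return False


ascii_body = b"# Title\r\n=> gemini://example.org/ A link\r\n"
utf8_body = "# Café\r\n".encode("utf-8")

print(decodes_identically(ascii_body))  # True: ASCII bytes are valid UTF-8
print(decodes_identically(utf8_body))   # False: the reverse does not hold
```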

  -spc

8. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 29, 2020, at 00:35, Sean Conner <sean at conman.org> wrote:
> 
>  I would amend that to read "Compliant clients MUST support UTF-8 and
> US-ASCII encoded text/* responses."

This is wholly redundant. UTF-8 is, by design, a /superset/ of US-ASCII. 

But the reverse is not true, obviously. Therefore the endless confusion.

Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

My 2¢.

9. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 29, 2020, at 09:00, Petite Abeille <petite.abeille at gmail.com> wrote:
> 
> Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

If this is too much to swallow, then we have to phrase it the other way around:

"Clients MUST support US-ASCII, and SHOULD support UTF-8"

The same applies to the request URL, and text/gemini links.

10. Sean Conner (sean (a) conman.org)

It was thus said that the Great Petite Abeille once stated:
> > On Dec 29, 2020, at 00:35, Sean Conner <sean at conman.org> wrote:
> > 
> >  I would amend that to read "Compliant clients MUST support UTF-8 and
> > US-ASCII encoded text/* responses."
> 
> This is wholly redundant. UTF-8 is, by design, a /superset/ of US-ASCII. 

  No, it's not.  Here's the original text:

> The spec says that "Compliant clients MUST support UTF-8-encoded text/*
> responses.  Clients MAY optionally support other encodings".

  Per this wording, any client that receives "text/plain; charset=us-ascii"
is allowed to just drop it on the floor and do absolutely nothing with it. 
Some here might actually prefer that, but "text/plain; charset=us-ascii" is
also "text/plain; charset=utf-8", that is, a client *can* do something
meaningful with it, unlike "text/plain; charset=CSISOLATIN3".

> Drop the US-ASCII holdovers. Embrace UTF-8. Move on.

  Why do you hate textfiles.com?

> My 2¢.

  -spc (my $.02)

11. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 29, 2020, at 10:03, Sean Conner <sean at conman.org> wrote:
> 
>  Per this wording, any client that receives "text/plain; charset=us-ascii"
> is allowed to just drop it on the floor and do absolutely nothing with it. 

Nonsense. A compliant client MUST support UTF-8. US-ASCII is a strict 
subset of UTF-8. Therefore a compliant client supports US-ASCII 
out-of-the-box.  Nothing more, and nothing less.

>  Why do you hate textfiles.com?

Haters gonna hate :P

Isn't it, like, 3am in your timezone? Go back to bed :)

12. Peter Vernigorov (pitr.vern (a) gmail.com)

On Tue, Dec 29, 2020 at 10:11 AM Petite Abeille
<petite.abeille at gmail.com> wrote:
> > On Dec 29, 2020, at 10:03, Sean Conner <sean at conman.org> wrote:
> >
> >  Per this wording, any client that receives "text/plain; charset=us-ascii"
> > is allowed to just drop it on the floor and do absolutely nothing with it.
> Nonsense. A compliant client MUST support UTF-8. US-ASCII is a strict
> subset of UTF-8. Therefore a compliant client supports US-ASCII
> out-of-the-box.  Nothing more, and nothing less.

A car contains people. Therefore people are cars.

Petite, you are confusing Is-A and Has-A relationships [1][2]. UTF-8
is a ("separate" from US-ASCII) character encoding that contains ASCII
charset. If the spec said "clients MUST support ONLY UTF-8" then any
pages specifying "charset=us-ascii" must result in an error.

[1] https://en.wikipedia.org/wiki/Is-a
[2] https://en.wikipedia.org/wiki/Has-a

Back to a more productive topic, the wording in the spec - "clients
MUST support UTF-8 encoded responses" - is ambiguous and doesn't
actually mean that acceptable value for "charset" must include
"utf-8", and says nothing about what values of "charset" are
acceptable. It says that clients must at the very least try to decode
response using UTF-8 charset decoder. Responses encoded with US-ASCII
and UTF-8 (and UTF-PETER, which is a random subset of UTF-8) will
indeed work.

Looking at latest stats on
gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
UTF-8 (this includes unspecified charsets which per spec default to
UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.

Given this, I suggest the spec be rephrased such that it instead
specifies minimum acceptable values of "charset" (specifically
us-ascii and utf-8).
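
A sketch of what such a rule could look like in a client, assuming the response META has the form "text/gemini; charset=..." and that an absent charset defaults to UTF-8, as described in this thread. The function names and the exact accepted set are illustrative, not spec text:

```python
ACCEPTED_CHARSETS = {"utf-8", "us-ascii"}


def charset_from_meta(meta: str) -> str:
    """Extract the charset parameter from a META string, defaulting to utf-8."""
    _mime_type, *params = [part.strip() for part in meta.split(";")]
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    return "utf-8"  # assumed default when no charset parameter is present


def decode_body(meta: str, body: bytes) -> str:
    """Decode a text/* body, refusing charsets outside the accepted set."""
    charset = charset_from_meta(meta)
    if charset not in ACCEPTED_CHARSETS:
        raise ValueError(f"unsupported charset: {charset}")
    return body.decode(charset)
```

A client written this way accepts "text/plain; charset=us-ascii" instead of dropping it, and still rejects anything it cannot decode.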

13. Sean Conner (sean (a) conman.org)

It was thus said that the Great Peter Vernigorov once stated:
> 
> Looking at latest stats on
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> UTF-8 (this includes unspecified charsets which per spec default to
> UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.
> 
> Given this, I suggest the spec be rephrased such that it instead
> specifies minimum acceptable values of "charset" (specifically
> us-ascii and utf-8).

  Agreed.  And looking at the stats from GUS [1], text/plain is more popular
than text/gemini (by over 2:1) and UTF-8 to US-ASCII is 54% to 46%.

  -spc

[1]	gemini://gus.guru/statistics

14. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
> Back to a more productive topic, the wording in the spec - "clients
> MUST support UTF-8 encoded responses" - is ambiguous and doesn't
> actually mean that acceptable value for "charset" must include
> "utf-8", and says nothing about what values of "charset" are
> acceptable.

Confused indeed.

Are you making a distinction between UTF-8 the encoding vs. utf-8 the
charset? Is there such a difference? What would that difference be? I feel out of my depth.

But ok, if that helps in terms of clarity of purpose, then more power to
the spec by spelling it out.

15. Alex // nytpu (alex (a) nytpu.com)

> looking at the stats from GUS [1], text/plain is more popular than
> text/gemini (by over 2:1)
If I remember correctly, most of that plaintext is from an RFC mirror
somewhere. I'm sure the actual stats for stuff written and intended for
gemini, not including mirrored content, differ significantly from GUS's
report.

-- 
Alex // nytpu
alex at nytpu.com
GPG Key: https://www.nytpu.com/files/pubkey.asc
Key fingerprint: 43A5 890C EE85 EA1F 8C88 9492 ECCD C07B 337B 8F5B
https://useplaintext.email/

16. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
> Looking at latest stats on
> gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> UTF-8 (this includes unspecified charsets which per spec default to
> UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.

The actual numbers are as follows:

* Unspecified: 39628
* us-ascii: 9995
* utf-8: 7090
(56,713 total)

It's not clear if this pertains to the 36,477 text/gemini documents only,
or the entire dataset (57,164 URLs vs. 56,713 encodings; 451 missing).

Looking at the numbers I guess it covers the entire data set as there are 
more 'Unspecified' than 'text/gemini' to start with.

I'm not sure what these numbers mean at all, but they are not describing text/gemini.

Not sure why we would draw any conclusion from them in regards to text/gemini.
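
As a worked check of the figures quoted in this message: the variable names are illustrative, and counting "Unspecified" as UTF-8 is an assumption based on the spec default mentioned earlier in the thread.

```python
counts = {"unspecified": 39628, "us-ascii": 9995, "utf-8": 7090}
urls_crawled = 57164

total = sum(counts.values())
missing = urls_crawled - total

# Treat "unspecified" as UTF-8, following the spec default.
utf8_share = 100 * (counts["unspecified"] + counts["utf-8"]) / total
ascii_share = 100 * counts["us-ascii"] / total

print(f"total encodings: {total}, missing: {missing}")
print(f"UTF-8 incl. unspecified: {utf8_share:.1f}%, US-ASCII: {ascii_share:.1f}%")
```

The totals match the 56,713 and 451 given above, and the shares are roughly in line with the 81% and 17% quoted from the stats page.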

17. Peter Vernigorov (pitr.vern (a) gmail.com)

On Wed, Dec 30, 2020 at 00:04 Petite Abeille <petite.abeille at gmail.com>
wrote:

>
>
> > On Dec 29, 2020, at 22:24, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> >
> > Looking at latest stats on
> > gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi it looks like
> > UTF-8 (this includes unspecified charsets which per spec default to
> > UTF-8) is used by 81% of pages, US-ASCII accounts for 17%.
>
> The actual numbers are as follows:
>
> * Unspecified: 39628
> * us-ascii: 9995
> * utf-8: 7090
> (56,713 total)
>
> It's not clear if this pertains to the 36,477 text/gemini documents only,
> or the entire dataset (57,164 URLs vs. 56,713 encodings; 451 missing).
>

Could you clarify which part is unclear to you here? 56,713 is, by design,
a strict /superset/ of 36k :)


> Looking at the numbers I guess it covers the entire data set as there are
> more 'Unspecified' than 'text/gemini' to start with.
>
> I'm not sure what these numbers mean at all, but they are not describing
> text/gemini.
>
> Not sure why we would draw any conclusion from them in regards to
> text/gemini.
>

While it's true that the thread subject mentions text/gemini, the oft-quoted
part of the spec is in section "3.3 Response bodies" and talks about
any text/* responses. The only mention of charset in section 5 (which
describes text/gemini) is a reference to 3.3. Also, looking at the stats for
either the entire dataset or only text/gemini shows the same picture: utf-8 and
us-ascii account for ~99% of all charset values.

18. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
> Could you clarify which part is unclear to you here? 56,713 is, by
> design, a strict /superset/ of 36k :)
> 

Math is hard, let's go shopping.

I'm sure it will all make sense at the very, very end.

19. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
> , the oft quoted part of the spec is in section "3.3 Response bodies"
> and talks about any text/* responses.

Ohhhh... right you are, I was assuming text/gemini only. My bad.

This sounds like a major overreach. Shouldn't Gemini restrict itself to 
just text/gemini as far as the Gemini spec goes?

On what ground would Gemini redefine what a text content type is?

A MIME content type of "text" is "text/plain; charset=us-ascii" by default. 

I don't quite see the point of redefining how a major piece of MIME is 
defined. Sounds counterproductive.

20. Petite Abeille (petite.abeille (a) gmail.com)



> On Dec 30, 2020, at 03:31, Peter Vernigorov <pitr.vern at gmail.com> wrote:
> 
>  "3.3 Response bodies" and talks about any text/* responses.

Actually, this is counterproductive, and wrong, technically speaking.

Consider the following response:

20 text/html
...

While HTML5 is UTF-8 by default, most vintage html is ISO-8859-1. And 
there is a lot of vintage to go around.

Defaulting to UTF-8 for all of text/* at large would break the interweb as we know it.

Why take on such burden? 

Perhaps best to narrow the spec to only speak about text/gemini. 

Other text/* media types have their own idiosyncrasies. Best to leave them alone.
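
To make the risk concrete, here is a sketch in Python of what happens when the assumed default does not match the real encoding. The sample string and helper name are invented for illustration:

```python
from typing import Optional


def try_decode(body: bytes, charset: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in that charset."""
    try:
        return body.decode(charset)
    except UnicodeDecodeError:
        return None


legacy_html = "déjà vu".encode("iso-8859-1")  # vintage page, no charset given
modern_html = "déjà vu".encode("utf-8")

print(try_decode(legacy_html, "utf-8"))       # None: not valid UTF-8
print(try_decode(modern_html, "iso-8859-1"))  # decodes, but to garbled text
```

Assuming UTF-8 for legacy ISO-8859-1 content fails outright, while the opposite assumption silently garbles the text.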

2¢

21. Stephane Bortzmeyer (stephane (a) sources.org)

On Tue, Dec 29, 2020 at 04:37:08PM -0500,
 Sean Conner <sean at conman.org> wrote 
 a message of 17 lines which said:

> > Looking at latest stats on
> > <gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi> it looks
> > like UTF-8 (this includes unspecified charsets which per spec
> > default to UTF-8) is used by 81% of pages, US-ASCII accounts for
> > 17%.

> And looking at the stats from GUS [1], text/plain is more popular
> than text/gemini (by over 2:1) and UTF-8 to US-ASCII is 54% to 46%.

This is because it includes a lot of text/plain. I've just modified
the stats at <gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi>
to have a special tally for text/gemini and UTF-8 has a
quasi-monopoly.

22. Stephane Bortzmeyer (stephane (a) sources.org)

On Mon, Dec 28, 2020 at 02:16:27PM +0100,
 Philip Linde <linde.philip at gmail.com> wrote 
 a message of 69 lines which said:

> While it is the case that impact is minimal, I suggest that the
> specification reflects the much simpler situation these statistics
> indicate rather than keep itself open to the general problem of
> representing text/gemini in encodings that might not even have the
> meta information characters encoded in the same way, and (if IRIs are
> introduced) creates the problem of how IRIs should be represented in
> e.g. ISO-8859-1.

Note also that saying "gemtexts MUST be in UTF-8" is not
everything. We may (or may not) also want to mandate end-of-lines
(they can be represented with CR, LF, CR-LF, LS or PS, the last two
being purely Unicode, not present in ASCII) and normalization.

If we go that way, there is an existing standard for Unicode text, RFC
5198 <gemini://gemini.bortzmeyer.org/rfc-mirror/rfc5198.txt>. It
mandates CR-LF and normalization NFC.
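
A minimal sketch of the two requirements named here (NFC normalization and CR-LF line endings), assuming Python's unicodedata module. The function name is hypothetical, and treating LS and PS as line breaks to be rewritten is a policy choice made for this example, not something taken from the RFC:

```python
import re
import unicodedata

# CR-LF must come first so that it is not rewritten as two separate breaks.
_LINE_BREAKS = re.compile("\r\n|\r|\n|\u2028|\u2029")


def normalize_text(text: str) -> str:
    """Apply NFC normalization and rewrite every line break as CR-LF."""
    nfc = unicodedata.normalize("NFC", text)
    return _LINE_BREAKS.sub("\r\n", nfc)


if __name__ == "__main__":
    # "e" followed by a combining acute accent, ending in a bare LF.
    print(repr(normalize_text("cafe\u0301\n")))
```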

23. Petite Abeille (petite.abeille (a) gmail.com)



> On Jan 3, 2021, at 14:49, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> It mandates CR-LF and normalization NFC.

RFC5198. Yes. Normalization, normalization.

24. Petite Abeille (petite.abeille (a) gmail.com)



> On Jan 3, 2021, at 14:46, Stephane Bortzmeyer <stephane at sources.org> wrote:
> 
> UTF-8 has a quasi-monopoly.

Not quite.

For text/gemini, your stats read:

* Unspecified: 42,322
* utf-8: 6,513
* us-ascii: 3

Unspecified rules. By far. Most likely plain ASCII in practice.


Could you run #file --mime-type --mime-encoding on all these text/gemini? 

$ openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | file --brief --mime-type --mime-encoding -
text/plain; charset=utf-8


Validating the encoding would be informative as well:

$ openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null | iconv -f utf-8 -t utf-8 > /dev/null; echo $?
0
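
The same validity check can be sketched in Python for use inside a crawler. The function name is illustrative, and the input is assumed to be the raw bytes of a response:

```python
def is_valid_utf8(body: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        body.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

Like the iconv pipeline, this validates whatever bytes it is given, so the response header line would be included unless it is stripped first.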


Ditto for guessing the actual language:

# echo $(openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null ) | polyglot detect | cut -d' ' -f1 | uniq
English

https://polyglot.readthedocs.io/en/latest/Detection.html


25. Côme Chilliet (come (a) chilliet.eu)

On Sunday 3 January 2021, 17:02:54 CET, Petite Abeille wrote:
> > On Jan 3, 2021, at 14:46, Stephane Bortzmeyer <stephane at sources.org> wrote:
> > UTF-8 has a quasi-monopoly.
> 
> Not quite.
> 
> For text/gemini, your stats read:
> 
> * Unspecified: 42,322
> * utf-8: 6,513
> * us-ascii: 3
> 
> Unspecified rules. By far. Most likely plain ASCII in practice.

No, the specification specifies that default is utf-8, so unspecified is utf-8.
I do not set the charset in my server headers as it is redundant because I 
always send utf-8.

> Ditto for guessing the actual language:
> 
> # echo $(openssl s_client -quiet -crlf -connect mozz.us:1965 <<< gemini://mozz.us/ 2>/dev/null ) | polyglot detect | cut -d' ' -f1 | uniq
> English
> 
> https://polyglot.readthedocs.io/en/latest/Detection.html

Language is not the same, because the specification explicitly says that
there is no default, so my server always sends the lang= header tag for
text/gemini content.

Côme

26. Petite Abeille (petite.abeille (a) gmail.com)



> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
> 
> No, the specification specifies that default is utf-8, so unspecified is utf-8.

Precisely my point. Thanks.

27. Petite Abeille (petite.abeille (a) gmail.com)



> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
> 
> Language is not the same, because the specification explicitly says
> that there is no default, so my server always sends the lang= header tag
> for text/gemini content.

No one said it was the same. But it would be interesting to know.

28. Petite Abeille (petite.abeille (a) gmail.com)



> On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
> 
> I do not set the charset in my server headers as it is redundant because
> I always send utf-8.

What's your server? We can validate that promptly.

29. Côme Chilliet (come (a) chilliet.eu)

On Sunday 3 January 2021, 17:13:10 CET, Petite Abeille wrote:
> 
> > On Jan 3, 2021, at 17:11, Côme Chilliet <come at chilliet.eu> wrote:
> > 
> > No, the specification specifies that default is utf-8, so unspecified is utf-8.
> 
> Precisely my point. Thanks.

No, you said "Not quite." when Stephane said UTF-8 had a quasi-monopoly, and
you corrected it to "Unspecified rules".
Unspecified is UTF-8, by specification. So UTF-8 does have a quasi-monopoly.

> No one said it was the same. But it would be interesting to know.

The fact that you made a difference between unspecified and utf-8 for 
encoding, and that you provided both tools to detect encoding and language 
made it seem like you were considering both cases the same, while encoding 
is always known (unspecified == utf-8) while language is sometimes unknown.
I also explicitly pointed out this difference to encourage people to specify
the language in their Gemini headers, as a lot of Gemini pages currently 
have an unspecified language.

> What's your server? We can validate that promptly.

Please stop splitting your answers into several emails like this; it's
needlessly filling the mailing list and making discussions harder to follow.
I do not need to validate that, I know that my server header does not 
contain a charset tag.

Côme
