Proposed minor spec changes, for comment.

solderpunk <solderpunk (a) SDF.ORG>

Ahoy!

The three month spec freeze announced, well, almost three months ago,
will be expiring soon.  Things to ponder/discuss have been piling
up.  So, I've been considering dealing with some of the "low hanging
fruit" early (I have some time off work later this week because of a
national holiday).  I'm thinking in particular of fairly minor
changes, where it is obvious that there is a problem in what's already
specced or important functionality is missing, and where there are
fairly obvious solutions.

To this end, I'm going to outline some proposals below for feedback.
I *hope* that these will be pretty uncontroversial.  Feedback is
welcome, as always, but we have to do *something* about these issues,
so if you really think what I propose below is a bad idea, a better
alternative would be a very good thing to bring to the discussion!

Here we go, then...

ISSUE 1:

Problem: The current spec does not impose any limit on request header
length.  The status code and META field can be separated by
arbitrarily many spaces and/or tabs.  Malicious or buggy servers can
hang or crash carelessly written clients by sending an infinite stream
of whitespace.  It's not clear *why* anybody would want to do this (a
"reverse DOS attack" is not very useful!), but it's clearly a problem
nevertheless.

Proposal: Redefine response headers from:

<STATUS><whitespace><META><CR><LF>

to:

<STATUS> <META><CR><LF>

i.e. exactly one space character between <STATUS> and <META>

Rationale: Allowing multiple whitespace characters of different kinds
makes sense in, e.g., the link syntax of text/gemini - that has to be
written and read by human content authors, so it's a good idea to
accommodate different editor behaviours and different personal
preferences for laying things out.  But response headers are written
and read by software, so there's no need to be so generous.
Specifying the header format more precisely actually just makes life
slightly easier for client authors.  As a result of this, the maximum
length of a response header becomes finite (as the lengths of <STATUS>
and <META> are already well defined elsewhere).

Client authors who want to follow Postel's law won't need to make any
changes here.  I imagine many server authors also won't actually need
to.  Most probably either no change is needed (the server already
sends one space) or a single s/\t/ / is needed.
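
A minimal sketch of what a strict client-side parser for the proposed
format could look like, in Python.  The function name and the 1024-byte
META cap are assumptions made for illustration; the real limits are
whatever the spec defines for <STATUS> and <META>.

```python
from typing import Tuple

MAX_META = 1024  # ASSUMED cap on META in bytes; take the real value from the spec
# two status digits + one space + META + CRLF
MAX_HEADER = 2 + 1 + MAX_META + 2


def parse_response_header(line: bytes) -> Tuple[int, str]:
    """Parse b'<STATUS> <META>\\r\\n' with exactly one space as separator."""
    if len(line) > MAX_HEADER:
        # A bounded header means a stream of endless whitespace is rejected early.
        raise ValueError("response header too long")
    if not line.endswith(b"\r\n"):
        raise ValueError("response header must end with CRLF")
    status, sep, meta = line[:-2].partition(b" ")
    if len(status) != 2 or not status.isdigit() or sep != b" ":
        raise ValueError("expected a two-digit status followed by one space")
    # Anything after the first space (including further spaces) is treated as META.
    return int(status), meta.decode("utf-8")
```

A caller would read at most MAX_HEADER bytes from the socket before
giving up, so a header that never terminates cannot hang the client.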

ISSUE 2:

Problem: The spec makes a big fuss about how text/gemini is
line-oriented, but does not clearly state what exactly constitutes a
line.  The definition of link lines includes a <CR><LF> at the end but
it's not clear if that applies to all line types - or whether I even
meant to do this or it was a careless error.

Proposal: Actually, it turns out this is decided for us.  RFC2046,
which defines the text/* MIME media type and the text/plain subtype,
covers this very clearly:

---
4.1.1.  Representation of Line Breaks

   The canonical form of any MIME "text" subtype MUST always represent a
   line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
   MIME "text" MUST represent a line break.  Use of CR and LF outside of
   line break sequences is also forbidden.

   This rule applies regardless of format or character set or sets
   involved.
---

Since text/gemini is, well, text/gemini, it is a "text" subtype and
using anything other than CRLF means we're violating the RFCs we're
supposedly building on top of.

So, CRLF everywhere it is.

I propose it be mostly the server's job to handle this.  Text editors
on different operating systems used by content authors will use
various different line break encodings which are beyond our control,
so we can't really make it the author's job.  Servers can translate LF
to CRLF before sending content over the network.  This way clients
only need to handle the "canonical" format, no matter what authors do.
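
As a rough illustration of the server-side translation, here is a
minimal Python sketch.  The function name is hypothetical, and
normalising bare CR as well as bare LF is an extra assumption beyond
the LF-to-CRLF translation proposed above.

```python
import re

# Match CRLF first so an existing CRLF is not turned into CR CR LF.
_ANY_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def to_canonical_line_breaks(body: bytes) -> bytes:
    """Rewrite every line break in a text/* body as CRLF.

    Only safe for ASCII-compatible encodings such as UTF-8; a UTF-16
    body would need to be decoded before translation.
    """
    return _ANY_LINE_BREAK.sub(b"\r\n", body)
```

The function is idempotent, so content that already uses CRLF passes
through unchanged.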

Rationale: Don't break foundational RFCs.

Yeah, I know, this is tedious and no fun for server authors, but, well,
see above.

ISSUE 3:

Problem: There's no way to specify the (human) language a text/gemini document
is written in.

Proposal: Define a new parameter for the text/gemini MIME type
(alongside the previously defined `charset`) to specify language.
Following the example set by HTML, it seems natural to call the
parameter `lang` and to allow values as per RFC1766, e.g.:

text/gemini; charset=utf-8; lang=en
text/gemini; charset=utf-8; lang=en-US
text/gemini; charset=utf-8; lang=en-GB
text/gemini; charset=utf-8; lang=es
text/gemini; charset=utf-8; lang=fr
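
A small sketch of how a client might pull these parameters out of the
META field, in Python.  The function name is made up for illustration,
and the naive split on ";" is an assumption that ignores quoted
parameter values containing semicolons.

```python
from typing import Dict, Tuple


def parse_meta(meta: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; key=value; ...' into a MIME type and parameters."""
    mime, *raw_params = [part.strip() for part in meta.split(";")]
    params = {}
    for raw in raw_params:
        key, sep, value = raw.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            params[key.strip().lower()] = value.strip().strip('"')
    return mime.lower(), params


mime, params = parse_meta("text/gemini; charset=utf-8; lang=en-GB")
language = params.get("lang")  # None when the server sent no lang parameter
```

Returning None for a missing `lang` leaves the choice of fallback to
the client rather than baking in a default language.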

Rationale: A protocol for a global network which targets human beings
reading textual content as its first-class application shouldn't be
Anglocentric!  Gemini already has:


  without the ability to limit searches to target languages.

This looks a bit scary at first from an extensibility point of view,
because it does kind of open the door to defining all sorts of
additional parameters.  However, the pre-existing MIME RFCs we're
leveraging here make it pretty clear that (i) these things aren't
open-ended, each MIME type and subtype has a fixed and finite set of
defined parameters, and (ii) that only certain kinds of semantic
information are really appropriate here.  So this is about as safe as
extensibility gets.

ISSUE 4:

Problem: Name-based virtual hosting is explicitly described as being
supported in the spec, but no mention is made of SNI (Server Name
Indication, a TLS extension which puts the desired server hostname in
the TLS handshake).  Without this, virtual hosting can't be made to
work reliably.

Proposal: Mandate use of SNI by clients.

Rationale: Earlier I proposed speccing that clients SHOULD use SNI but
requiring that servers be robust against its absence, by assigning a
default hostname.  Upon more thought, this won't work.  I was thinking
about how a missing Host: header was handled in this situation in
HTTP, where a default host works just fine.  But with TLS involved,
this is a problem: if the default host is not the one the client has
actually requested, the default certificate's Common Name and Subject
Alternative Names won't match what the client expects and the
certificate will be rejected.  So, I think we just have to require
SNI.
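
For server authors, a hedged sketch of certificate selection by SNI
using Python's standard `ssl` module.  The helper name and the mapping
of hostnames to contexts are hypothetical; refusing the handshake when
no name is offered is one possible policy, not something the spec
dictates.

```python
import ssl


def make_sni_callback(contexts_by_host):
    """Build a callback suitable for SSLContext.sni_callback.

    contexts_by_host maps a hostname to an ssl.SSLContext already loaded
    with that host's certificate and key (an assumed setup).
    """
    def callback(tls_socket, server_name, _listening_context):
        chosen = contexts_by_host.get(server_name) if server_name else None
        if chosen is None:
            # No SNI, or an unknown host: abort instead of serving a
            # default certificate that the client would reject anyway.
            return ssl.ALERT_DESCRIPTION_HANDSHAKE_FAILURE
        tls_socket.context = chosen  # switch to the matching certificate
        return None
    return callback
```

It would be installed with something like
`listening_context.sni_callback = make_sni_callback(contexts)`.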

If you're a client developer, please check whether or not the TLS
library you are using supports SNI!  If not, let me know.  I imagine
in this day and age they all will, so this won't be a burdensome
requirement.
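
For client authors using Python, a minimal sketch of what "using SNI"
amounts to with the standard `ssl` module.  The function name is
hypothetical, and certificate validation policy (CA-based or
otherwise) is deliberately left to the caller-supplied context.

```python
import socket
import ssl
from typing import Optional

GEMINI_PORT = 1965  # the Gemini port mentioned elsewhere in this thread


def open_gemini_connection(host: str, port: int = GEMINI_PORT,
                           context: Optional[ssl.SSLContext] = None,
                           timeout: float = 10.0) -> ssl.SSLSocket:
    """Open a TLS connection that announces `host` via SNI."""
    if not ssl.HAS_SNI:
        raise RuntimeError("this TLS library build lacks SNI support")
    if context is None:
        context = ssl.create_default_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # Passing server_hostname is what puts the name in the handshake.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

`ssl.HAS_SNI` is the quick way to check whether the underlying TLS
library supports the extension at all.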

That's it!

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 08:35:44PM +0000, solderpunk wrote:

Whoops, small mistake:

> ISSUE 1:
> 
> Problem: The current spec does not impose any limit on request header
> length.

This should say "response header", not "request header".

Cheers,
Solderpunk

colecmac@protonmail.com <colecmac (a) protonmail.com>

These are good additions, thanks.

I take issue with, maybe predictably, number 2. I didn't realize
that RFC existed, and I was surprised to read it. Are there any
other formats that actually follow this? Take markdown for
example, plenty of people write markdown with only a \n at the
end of each line, and it doesn't affect anything. In fact, in the
markdown spec, it explicitly says just \n is fine:
https://spec.commonmark.org/0.29/#line-ending

I don't see the point in following the RFC in this case, I think
it adds needless complication to the spec. Obviously it's not
such a big change to server software, but I don't think there's a
good reason to add it.

> Rationale: Don't break foundational RFCs.

It would be hard to argue with this, except that there seems to be
a precedent of not caring about this specific part of this RFC.
I think Gemini would be less surprising and more simple, if it
disregarded this like most other specs seem to do.

I'm in favor of defining a line ending as \r\n OR \n.

Thoughts?

makeworld

Sean Conner <sean (a) conman.org>

It was thus said that the Great solderpunk once stated:
> ISSUE 3:
> 
> Problem: There's no way to specify the (human) language a text/gemini document
> is written in.
> 
> Proposal: Define a new parameter for the text/gemini MIME type
> (alongside the previously defined `charset`) to specify language.
> Following the example set by HTML, it seems natural to call the
> parameter `lang` and to allow values as per RFC1766, e.g.:
> 
> text/gemini; charset=utf-8; lang=en
> text/gemini; charset=utf-8; lang=en-US
> text/gemini; charset=utf-8; lang=en-GB
> text/gemini; charset=utf-8; lang=es
> text/gemini; charset=utf-8; lang=fr

  What's a client to do if 'lang=' isn't there?  Assume English?  Assume
nothing?

  -spc

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote:
 
>   What's a client to do if 'lang=' isn't there?  Assume English?  Assume
> nothing?

Good question.  Should clients which do something with this information
(like screenreaders) have a default language setting which users can
set?

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote:
 
>   What's a client to do if 'lang=' isn't there?  Assume English?  Assume
> nothing?

Specifying English, or any other language, as a default value doesn't
seem right to me.

Realistically, the majority of clients will not do anything with this
information.  Those which *do* use it in some way will know best what
the most sensible behaviour is in the absence of explicit information.
So I propose that client behaviour in the absence of a specified
language is up to the client.

The search engine case seems the trickiest to me (compared to, say,
screen reading).  Statistical language recognition is a thing, but doing
it on every document in Geminispace is likely to be burdensome...

Cheers,
Solderpunk

jan6@tilde.ninja <jan6 (a) tilde.ninja>

On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote:
> What's a client to do if 'lang=' isn't there? Assume English? Assume nothing?

I'd think only the mimetype should be mandatory, and the rest will use
defaults, when not specified...
of course, spec shouldn't specify what the defaults are...

it could also attempt to auto-detect and prompt user if it matters
(normal text browsers will probably be indifferent, but audio browser
could ask, and search engines could warn, which will incentivize users
to put a language anyway), but that's a client-specific extra...

I'm not sure I see the point in the encoding part, though...
practically everything can be converted to utf8 rather easily, making
it a bit useless to specify...

another interesting point, what specification is the lang= tag?
it should probably be encouraged to use some special use codes too,
taking ISO 639-2 as example (standard specifying three-letter codes
for languages):
mis, for "uncoded languages";
mul, for "multiple languages";
und, for "undetermined";
zxx, for "no linguistic content; not applicable";

where, AFAIK, "mis" would apply for languages not in the spec,
probably stuff like Toki Pona
"mul" wouldn't be that useful on its own, but maybe when "mul" is
specified and there are specified multiple languages in addition, it
could help?
"zxx" would apply for art, because if you happen to have an ascii art
gallery or so, there's no point in indexing it fully (you can just
have a list of all the names on another page, to get them listed), and
no point in reading it aloud...

there's probably a somewhat similar part of whatever the spec is, or
similar convention

Steve Ryan <stryan (a) saintnet.tech>

On 20/05/18 08:35PM, solderpunk wrote:
> ISSUE 2:
> 
> Problem: The spec makes a big fuss about how text/gemini is
> line-oriented, but does not clearly state what exactly constitutes a
> line.  The definition of link lines includes a <CR><LF> at the end but
> it's not clear if that applies to all line types - or whether I even
> meant to do this or it was a careless error.
> 
> Proposal: Actually, it turns out this is decided for us.  RFC2046,
> which defines the text/* MIME media type and the text/plain subtype
> covers this very clearly:
> 
> ---
> 4.1.1.  Representation of Line Breaks
> 
>    The canonical form of any MIME "text" subtype MUST always represent a
>    line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
>    MIME "text" MUST represent a line break.  Use of CR and LF outside of
>    line break sequences is also forbidden.
> 
>    This rule applies regardless of format or character set or sets
>    involved.
> ---
> 
> Since text/gemini is, well, text/gemini, it is a "text" subtype and
> using anything other than CRLF means we're violating the RFCs we're
> supposedly building on top of.
> 
> So, CRLF everywhere it is.
> 
> I propose it be mostly the server's job to handle this.  Text editors
> on different operating systems used by content authors will use
> various different line break encodings which are beyond our control,
> so we can't really make it the author's job.  Servers can translate LF
> to CRLF before sending content over the network.  This way clients
> only need to handle the "canonical" format, no matter what authors do.
> 
> Rationale: Don't break foundational RFCs.
> 
> Yeah, I know, this is tedious and no fun for server authors, but, well,
> see above.

My only concern with this is the "server's job" part. I'd rather not
have my server transform user-supplied content, even if it's something
as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why
should SecretShop fix invalid text/gemini? Seems to me this should be
handled by something like the gemini vim-syntax plugin.

It also makes writing servers a bit more complicated since text/gemini
has to be treated differently from other files and actually parsed
versus being directly served up. Not the biggest deal (and you've
already admitted it's tedious) but just something I noticed.

> 
> ISSUE 4:
> 
> Problem: Name-based virtual hosting is explicitly described as being
> supported in the spec, but no mention is made of SNI (Server Name
> Indication, a TLS extension which puts the desired server hostname in
> the TLS handshake).  Without this, virtual hosting can't be made to
> work reliably.
> 
> Proposal: Mandate use of SNI by clients.
> 

SecretShop implements virtual-hosting with the assumption that clients
are using SNI, so I'm in favour.

-Steve

jan6@tilde.ninja <jan6 (a) tilde.ninja>

May 19, 2020 12:51 AM, "Steve Ryan" <stryan at saintnet.tech> wrote:
> My only concern with this is the "server's job" part. I'd rather not
> have my server transform user-supplied content, even if it's something
> as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why
> should SecretShop fix invalid text/gemini? Seems to me this should be
> handled by something like the gemini vim-syntax plugin.
> 
> It also makes writing servers a bit more complicated since text/gemini
> has to be treated differently from other files and actually parsed
> versus being directly served up. Not the biggest deal (and you've
> already admitted it's tedious) but just something I noticed.

well, teeeechnically it'd have to do this for all text/* files,
according to the rfc, including plaintext and whatnot... although
THAT's unlikely to happen...

I'd also be in favor of server handling it, although that is a
kinda-valid point...
html doesn't count because the browsers will do their best to fix all
kinds of TOTALLY BROKEN html, you can have partial tags, no end tags
for some, etc, and the browser will never tell the user your site is a
trashpile, it will just silently try its best to fix it up (even in
inspect element, you'd see the "repaired" version)...

epoch <epoch (a) enzo.thebackupbox.net>

On Mon, May 18, 2020 at 10:00:56PM +0000, jan6 at tilde.ninja wrote:
> May 19, 2020 12:51 AM, "Steve Ryan" <stryan at saintnet.tech> wrote:
> > My only concern with this is the "server's job" part. I'd rather not
> > have my server transform user-supplied content, even if it's something
> > as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why
> > should SecretShop fix invalid text/gemini? Seems to me this should be
> > handled by something like the gemini vim-syntax plugin.
> > 
> > It also makes writing servers a bit more complicated since text/gemini
> > has to be treated differently from other files and actually parsed
> > versus being directly served up. Not the biggest deal (and you've
> > already admitted it's tedious) but just something I noticed.
> 
> well, teeeechnically it'd have to do this for all text/* files,
> according to the rfc, including plaintext and whatnot... although
> THAT's unlikely to happen...
> 
> I'd also be in favor of server handling it, although that is a
> kinda-valid point...
> html doesn't count because the browsers will do their best to fix all
> kinds of TOTALLY BROKEN html, you can have partial tags, no end tags
> for some, etc, and the browser will never tell the user your site is a
> trashpile, it will just silently try its best to fix it up (even in
> inspect element, you'd see the "repaired" version)...

https://tools.ietf.org/html/rfc5147#section-4.1

"   In Internet messages, line endings in text/plain MIME entities are
   represented by CR+LF character sequences (see RFC 2046 [3] and RFC
   3676 [6]).  However, some protocols (such as HTTP) additionally allow
   other conventions for line endings."

precedent of protocols allowing for other line endings, so the gemini spec
could just say the same thing that http does about line endings.

https://tools.ietf.org/html/rfc2616#section-3.7.1

"   When in canonical form, media subtypes of the "text" type use CRLF as
   the text line break. HTTP relaxes this requirement and allows the
   transport of text media with plain CR or LF alone representing a line
   break when it is done consistently for an entire entity-body. HTTP
   applications MUST accept CRLF, bare CR, and bare LF as being
   representative of a line break in text media received via HTTP."

Sean Conner <sean (a) conman.org>

It was thus said that the Great jan6 at tilde.ninja once stated:
> On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote:
> > What's a client to do if 'lang=' isn't there? Assume English? Assume nothing?
> 
> I'd think only the mimetype should be mandatory, and the rest will use
> defaults, when not specified...
> of course, spec shouldn't specify what the defaults are...
> 
> it could also attempt to auto-detect and prompt user if it matters
> (normal text browsers will probably be indifferent, but audio browser
> could ask, and search engines could warn, which will incentivize users
> to put a language anyway), but that's a client-specific extra...

  I thought about autodetection---Unicode is defined in blocks, where each
alphabet becomes a defined block in Unicode.  I then realized that there are
multiple languages that use the European block.  Sure, detecting Greek is
easy since they have their own alphabet, but what about Spanish, French and
German?  They use the same alphabet.
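
  To make the point concrete, here is a rough Python sketch that tallies
letters by the first word of their Unicode character name, used here as
a crude stand-in for "script".  The helper name is invented, and the
name-prefix heuristic is an assumption for illustration, not a real
script-detection method.

```python
import unicodedata
from collections import Counter


def script_histogram(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ")[0]] += 1  # e.g. "GREEK", "LATIN"
    return counts
```

  Greek text is reported as GREEK, but Spanish, French and German text
all come back as LATIN, so the script alone cannot tell them apart.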

  Nice idea, but there are some tough issues to address.

> I'm not sure I see the point in the encoding part, though...
> practically everything can be converted to utf8 rather easily, making
> it a bit useless to specify...

  Think legacy documents.  And not every legacy encoding scheme can round
trip through Unicode---I recall there being issues with several east Asian
languages (Chinese, Japanese in particular).

> another interesting point, what specification is the lang= tag?

  Solderpunk mentioned RFC-1766, which uses the two letter standard for
languages.

> it should probably be encouraged to use some special use codes too,
> taking ISO 639-2 as example (standard specifying three-letter codes
> for languages):
> mis, for "uncoded languages";
> mul, for "multiple languages";
> und, for "undetermined";
> zxx, for "no linguistic content; not applicable";

  I buy that.

  -spc

kaoD <elkaod (a) gmail.com>

Glad to hear about SNI being in the radar. It's a must for virtual hosting.

Any thoughts about SNI interaction with the current "host in request URL is
like Host header" in the spec? Since SNI does the virtual hosting part (and
better) it would only be useful for proxying other hosts AFAICT.

Is proxying allowed currently in any server? Is it even desirable in the
protocol? Or is it just an idea that ossified in the spec without real
world use? (Genuine questions! I don't see the use but I'm sure it's been
discussed and I'm just late to the party.)

Most servers (all I've tried, circumlunar.space included) fail to handle
host-less requests (out of spec) and deny proxying other hosts.

And I'm pretty sure clients are adapting to this behavior. I'm afraid this
will end up being the de facto standard even with SNI making it obsolete.

Cheers,
kaoD

Sean Conner <sean (a) conman.org>

It was thus said that the Great kaoD once stated:
> Glad to hear about SNI being in the radar. It's a must for virtual hosting.

  Yup.  At least two servers that I am aware of implement it, GLV-1.12556 (I
wrote this one) and gemserv.

> Any thoughts about SNI interaction with the current "host in request URL is
> like Host header" in the spec? 

  It's how GLV-1.12556 determines what set of content handlers to look
through when serving up a request.  The public server I have doesn't serve
multiple hosts, but the code running it can.

> Since SNI does the virtual hosting part (and
> better) it would only be useful for proxying other hosts AFAICT.
> 
> Is proxying allowed currently in any server? 

  I saw something about gemserv supporting proxying, but I don't know the
details of how it works.

  As for proxy support in GLV-1.12556 (the only server I can speak
authoritatively about), it would be easy to write a handler to support a
proxy like:

	gemini://gemini.conman.org/proxy/mozz.us/journal/2020-05-06.gmi

  But if by "proxy" you mean you connect to gemini.conman.org and expect the
request itself to be proxied:

		gemini://mozz.us/journal/2020-05-06.gmi

that ... could be done, but it would require two things---1) your client
would have to know to use the gemini.conman.org certificate to connect to my
server and 2) my server would have to know to proxy this domain (and
supporting that type of proxy in GLV-1.12556 would require some
thought---the server isn't set up for that type of thing [1]).

> Is it even desirable in the
> protocol? Or is it just an idea that ossified in the spec without real
> world use? (Genuine questions! I don't see the use but I'm sure it's been
> discussed and I'm just late to the party.)

  When the suggestion was made to use the URL as the request (which would
give us multidomain support with a server), solderpunk also saw a proxy as
being easy to implement, without thinking about the implications.  

> Most servers (all I've tried, circumlunar.space included) fail to handle
> host-less requests (out of spec) and deny proxying other hosts.
> 
> And I'm pretty sure clients are adapting to this behavior. I'm afraid this
> will end up being the de facto standard even with SNI making it obsolete.

  Huh?  I don't understand the concern here.

  -spc
[1]	Nor does it support multiple domains with a single certificate/key
	pair.  Right now, each server requires its own certificate/key file.

Sean Conner <sean (a) conman.org>

It was thus said that the Great Sean Conner once stated:

>   But if by "proxy" you mean you connect to gemini.conman.org and expect the
> request itself to be proxied:
> 
> 		gemini://mozz.us/journal/2020-05-06.gmi
> 
> that ... could be done, but it would require two things---1) your client
> would have to know to use the gemini.conman.org certificate to connect to my
> server and 2) my server would have to know to proxy this domain (and
> supporting that type of proxy in GLV-1.12556 would require some
> thought---the server isn't set up for that type of thing [1]).

  I just thought of a way it could be done---I could have a copy of the
certificate/key pair for the domain I'm proxying for.  Then it would be easy
to write a handler to forward the request to the original domain and return
the contents.

  -spc

> [1]	Nor does it support multiple domains with a single certificate/key
> 	pair.  Right now, each server requires its own certificate/key file.

int 80h <int (a) 80h.dev>

I was also wondering about how autodetection would work. Instead of
autodetection how about having the language be on the first line in a
file and have the server parse that for the information? Sorry if that
was already discussed.

int 80

int 80h <int (a) 80h.dev>

On Mon, May 18, 2020 at 9:04 PM kaoD <elkaod at gmail.com> wrote:
> Is proxying allowed currently in any server? Is it even desirable in the
> protocol? Or is it just an idea that ossified in the spec without real
> world use? (Genuine questions! I don't see the use but I'm sure it's been
> discussed and I'm just late to the party.)

Gemserv has reverse proxy support. That's what's being used to proxy
gemini to gopher on my site.

int 80h

colecmac@protonmail.com <colecmac (a) protonmail.com>

> > Glad to hear about SNI being in the radar. It's a must for virtual hosting.

>   Yup.  At least two servers that I am aware of implement it, GLV-1.12556 (I
> wrote this one) and gemserv.

Jetforce does as well, I'm not sure about others.

makeworld

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 09:45:35PM +0000, jan6 at tilde.ninja wrote:
 
> I'm not sure I see the point in the encoding part, though...
> practically everything can be converted to utf8 rather easily, making
> it a bit useless to specify...

That's been in the spec since early days, the only proposed change here
is the language part.  The charset parameter is explicitly optional and
the default in its absence is UTF-8, which clients MUST support.
Support for additional encodings in clients is optional, but they should
fail gracefully in the face of things they can't handle.

I expect 90%+ of content to be UTF-8 and for servers not to bother
specifying this as it's the explicit default, but it seemed unwise to be
completely incapable of handling anything else.
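
A minimal sketch of that behaviour in Python: default to UTF-8 when no
charset is given, and fail gracefully on anything the client cannot
decode.  The function name and the choice of returning None on failure
are assumptions for illustration.

```python
from typing import Optional


def decode_body(body: bytes, charset: Optional[str] = None) -> Optional[str]:
    """Decode a text/* body, treating a missing charset as UTF-8.

    Returns None when the charset is unknown or the bytes do not decode,
    so the caller can show an error instead of crashing.
    """
    encoding = charset or "utf-8"
    try:
        return body.decode(encoding)
    except LookupError:
        return None  # charset name this Python build does not know
    except UnicodeDecodeError:
        return None  # bytes are not valid in the declared charset
```

A client might offer to save the raw bytes to disk when this returns
None.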

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 08:07:53PM -0400, Sean Conner wrote:

> Sure, detecting Greek is
> easy since they have their own alphabet, but what about Spanish, French and
> German?  They use the same alphabet.

I don't think it's viable for interactive user clients (especially light
and simple ones) to attempt this, but in the context of, say, a search
engine which really wants to categorise everything (which is not to say
that GUS necessarily has to shoulder this burden!), even distinguishing
languages with the same alphabet is possible by looking at bigram and
trigram frequencies if there's enough text.  German text will have many
more occurrences of "lich" and "heit" than French or Spanish, etc.
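
A toy Python sketch of the counting idea, nothing like a real language
classifier.  The function names are invented, and the marker list is
just the two strings mentioned above.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def marker_score(text: str, markers) -> int:
    """Total occurrences of the given marker n-grams in text."""
    return sum(ngram_counts(text, len(m))[m.lower()] for m in markers)


GERMAN_MARKERS = ["lich", "heit"]  # the examples from this message only
```

A real detector would compare full n-gram frequency profiles across
many languages and would need a reasonable amount of text per document.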
 
>   Nice idea, but there are some tough issues to address.

Yeah, this language proposal may have been poorly categorised as "quick
and easy" compared to the others.

Cheers,
Solderpunk

plugd <plugd (a) thelambdalab.xyz>


Sean Conner writes:
>   I thought about autodetection---Unicode is defined in blocks, where each

Please, _please_ don't require autodetection of language or anything
else.  Gopher required autodetection of charsets and content type, which
was horrible.  One of the best things about the current protocol is that
it makes these things explicit.

Tim

solderpunk <solderpunk (a) SDF.ORG>

On Tue, May 19, 2020 at 03:04:25AM +0200, kaoD wrote:

> Is proxying allowed currently in any server? Is it even desirable in the
> protocol? Or is it just an idea that ossified in the spec without real
> world use? (Genuine questions! I don't see the use but I'm sure it's been
> discussed and I'm just late to the party.)

I used the proxying capabilities of Gemini to write a Gopher-to-Gemini
proxy named Agena:

https://tildegit.org/solderpunk/agena

It is basically a Gemini server, which when it receives requests for
gopher:// URLs, downloads the content from Gopher, and, in the case of
Gopher menus, converts them to text/gemini menus and replies with that,
and in the case of anything else does MIME type and (for text/*) charset
detection, and replies with that with the appropriate header.

AV-98 lets you specify the host and port of such a proxy and if you try
to follow a link into Gopherspace, it routes it through the proxy.
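
A hedged Python sketch of that routing decision.  This is not AV-98's
actual code; the function name, the `proxies` mapping and the example
hostnames are all hypothetical.

```python
from urllib.parse import urlsplit

GEMINI_PORT = 1965


def route_request(url: str, proxies=None):
    """Decide where to connect and what request line to send.

    proxies maps a URL scheme (e.g. "gopher") to the (host, port) of a
    Gemini proxy that accepts requests for that scheme.
    """
    proxies = proxies or {}
    parts = urlsplit(url)
    if parts.scheme in proxies:
        host, port = proxies[parts.scheme]
    elif parts.scheme == "gemini":
        host, port = parts.hostname, parts.port or GEMINI_PORT
    else:
        raise ValueError("no handler or proxy for scheme %r" % parts.scheme)
    # The request is the full URL followed by CRLF, whoever receives it.
    return host, port, (url + "\r\n").encode("utf-8")
```

The client connects to the returned host and port over TLS and sends
the request bytes unchanged, so a proxy sees the original gopher:// URL.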

This makes it possible to seamlessly surf Geminispace and Gopherspace
from within the same client with a consistent interface without
requiring client authors to go through the trouble of adding support for
both protocols.

I think this is a super neat idea, but as far as I know AV-98 is the
only client which supports it and, since I wrote Agena, multiple other
clients have actually gained native Gemini and Gopher support (like
Bombadillo and Castor), so I guess the incentive is reduced (which is a
shame, because there are at least two public Agena proxies running, as
far as I know, one of which only went up very recently).

But I think it demonstrates that the proxying idea does have really
interesting and useful applications.

If you wanted to surf Geminispace from work during your lunch break but
your employer's firewall disallows outgoing traffic on port 1965, you
could set up a proxying Gemini server on a server you control which
listens on a port other than 1965 and restrict access to it by requiring
a client certificate which is on a whitelist, and whitelist only your
work machine(s), to work around this.

I'm sure other examples exist, too.

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

On Mon, May 18, 2020 at 05:37:15PM -0500, epoch wrote:
 
> https://tools.ietf.org/html/rfc5147#section-4.1
> 
> "   In Internet messages, line endings in text/plain MIME entities are
>    represented by CR+LF character sequences (see RFC 2046 [3] and RFC
>    3676 [6]).  However, some protocols (such as HTTP) additionally allow
>    other conventions for line endings."
> 
> precedent of protocols allowing for other line endings, so the gemini spec
> could just say the same thing that http does about line endings.
> 
> https://tools.ietf.org/html/rfc2616#section-3.7.1
> 
> "   When in canonical form, media subtypes of the "text" type use CRLF as
>    the text line break. HTTP relaxes this requirement and allows the
>    transport of text media with plain CR or LF alone representing a line
>    break when it is done consistently for an entire entity-body. HTTP
>    applications MUST accept CRLF, bare CR, and bare LF as being
>    representative of a line break in text media received via HTTP."

Aaah, wonderful!  Thanks for bringing this to my attention.

I *had* been wondering whether or not mainstream webservers were doing
newline translation for text/html, which I had convinced myself they
should have been doing, in principle.

Okay, so we can actually spec whatever we like here for stuff that goes
over the wire. CRLF is only mandatory for "canonical form" (although if
that's not what is on the disk and not what is on the wire, what the
heck *is* it?).

So now the question is which do we prefer, forcing servers to
canonicalise (canonise?) everything, with client behaviour then being
strictly limited, or letting servers just send whatever the author uses
and forcing clients to accept any of CRLF, CR or LF?

In general I have preferred pushing burden to server authors over client
authors for Gemini...but I expect the client burden is pretty small here
as most languages will have some kind of functionality for breaking
strings into lines which already handles this...

Cheers,
Solderpunk

Link to individual message.

defdefred <defdefred (a) protonmail.com>

On Tuesday 19 May 2020 02:07, Sean Conner <sean at conman.org> wrote:

> I thought about autodetection---Unicode is defined in blocks, where each
> alphabet becomes a defined block in Unicode. I then realized that there are
> multiple languages that use the European block. Sure, detecting Greek is
> easy since they have their own alphabet, but what about Spanish, French and
> German? They use the same alphabet.

Autodetection is necessary for documents using multiple languages.
Browser preference is fine for hard-to-detect cases.

> > I'm not sure I see the point in the encoding part, though...
> > practically everything can be converted to utf8 rather easily, making
> > it a bit useless to specify...
>
> Think legacy documents. And not every legacy encoding scheme can round
> trip through Unicode---I recall there being issues with several east Asian
> languages (Chinese, Japanese in particular).

UTF-8 is capable of representing all existing languages and more :-).
Legacy documents that are not converted to UTF-8 are just out of the Gemini browser's scope.
Must a Gemini browser be able to display every kind of document?

Regards,
freD.

Link to individual message.

jan6@tilde.ninja <jan6 (a) tilde.ninja>

May 19, 2020 10:46 AM, "solderpunk" <solderpunk at sdf.org> wrote:
> So now the question is which do we prefer, forcing servers to
> canonicalise (canonise?) everything, with client behaviour then being
> strictly limited, or letting servers just send whatever the author uses
> and forcing clients to accept any of CRLF, CR or LF?
> 
> In general I have preferred pushing burden to server authors over client
> authors for Gemini...but I expect the client burden is pretty small here
> as most languages will have some kind of functionality for breaking
> strings into lines which already handles this...
> 
> Cheers,
> Solderpunk

well, if it's ONLY CRLF, then it enables making basic clients in *ANY* 
language, including pure assembly, if someone were brave enough (well, the 
core part, you'd probably write the TLS part in C or something, still)

maybe change the spec slightly, so, after the header, there's CRLF, 
immediately after which is an empty line with the line ending that applies 
for the rest of the document, and require at least one non-empty line after that?
that wouldn't matter for more advanced languages, which can just ignore 
that and use builtin functions, but for more basic ones, it makes it easy 
to read, "is it LF, else is it CR followed by LF, else assume CR"... while 
it would be possible to do that with normal content lines too, it'd be 
harder, as you can't just "eat" a byte or two from the start...
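
To make the idea concrete, here is a minimal Python sketch of the check
described above.  It assumes the (hypothetical, not specced) convention
that the body starts with an empty line whose terminator announces the
line ending used for the rest of the document.  The function name is
made up for illustration.

```python
from typing import Tuple


def detect_line_ending(body: bytes) -> Tuple[bytes, bytes]:
    """Return (line_ending, remaining_body) for a body that begins with
    an empty marker line, as in the proposal above (an assumption, not
    part of any published spec)."""
    if body.startswith(b"\n"):        # is it LF?
        return b"\n", body[1:]
    if body.startswith(b"\r\n"):      # else is it CR followed by LF?
        return b"\r\n", body[2:]
    if body.startswith(b"\r"):        # else CR on its own
        return b"\r", body[1:]
    # The proposal says "else assume CR"; this sketch rejects bodies
    # that do not start with a marker line at all.
    raise ValueError("body does not start with an empty marker line")


ending, rest = detect_line_ending(b"\r\n# Title\r\nSome text\r\n")
lines = rest.split(ending)
```

After this, a basic client only ever has to split on one known byte
sequence, which is the whole point of the suggestion.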

Link to individual message.

defdefred <defdefred (a) protonmail.com>

On Tuesday 19 May 2020 09:20, solderpunk <solderpunk at SDF.ORG> wrote:
> I don't think it's viable for interactive user clients (especially light
> and simple ones) to attempt this, but in the context of, say, a search
> engine which really wants to categorise everything (which is not to say
> that GUS necessarily has to shoulder this burden!), even distinguishing
> languages with the same alphabet is possible by looking at bigram and
> trigram frequencies if there's enough text. German text will have many
> more occurrences of "lich" and "heit" than French or Spanish, etc.
Agree and french have ???, spanish ?? and german ? :-)
Nice to have UTF-8 to display all of them in the same document...

Link to individual message.

defdefred <defdefred (a) protonmail.com>

On Tuesday 19 May 2020 09:29, plugd <plugd at thelambdalab.xyz> wrote:
> Please, please don't require autodetection of language or anything
> else. Gopher required autodetection of charsets and content type, which
> was horrible. One of the best things about the current protocol is that
> it makes these things explicit.

Having multiple charsets is horrible :-)
UTF-8 will rule them all and converters will clean things up.
Historical documents with ancient encodings should be displayed using an external viewer.

Link to individual message.

jan6@tilde.ninja <jan6 (a) tilde.ninja>

I'd LOVE if there was a way to indicate multiple languages in one document...

if you write parts in different languages (maybe are writing articles 
about conlangs, or want to have some quotes from another language, or have 
language tutorials or comparisons or something), then it would be really 
useful to allow signifying the change...

and as Nicole said in another thread,
> FWIW, this is also a necessity for writing both Chinese and Japanese in
> the same document, due to the Han unification of Unicode.

would need to be another block type thingy, probably...

while it's complicating the spec a little, it'd be quite useful for some people...

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Tue, May 19, 2020 at 08:12:17PM +0000, jan6 at tilde.ninja wrote:
 
> while it's complicating the spec a little, it'd be quite useful for some people...

Yeah, but *everything*'s quite useful for some people, and individually
each little useful thing is not *that* much more complicated, and then
one day...

Which is not to say nothing like this will ever happen, I actually think
that really good multilingual support is a lot more important than many
other things and might be worth the cost.  But I've definitely changed
my mind that this is a quick and easy thing which can be snuck in along
with defining line breaks and SNI and stuff.  We can take our time to
figure out a good solution to this.

Cheers,
Solderpunk

Link to individual message.

Nicole Mazzuca <nicole (a) strega-nil.co>

I will say that "allowing someone to send a lang flag per-page" gets you 
the vast majority of the way there, and is a necessity. Changing languages 
is an addition to think about later, imo.

Nicole

-------- Original Message --------
On May 19, 2020, 14:05, solderpunk wrote:

> On Tue, May 19, 2020 at 08:12:17PM +0000, jan6 at tilde.ninja wrote:
>
>> while it's complicating the spec a little, it'd be quite useful for some people...
>
> Yeah, but *everything*'s quite useful for some people, and individually
> each little useful thing is not *that* much more complicated, and then
> one day...
>
> Which is not to say nothing like this will ever happen, I actually think
> that really good multilingual support is a lot more important than many
> other things and might be worth the cost. But I've definitely changed
> my mind that this is a quick and easy thing which can be snuck in along
> with defining line breaks and SNI and stuff. We can take our time to
> figure out a good solution to this.
>
> Cheers,
> Solderpunk

Link to individual message.

defdefred <defdefred (a) protonmail.com>

I wonder how a native RTL reader should format a gemini document?
Should
#Big Title
Become:
eltiT giB#
or:
#eltiT giB
:-)
Regards,
freD.

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Tue, May 19, 2020 at 07:46:25AM +0000, solderpunk wrote:
 
> So now the question is which do we prefer, forcing servers to
> canonicalise (canonise?) everything, with client behaviour then being
> strictly limited, or letting servers just send whatever the author uses
> and forcing clients to accept any of CRLF, CR or LF?
> 
> In general I have preferred pushing burden to server authors over client
> authors for Gemini...but I expect the client burden is pretty small here
> as most languages will have some kind of functionality for breaking
> strings into lines which already handles this...

Some quick experimentation with Python revealed that str.splitlines()
is, indeed, smart enough to handle CRLF, CR and LF as linebreaks.
Encouraged by this, I made a single Gemini page that mixes all three
styles in a single document:

gemini://gemini.circumlunar.space/users/solderpunk/linebreaks.gmi

I visited it with a selection of clients, but the results weren't
encouraging.

It rendered just fine in AV-98, because apparently Python is awesome at
this.
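
For anyone who wants to reproduce the Python behaviour, a minimal
sketch follows (the sample strings are made up for illustration):

```python
# A string mixing all three line-ending styles in one document.
mixed = "first\r\nsecond\rthird\nfourth"

lines = mixed.splitlines()
print(lines)  # ['first', 'second', 'third', 'fourth']

# Caveat: for str, splitlines() also treats some other characters as
# line boundaries, e.g. form feed (\x0c), which may or may not be wanted.
extra = "page one\x0cpage two".splitlines()
print(extra)  # ['page one', 'page two']
```

So str.splitlines() is, if anything, more permissive than "CRLF, CR or
LF": a client relying on it accepts a superset of those three.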

Curiously, McCross, also in Python, could handle CR fine, split CRLF
into lines but displayed some stray characters at the end of lines, and
did not split LF lines at all.  Perhaps this is because it's a GUI
client and is relying on Tkinter to render text?

Bombadillo and Castor (an old build of Castor I had lying around,
admittedly) both handled CR and CRLF fine, but did not split LF lines at
all.  The same was true of my tiny demo Lua and Go clients.

So, speccing any line ending as permissible, like HTTP does, would
seemingly immediately render most clients out-of-spec.

Specifying CR or CRLF but not LF would require the minimum amount of
rework, but it would be hard to justify this by anything other than
laziness.  We either go strict, and require CRLF only, or we go
permissive, and allow anything, but we can't really pick and mix for the
sake of convenience.

We basically need to choose between forcing server authors to normalise
all endings to CRLF or forcing client authors to recognise LF (even
though it'll probably never be seen in the wild).

Neither really thrill me, but, well, here we are.

Cheers,
Solderpunk

Link to individual message.

Sean Conner <sean (a) conman.org>

It was thus said that the Great solderpunk once stated:
> 
> So, speccing any line ending as permissible, like HTTP does, would
> seemingly immediately render most clients out-of-spec.

  Please, *please*, *PLEASE* do not let this dissuade you from making most
clients out of spec.  We have had to suffer terrible things because of this
thinking (like Makefiles and tabs---"when I realized my mistake, there were
already 10 people using it").

> Specifying CR or CRLF but not LF would require the minimum amount of
> rework, but it would be hard to justify this by anything other than
> laziness. 

  And bizarreness.  The last time I worked on any systems that used only CR
was back in the 80s.  Today everybody uses either CRLF (Windows) or LF
(Linux, Mac OS-X, whatever remaining bits of Unix are still around).

> We basically need to choose between forcing server authors to normalise
> all endings to CRLF or forcing client authors to recognise LF (even
> though it'll probably never be seen in the wild).

  Would that be only for text/gemini?  Or all of the text/* formats?

  -spc

Link to individual message.

jan6@tilde.ninja <jan6 (a) tilde.ninja>

May 21, 2020 11:47 PM, "solderpunk" <solderpunk at sdf.org> wrote:
> We basically need to choose between forcing server authors to normalise
> all endings to CRLF or forcing client authors to recognise LF (even
> though it'll probably never be seen in the wild).
> 
> Neither really thrill me, but, well, here we are.

"it'll probably never be seen in the wild" why?
just LF is the standard linux line ending, and if it's allowed, I'm fairly 
certain there will be several servers which would serve the files as-is, 
and given they'd likely be on some sort of unix machine, likely be ending with LF

CR on the other hand, is very unlikely to be found in the wild, since it's 
not used by any living OS, afaik, and also should, in theory, just move cursor 
to the start of line, instead of putting a newline, at least on linux 
terminals... which would overwrite the line...

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Thu, May 21, 2020 at 08:56:57PM +0000, jan6 at tilde.ninja wrote:
> May 21, 2020 11:47 PM, "solderpunk" <solderpunk at sdf.org> wrote:
> > We basically need to choose between forcing server authors to normalise
> > all endings to CRLF or forcing client authors to recognise LF (even
> > though it'll probably never be seen in the wild).
> > 
> > Neither really thrill me, but, well, here we are.
> 
> "it'll probably never be seen in the wild" why?
> just LF is the standard linux line ending, and if it's allowed, I'm
> fairly certain there will be several servers which would serve the files
> as-is, and given they'd likely be on some sort of unix machine, likely be ending with LF
> 
> CR on the other hand, is very unlikely to be found in the wild, since
> it's not used by any living OS, afaik, and also should, in theory, just move
> cursor to the start of line, instead of putting a newline, at least on
> linux terminals... which would overwrite the line...

Argh, yep, sorry.  Exchange *all* instances of lone CR and lone LF in
my entire last email, I mixed them up.

Cheers,
Poldersunk

Link to individual message.

Luke Emmet <luke.emmet (a) gmail.com>


On 21-May-2020 21:47, solderpunk wrote:
> On Tue, May 19, 2020 at 07:46:25AM +0000, solderpunk wrote:
>> So now the question is which do we prefer, forcing servers to
>> canonicalise (canonise?) everything, with client behaviour then being
>> strictly limited, or letting servers just send whatever the author uses
>> and forcing clients to accept any of CRLF, CR or LF?
>>
>> In general I have preferred pushing burden to server authors over client
>> authors for Gemini...but I expect the client burden is pretty small here
>> as most languages will have some kind of functionality for breaking
>> strings into lines which already handles this...
> Some quick experimentation with Python revealed that str.splitlines()
> is, indeed, smart enough to handle CRLF, CR and LF as linebreaks.
> Encouraged by this, I made a single Gemini page that mixes all three
> styles in a single document:
>
> gemini://gemini.circumlunar.space/users/solderpunk/linebreaks.gmi
>
> I visited it with a selection of clients, but the results weren't
> encouraging.
>
> It rendered just fine in AV-98, because apparently Python is awesome at
> this.
As a client author, I think we have to expect a mix of line endings, but 
perhaps not commonly in the same document. But even this will happen at 
some stage when you have multiple authors contributing to the same file. 
Most text editors will handle this without complaint. I think it should 
not be surprising that servers will generally just serve up the GMI file 
as is. End users/authors will simply use whatever text editor they like 
and whatever defaults their OS uses.

You can normalise them very simply just before display as follows (or 
similar scheme)

  - Replace all CRLF with LF in the document.
  - Replace all plain CR with LF
  - Then display, splitting at LF
  - Job done.
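
As a rough illustration, the steps above could be sketched in Python
like this (the function name is hypothetical, and this is only one way
to do it):

```python
from typing import List


def normalise_and_split(text: str) -> List[str]:
    """Normalise CRLF and bare CR to LF, then split into lines."""
    # CRLF must be replaced first; otherwise the CR and the LF of a
    # single CRLF pair would each become a line break of their own.
    text = text.replace("\r\n", "\n")
    text = text.replace("\r", "\n")
    return text.split("\n")


print(normalise_and_split("# Title\r\nfirst\rsecond\nthird"))
# ['# Title', 'first', 'second', 'third']
```

Note that a document ending in a line break yields a trailing empty
string with split("\n"); a client may want to drop it before display.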

This is not onerous or a big performance hit as GMI files are in the 
grand scheme of things *very small* text files.

Works for me in my WIP client as a first approximation - the above link 
looks fine. But if the language has a library for this, we should just 
use it.

Best Wishes

   - Luke

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Thu, May 21, 2020 at 04:54:47PM -0400, Sean Conner wrote:
 
>   Please, *please*, *PLEASE* do not let this dissuade you from making most
> clients out of spec.  We have had to suffer terrible things because of this
> thinking (like Makefiles and tabs---"when I realized my mistake, there were
> already 10 people using it").

Point taken.
 
> > Specifying CR or CRLF but not LF would require the minimum amount of
> > rework, but it would be hard to justify this by anything other than
> > laziness. 
> 
>   And bizarreness.  The last time I worked on any systems that used only CR
> was back in the 80s.  Today everybody uses either CRLF (Windows) or LF
> (Linux, Mac OS-X, whatever remaining bits of Unix are still around).

Yep, as stated, this should have been "LF or CRLF".  Speccing that would
require minimal changes to servers (which could just serve files
verbatim and assume nobody is generating CR-only files in this day and
age), and clients (most of which could handle these two with either no
trouble or minimal trouble).

But doesn't just picking two out of the three seem lazy?  Or is that
just me?  Is it okay in 2020 to just write bare CR out of existence?

> > We basically need to choose between forcing server authors to normalise
> > all endings to CRLF or forcing client authors to recognise LF (even
> > though it'll probably never be seen in the wild).
> 
>   Would that be only for text/gemini?  Or all of the text/* formats?

Ugh.  Let me think...

Cheers,
Solderpunk

Link to individual message.

Martin Keegan <martin (a) no.ucant.org>

On Thu, 21 May 2020, solderpunk wrote:

> We basically need to choose between forcing server authors to normalise
> all endings to CRLF or forcing client authors to recognise LF (even
> though it'll probably never be seen in the wild).

Could we have a bit of a breather to allow the implications to sink in, 
and, critically, to allow the development of conformance testing tools?

If there were a tool which could be run on a document, that confirmed that 
it was conformant, and a similar tool for server behaviour, and people 
had had some time to try to integrate these with the existing 
software, it'd be easier to assess the tradeoffs involved in the spec 
decision.
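
As a starting point for such a document checker, here is a small
Python sketch that counts which line-ending styles a file uses.  The
function names and the "allowed" parameter are hypothetical; which
styles count as conformant is exactly what is still being decided.

```python
import re
from typing import Dict, Iterable

# Try CRLF first so a CRLF pair is counted once, not as CR plus LF.
_EOL = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_line_endings(data: bytes) -> Dict[str, int]:
    """Count occurrences of each line-ending style in a document."""
    counts = {"CRLF": 0, "CR": 0, "LF": 0}
    for match in _EOL.finditer(data):
        counts[_NAMES[match.group()]] += 1
    return counts


def uses_only(data: bytes, allowed: Iterable[str]) -> bool:
    """True if every line ending in data is one of the allowed styles."""
    allowed_set = set(allowed)
    return all(n == 0 or style in allowed_set
               for style, n in count_line_endings(data).items())


sample = b"a\r\nb\rc\nd\r\n"
print(count_line_endings(sample))   # {'CRLF': 2, 'CR': 1, 'LF': 1}
print(uses_only(sample, ["CRLF"]))  # False
```

Running something like this over existing content would give some
numbers on how much of it a strict rule would actually affect.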

Mk

-- 
Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/

Link to individual message.

Julien Blanchard <julien (a) typed-hole.org>

On 21 May 2020 at 22:47, solderpunk <solderpunk at sdf.org> wrote:

> Bombadillo and Castor (an old build of Castor I had lying around,
> admittedly) both handled CR and CRLF fine, but did not split LF lines at
> all.  The same was true of my tiny demo Lua and Go clients.

FWIW this wouldn't be an issue to change in Castor as I had to add code to 
split on CRLF explicitly if I recall correctly.

Link to individual message.

Brian Evans <b__m__e (a) mailfence.com>

In the CR, LF , and CRLF debate I find myself in favor of 
not worrying about the relic that is CR only. No modern systems use this
and as a new protocol I don't see a compelling reason to support ancient systems.
Particularly given that even among ancient systems just a CR as a line ending
was a rarity.

(The below quote has been edited to provide the character that was actually
meant, per solderpunk's later post):
> Bombadillo and Castor (an old build of Castor I had lying around,
> admittedly) both handled LF and CRLF fine, but did not split CR lines at
> all.  

Bombadillo should be able to handle two of the three. For dealing with lines
I split on newline and remove trailing whitespace from all lines when rendering.
Bombadillo will also never render a `\r` character (the line wrapping parses them
out of the render for display purposes). I imagine Castor and Asuka are doing 
something vaguely similar.
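
The "split on newline, then trim trailing whitespace" approach might
look roughly like this in Python.  This is an illustrative sketch
only, not Bombadillo's actual code, and the function name is made up:

```python
from typing import List


def split_lf_and_strip(text: str) -> List[str]:
    """Split on LF only, then strip trailing whitespace from each line.

    The strip removes the CR left over from a CRLF pair (along with any
    trailing spaces or tabs), so LF and CRLF both work.  A bare CR in
    the middle of a line is not treated as a line break.
    """
    return [line.rstrip() for line in text.split("\n")]


print(split_lf_and_strip("one\r\ntwo\nthree\rfour"))
# ['one', 'two', 'three\rfour']
```

That is how a client ends up handling two of the three styles without
any code written specifically for line endings.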


> Some quick experimentation with Python revealed that str.splitlines()
> is, indeed, smart enough to handle CRLF, CR and LF as linebreaks.

That _is_ pretty cool that the python method was able to detect \r line endings
and split them though. Points to python on that one.

Link to individual message.

✈個展 <jetkoten (a) gmail.com>

I think what Martin is saying is some of what I was trying to convey
with my "gemini-fmt idea" proposal on another thread: that Gemini content
authors run their texts through a spec conforming tool before a text ever
reaches the servers and clients of Geminispace.

Whatever combination of CR, LF or CR LF is in the pre-conformed text it
will be correct when it gets on the server and correct when it gets to the
client if run through such a tool.

The people who are writing the server and client software are doing a huge
service to the Gemini community, so please don't saddle them with the
admittedly tedious work of writing code to check in the server and then
check again in the client if it is an improper combination of CR, LF or CR
LF and then even more code to re-conform it. That kind of thing really
seems to go against the 100 lines of code, code it in a weekend Gemini
spirit in my opinion.

Thanks for your consideration.

J

On Thu, May 21, 2020, 16:18 Martin Keegan <martin at no.ucant.org> wrote:

> On Thu, 21 May 2020, solderpunk wrote:
>
> > We basically need to choose between forcing server authors to normalise
> > all endings to CRLF or forcing client authors to recognise LF (even
> > though it'll probably never be seen in the wild).
>
> Could we have a bit of a breather to allow the implications to sink in,
> and, critically, to allow the development of conformance testing tools?
>
> If there were a tool which could be run on a document, that confirmed that
> it was conformant, and a similar tool for server behaviour, and people
> had had some time to try to integrate these with the existing
> software, it'd be easier to assess the tradeoffs involved in the spec
> decision.
>
> Mk
>
> --
> Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/
>

Link to individual message.

Luke Emmet <luke.emmet (a) gmail.com>

On 21-May-2020 22:17, Martin Keegan wrote:
> On Thu, 21 May 2020, solderpunk wrote:
>
>> We basically need to choose between forcing server authors to normalise
>> all endings to CRLF or forcing client authors to recognise LF (even
>> though it'll probably never be seen in the wild).
>
> Could we have a bit of a breather to allow the implications to sink 
> in, and, critically, to allow the development of conformance testing 
> tools?
>
> If there were a tool which could be run on a document, that confirmed 
> that it was conformant, and a similar tool for server behaviour, and 
> people had had some time to try to integrate these with the existing 
> software, it'd be easier to assess the tradeoffs involved in the spec 
> decision.

I would think Postel's rule should pragmatically apply to line endings in 
the response body, but the spec should definitely be very specific about 
line endings in any headers (as is http). But those are generated by the 
server anyway.

If we force one line ending kind on authors, it will be a deterrent to 
them forging ahead with writing user content if they have to use some 
tool just to be able to get the content onto the server. Look at XHTML, 
it was a resounding failure and rejected by authors even though there 
were some good (and ill conceived) intentions of the spec writers.

Best Wishes

  - Luke

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Thu, May 21, 2020 at 10:28:44PM +0100, Luke Emmet wrote:
 
> I would think Postel's rule should pragmatically apply to line endings in the
> response body, but the spec should definitely be very specific about line
> endings in any headers (as is http). But those are generated by the server
> anyway.

No question, the response header ends in CRLF as per internet spec
convention.  This discussion is purely about the response body.
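
For contrast with the body, a strict header parser is short.  The
sketch below assumes the proposed "<STATUS> <META><CR><LF>" form with
exactly one space and a two-digit status code; the function name is
hypothetical and no META length limit is enforced here.

```python
from typing import Tuple


def parse_header(line: bytes) -> Tuple[str, str]:
    """Parse a response header line into (status, meta), strictly."""
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end in CRLF")
    text = line[:-2].decode("utf-8")
    status, sep, meta = text.partition(" ")
    # Exactly one space separates STATUS and META; assume STATUS is
    # two ASCII digits.
    if sep != " " or len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError("malformed header")
    return status, meta


print(parse_header(b"20 text/gemini\r\n"))  # ('20', 'text/gemini')
```

Because the header is machine-generated, rejecting anything else
outright costs content authors nothing.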

Cheers,
Solderpunk

Link to individual message.

Martin Keegan <martin (a) no.ucant.org>

On Thu, 21 May 2020, Luke Emmet wrote:

> If we force one line ending kind on authors, it will be a deterrent to them 
> forging ahead with writing user content if they have to use some tool just to 
> be able to get the content onto the server. Look at XHTML, it was a 
> resounding failure and rejected by authors even though there were some good 
> (and ill conceived) intentions of the spec writers.

I believe there's a difference between requiring servers only to use CRLF 
in the body, and content authors only to use CRLF. If it were specified 
that servers "MUST NOT" send text/gemini bodies that use line separators 
other than CRLF, then it remains up to server implementers to choose 
whether

1) to require content authors to develop or save their content with 
CRLFs, or

2) to translate the content themselves (either on the fly, or by
caching a fettled version with the right line separators).

This is why I keep banging on about having a gemini-check tool for file 
formats (though I understand someone may now have written one up in Go).
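
Option 2 need not be much code on the server side.  A minimal Python
sketch of on-the-fly translation, assuming the whole file fits in
memory (the function name is made up for illustration):

```python
import re

# Match CRLF before bare CR or bare LF, so existing CRLF pairs are
# left as they are rather than being doubled.
_ANY_EOL = re.compile(rb"\r\n|\r|\n")


def to_crlf(body: bytes) -> bytes:
    """Rewrite every line ending in body as CRLF."""
    return _ANY_EOL.sub(b"\r\n", body)


print(to_crlf(b"one\ntwo\r\nthree\rfour"))
# b'one\r\ntwo\r\nthree\r\nfour'
```

A server that streams files in chunks would need a little extra care
for a CR that falls at the very end of one chunk, since it cannot know
whether an LF follows until the next chunk arrives.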

Mk

-- 
Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/

Link to individual message.

solderpunk <solderpunk (a) SDF.ORG>

On Thu, May 21, 2020 at 04:28:19PM -0500, ✈個展 wrote:
 
> The people who are writing the server and client software are doing a huge
> service to the Gemini community, so please don't saddle them with the
> admittedly tedious work of writing code to check in the server and then
> check again in the client if it is an improper combination of CR, LF or CR
> LF and then even more code to re-conform it.

I'd rather make life a little harder for developers - who are
technical people who know what a CR and LF are and are, anyway, signing
up for a bit of fiddly detail by undertaking to implement an internet
protocol from scratch - than make life harder for content authors, who
may have no idea what this nonsense is all about and are arguably doing
an even bigger service for the community by providing something to use
our present abundance of servers and clients to read.

Sorry if this seems blunt!

Cheers,
Solderpunk

> seems to go against the 100 lines of code, code it in a weekend Gemini
> spirit in my opinion.
> 
> Thanks for your consideration.
> 
> J
> 
> On Thu, May 21, 2020, 16:18 Martin Keegan <martin at no.ucant.org> wrote:
> 
> > On Thu, 21 May 2020, solderpunk wrote:
> >
> > > We basically need to choose between forcing server authors to normalise
> > > > all endings to CRLF or forcing client authors to recognise LF (even
> > > though it'll probably never be seen in the wild).
> >
> > Could we have a bit of a breather to allow the implications to sink in,
> > and, critically, to allow the development of conformance testing tools?
> >
> > If there were a tool which could be run on a document, that confirmed that
> > it was conformant, and a similar tool for server behaviour, and people
> > had had some time to try to integrate these with the existing
> > software, it'd be easier to assess the tradeoffs involved in the spec
> > decision.
> >
> > Mk
> >
> > --
> > Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/
> >

Link to individual message.

Luke Emmet <luke.emmet (a) gmail.com>

On 21-May-2020 22:50, Martin Keegan wrote:
> On Thu, 21 May 2020, Luke Emmet wrote:
>
>> If we force one line ending kind on authors, it will be a deterrent 
>> to them forging ahead with writing user content if they have to use 
>> some tool just to be able to get the content onto the server. Look at 
>> XHTML, it was a resounding failure and rejected by authors even 
>> though there were some good (and ill conceived) intentions of the 
>> spec writers.
>
> I believe there's a difference between requiring servers only to use 
> CRLF in the body, and content authors only to use CRLF. If it were 
> specified that servers "MUST NOT" send text/gemini bodies that use 
> line separators other than CRLF, then it remains up to server 
> implementers to choose whether
>
> 1) to require content authors to develop or save their content with 
> CRLFs, or
>
> 2) to translate the content themselves (either on the fly, or by
> caching a fettled version with the right line separators).

I see better now what the alternative suggestions are, thanks.

I still think a robust client has to in all likelihood anticipate both 
CRLF and LF. There will be a range of servers out there and just getting 
the content of the file sent to you isn't going to be unusual. Probably 
plain Mac/CR is too ancient to be widely found, but see below.

Another consideration that comes to me now is for any applications of 
Gemini that might require a binary correct transfer of a text file from 
A to B. For example if we are using Gemini as a content transfer layer for a 
source code repository. There is already a Git over gemini service which 
seems interesting. In these scenarios as the end user, you would expect 
that a GMI file served to you from the repo was the same as it is on the 
server. So I think we should support all "normal" plain text formats as 
the response body and not in general have the server munging or 
adjusting them.

Best Wishes

  - Luke

Link to individual message.

✈個展 <jetkoten (a) gmail.com>

But is it a zero sum game like that really (either make it hard on server
authors or make it hard on content authors)?

The would-be Gemini content author at this point in the game will be
someone who either has created an SSH key and SSHed into a pubnix, used Git
to send their content to a Dome style server or used sftp to upload their
hand written Gemini text.

Is it truly making their life harder to enter:

gemini fmt mytext.gemini<Enter>

and thereby save the server and client authors from having to do all of
that checking logic? There are still many other ways they can enjoy the
moving target development against an evolving spec aren't there?

:)

I hope my idea doesn't seem hostile somehow, because it's not intended to
be in any way. I just figure save everybody effort and complexity,
everybody wins: feed two birds with one seed.

Thanks

On Thu, May 21, 2020, 16:52 solderpunk <solderpunk at sdf.org> wrote:

> On Thu, May 21, 2020 at 04:28:19PM -0500, ✈個展 wrote:
>
> > The people who are writing the server and client software are doing a huge
> > service to the Gemini community, so please don't saddle them with the
> > admittedly tedious work of writing code to check in the server and then
> > check again in the client if it is an improper combination of CR, LF or CR
> > LF and then even more code to re-conform it.
>
> I'd rather make life a little harder for developers - who are
> technical people who know what a CR and LF are and are, anyway, signing
> up for a bit of fiddly detail by undertaking to implement an internet
> protocol from scratch - than make life harder for content authors, who
> may have no idea what this nonsense is all about and are arguably doing
> an even bigger service for the community by providing something to use
> our present abundance of servers and clients to read.
>
> Sorry if this seems blunt!
>
> Cheers,
> Solderpunk
>
> > seems to go against the 100 lines of code, code it in a weekend Gemini
> > spirit in my opinion.
> >
> > Thanks for your consideration.
> >
> > J
> >
> > On Thu, May 21, 2020, 16:18 Martin Keegan <martin at no.ucant.org> wrote:
> >
> > > On Thu, 21 May 2020, solderpunk wrote:
> > >
> > > > We basically need to choose between forcing server authors to normalise
> > > > all endings to CRLF or forcing client authors to recognise LF (even
> > > > though it'll probably never be seen in the wild).
> > >
> > > Could we have a bit of a breather to allow the implications to sink in,
> > > and, critically, to allow the development of conformance testing tools?
> > >
> > > If there were a tool which could be run on a document, that confirmed that
> > > it was conformant, and a similar tool for server behaviour, and people
> > > had had some time to try to integrate these with the existing
> > > software, it'd be easier to assess the tradeoffs involved in the spec
> > > decision.
> > >
> > > Mk
> > >
> > > --
> > > Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/
> > >
>

Link to individual message.

Matt Brubeck <mbrubeck (a) limpet.net>

On Thu, May 21, 2020 at 1:47 PM solderpunk <solderpunk at sdf.org> wrote:
> Some quick experimentation with Python revealed that str.splitlines()
> is, indeed, smart enough to handle CRLF, CR and LF as linebreaks.

For what it's worth, the Rust standard library is designed to work
with both LF ("\n") and CRLF ("\r\n") line endings, but not CR alone
("\r"):
https://github.com/rust-lang/rfcs/blob/master/text/1212-line-endings.md

In the discussion that led to this, there was no real support for
supporting "\r" line endings, nor for other uncommon newlines like
U+2028 LINE SEPARATOR:
https://github.com/rust-lang/rfcs/pull/1212

I think text/gemini should allow both "\n" and "\r\n". Requiring a
single style would either place a burden on content authors (on
servers that don't auto-convert line endings), or make servers more
complex and less efficient. (They would need to inspect every
text/gemini file and possibly transform it, rather than stream it
directly from disk to the network.)  Meanwhile, I think client
software can handle either option fairly simply.
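
As a rough illustration of the client side, here is a minimal Python
sketch that accepts LF and CRLF but deliberately leaves a lone CR alone.
The function name split_lines is made up for this example and is not
taken from any existing client.

```python
import re

# Split only on LF or CRLF; a lone CR is treated as ordinary content.
_LINE_END = re.compile(r"\r?\n")

def split_lines(text):
    """Return the lines of text, accepting LF and CRLF terminators only."""
    lines = _LINE_END.split(text)
    # A trailing terminator yields one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

By contrast, Python's built-in str.splitlines() also breaks on a lone CR
(and on several other separators), so it is more permissive than this.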

For the protocol header, I don't have a strong opinion. Any choice we
make is fairly easy to implement in both server and client. However, I
have a slight preference for specifying just "\n" as the header
terminator. One byte is (marginally) more efficient than two, and in
some languages it's slightly simpler to split on a single-byte
delimiter.

Nicole Mazzuca <nicole (a) strega-nil.co>

I am a strong proponent of "follow the existing practice if there isn't 
good reason to change". I personally think CRLF should be recommended, but 
not required for content, as text/* is wont to do over HTTP; I think 
expecting clients to deal with any of {CR, LF, CRLF} is totally fine. 
However, the header should _absolutely_ end in CRLF, as every existing 
protocol works this way.


Nicole

------- Original Message -------
On Thursday, May 21, 2020 5:45 PM, Matt Brubeck <mbrubeck at limpet.net> wrote:

> On Thu, May 21, 2020 at 1:47 PM solderpunk solderpunk at sdf.org wrote:
>
> > Some quick experimentation with Python revealed that str.splitlines()
> > is, indeed, smart enough to handle CRLF, CR and LF as linebreaks.
>
> For what it's worth, the Rust standard library is designed to work
> with both LF ("\n") and CRLF ("\r\n") line endings, but not CR alone
> ("\r"):
> https://github.com/rust-lang/rfcs/blob/master/text/1212-line-endings.md
>
> In the discussion that led to this, there was no real support for
> supporting "\r" line endings, nor for other uncommon newlines like
> U+2028 LINE SEPARATOR:
> https://github.com/rust-lang/rfcs/pull/1212
>
> I think text/gemini should allow both "\n" and "\r\n". Requiring a
> single style would either place a burden on content authors (on
> servers that don't auto-convert line endings), or make servers more
> complex and less efficient. (They would need to inspect every
> text/gemini file and possibly transform it, rather than stream it
> directly from disk to the network.) Meanwhile, I think client
> software can handle either option fairly simply.
>
> For the protocol header, I don't have a strong opinion. Any choice we
> make is fairly easy to implement in both server and client. However, I
> have a slight preference for specifying just "\n" as the header
> terminator. One byte is (marginally) more efficient than two, and in
> some languages it's slightly simpler to split on a single-byte
> delimiter.

Ben <benulo (a) systemli.org>

Isn't CRLF a DOS/Windows thing? Why use it at all?

-- 
gemini://kwiecien.us/

Nicole Mazzuca <nicole (a) strega-nil.co>

CRLF is the canonical representation of text/* file types; see 
https://tools.ietf.org/html/rfc2616#section-3.7.1


Nicole

------- Original Message -------
On Thursday, May 21, 2020 7:12 PM, Ben <benulo at systemli.org> wrote:

> Isn't CRLF a DOS/Windows thing? Why use it at all?
>
> -------------------------------------------------------
>
> gemini://kwiecien.us/

Matt Brubeck <mbrubeck (a) limpet.net>

On Thu, May 21, 2020 at 7:02 PM Nicole Mazzuca <nicole at strega-nil.co> wrote:
> However, the header should _absolutely_ end in CRLF, as every existing
> protocol works this way.

Yeah, that's fair. I've just caught up on the parts of this thread
that happened before I subscribed, and I agree that making the
protocol work the same as other line-based protocols is compelling.
(And Solderpunk said that this part is already decided, anyway, which
is fine.)

plugd <plugd (a) thelambdalab.xyz>


jan6 at tilde.ninja writes:
> I'd also be in favor of server handling it, although that is a kinda-valid point...
> html doesn't count because the browsers will do their best to fix all
> kinds of TOTALLY BROKEN html,
> you can have partial tags, no end tags for some, etc, and the browser
> will never tell the user your
> site is a trashpile, it will just silently try its best to fix it up
> (even in inspect element,
> you'd see the "repaired" version)...

And this is exactly what's going to happen with gemini too.  The second
you have >=2 clients around, there's going to be a race to accept the
widest range of content, regardless of errors.  Besides places where it
puts users' privacy/security at risk, clients can never be expected to
police compliance of the document with the spec.  (After all, the client
user is almost always someone entirely unresponsible for the generation
of the document.)

solderpunk <solderpunk (a) SDF.ORG>

On Fri, May 22, 2020 at 06:42:30AM +0430, Ben wrote:
> Isn't CRLF a DOS/Windows thing? Why use it at all?

It's not often I'll say anything that might be perceived as "sticking up
for Microsoft" in the context of storing text (I mean, really, Office

delimiters in .csv files depending on the OS locale, making cross-locale
file exchange a pain), but we really shouldn't demonise them for using
CRLF (this is mostly in response, just to be clear, to a HN comment
calling CRLF something like "a Microsoft abomination").

It's true Windows is the only place using CRLF these days, but it's not
like the rest of the world has always used plain old LF and MS decided to
be different just for the sake of different.  DOS inherited CRLF from
CP/M, and Windows is just the "last man standing" from a long CRLF
tradition dating back to the days of physical teletypes, and which
included systems with a lot more geek street cred, like DEC's TOPS-10
and RT-11.

IETF's decision to use CRLF as the canonical form (which was apparently
mostly driven by Postel, according to
https://www.rfc-editor.org/old/EOLstory.txt) was a perfectly sensible
one at the time, when plain CR was also in use (on Macs, LISP machines
and many 8-bit home microcomputers), as it guaranteed lines would always
get split.

Anyway, I've made a decision on this, which I'll post shortly.

Cheers,
Solderpunk

plugd <plugd (a) thelambdalab.xyz>

solderpunk writes:
> be different just for the sake of different.  DOS inherited CRLF from
> CP/M, and Windows is just the "last man standing" from a long CRLF
> tradition dating back to the days of physical teletypes, and which
> included systems with a lot more geek street cred, like DEC's TOPS-10
> and RT-11.

Which makes sense, given that you need a line feed to advance to the
next line, and a carriage return to return the carriage to the first
column.  :-)

solderpunk <solderpunk (a) SDF.ORG>

On Fri, May 22, 2020 at 09:01:45AM +0200, plugd wrote:
> 
> jan6 at tilde.ninja writes:
> > html doesn't count because the browsers will do their best to fix all
> > kinds of TOTALLY BROKEN html,
> > you can have partial tags, no end tags for some, etc, and the browser
> > will never tell the user your
> > site is a trashpile, it will just silently try its best to fix it up
> 
> And this is exactly what's going to happen with gemini too.

I kind of hope that it won't ever be possible for a text/gemini page to
*be* a trashpile.  It's genuinely very hard to create something
"invalid".  There is no concept of anything coming in opening/closing
pairs.  Whitespace is always optional so it doesn't matter whether it's
there or not.

A link line where the first thing after => can't be parsed as a URL is a
genuine invalidity, but remember that relative links are allowed, so
even:

=> one two three

is fine (a relative link to a path ending in "one", with label "two
three").  Something like:

=> foo://bar://baz:// Chew on this!

is a real problem, but it's not going to happen by accident.
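
To make the link-line reading above concrete, here is a minimal Python
sketch.  The helper name parse_link_line is hypothetical, and the sketch
only separates the URL from the label; it does not try to validate the
URL, so the second example would still need to be rejected by whatever
URL parser the client uses.

```python
def parse_link_line(line):
    """Split a text/gemini link line into (url, label).

    Returns None if the line is not a link line or has no URL.
    The label is None when nothing follows the URL.
    """
    if not line.startswith("=>"):
        return None
    # Whitespace after "=>" is optional; the first token is the URL
    # and everything after the next run of whitespace is the label.
    parts = line[2:].strip().split(None, 1)
    if not parts:
        return None
    url = parts[0]
    label = parts[1] if len(parts) > 1 else None
    return url, label
```

With this, "=> one two three" yields the relative URL "one" and the
label "two three", matching the reading given above.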

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

On Fri, May 22, 2020 at 05:01:34PM +0200, plugd wrote:
 
> Which makes sense, given that you need a line feed to advance to the
> next line, and a carriage return to return the carriage to the first
> column.  :-)

Yeah, it's not even a "DEC thing", it's a "cold, hard, mechanical
reality" thing!

Cheers,
Solderpunk

solderpunk <solderpunk (a) SDF.ORG>

Thanks, everybody, for your thoughts on this matter!  I've made a
decision.

Let me start out by saying how thoroughly ridiculous it is that in 2020
people can still have spirited discussions about how the concept of "a
line of text" should work!  I hope that one day this is standardised
once and for all across all systems, but I won't hold my breath.

The line of thinking to my decision has gone something like this:

While it's true that the spec currently is totally ambiguous on what our
supposedly line-oriented format uses to define a line, and that's a
genuine problem which needs to be fixed because specs *shouldn't* be
ambiguous about this kind of thing (thanks to everybody who flagged this
issue!), it's also true that as far as I'm aware this ambiguity has
created precisely zero interoperability problems for anybody.

In a situation where some spec detail is ambiguous but everything is
working just fine 100%, the obvious default course of action should be
to codify whatever the current practice is.

That seems to be everybody using plain LF.  So, there would need to be a
very compelling reason *not* to allow plain LF.  I can't think of any
and nobody has mentioned any, so, first conclusion: plain LF has to be
allowed.

As previously noticed, it is already IETF-mandated that the canonical
representation of anything with a text/* MIME type is CRLF.  HTTP sets a
precedent of protocols being able to permit additional line endings on
top of that, but it seems a very different proposition for a protocol to

CRLF has to be allowed as well.

This leaves only the question of whether or not plain CR should be
allowed as a third option.  Given the complete absence of any
CR-separated content in Geminispace so far, the poor support for
CR-separated line recognition in contemporary programming languages (and
the consequent poor support for it in almost all extant Gemini clients),
and the fact that mandatory TLS means anybody who really wants to
implement Gemini on an old CR-based system is going to have muuuuuch
bigger problems to worry about than translating line endings, this seems
very hard to justify.  I did previously have the notion that this was
somehow "the right thing to do", but having had to spend more time and
energy thinking about this issue than it really deserves, I'm a bit more
inclined to help "grease the wheels of history" on its clear path toward
reduced variability in such a fundamental matter.  LFCR is now well and
truly dead, and CR is very, very close to it.  It's clear that other new
technologies are choosing to leave it behind, so we might as well follow
suit and help the world get to a place of only having to worry about two
common EOLs instead of three (and may our grandchildren only have to
worry about one!).  Conclusion the third, let's not allow CR.

Since CRLF is the canonical form of all text/* subtypes, these changes
to the Gemini spec won't go in section 1.3.5 (which defines
text/gemini), but in section 1.3.3 (which defines response bodies in
general, and is where we e.g. define UTF-8 as the default encoding).  I
will probably borrow verbatim the wording used by the HTTP spec but just
leave out CR.

This solves the ambiguity of the spec on this matter, and should involve
very little actual work from anybody to attain/retain compliance - which
is exactly how it should be, given that everything is already working
just fine despite the ambiguity.  In particular, given the complete lack
of modern systems using anything other than LF or CRLF, server munging
should not be necessary and Gemini content can be served as verbatim
binary data from the filesystem, which is a very desirable property.

I'll make this change, along with the two other uncontroversial
housekeeping changes (mandating SNI and requiring response headers to
use exactly one space instead of arbitrary whitespace), tonight or
tomorrow.  Then we can shift focus to more important things, like client
certificates.
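
For the header change, a strict client-side check might look roughly
like the Python sketch below.  The function name parse_header is made
up, and the two-digit status check and UTF-8 decoding are assumptions of
this sketch, not wording from the spec change itself.

```python
def parse_header(raw: bytes):
    """Parse b"<STATUS> <META>\\r\\n" into (status, meta).

    Assumes a two-digit numeric status, exactly one space before META,
    and a CRLF terminator; anything else raises ValueError.
    """
    if not raw.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    line = raw[:-2].decode("utf-8")
    status, sep, meta = line.partition(" ")
    if sep != " " or len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError("expected a two-digit status followed by one space")
    if meta.startswith((" ", "\t")):
        raise ValueError("more than one whitespace character after status")
    return int(status), meta
```

A lenient client could of course accept more than this; the point is
only that the strict form is a few lines to check.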

Cheers,
Solderpunk

Luke Emmet <luke.emmet (a) gmail.com>


On 22-May-2020 17:07, solderpunk wrote:
> Thanks, everybody, for your thoughts on this matter!  I've made a
> decision
>
> <snip>
>
> This solves the ambiguity of the spec on this matter, and should involve
> very little actual work from anybody to attain/retain compliance - which
> is exactly how it should be, given that everything is already working
> just fine despite the ambiguity.  In particular, given the complete lack
> of modern systems using anything other than LF or CRLF, server munging
> should not be necessary and Gemini content can be served as verbatim
> binary data from the filesystem, which is a very desirable property

Not that you need my endorsement, but this gets a thankful thumbs up 
from me.

  - Luke

plugd <plugd (a) thelambdalab.xyz>

solderpunk writes:
> On Fri, May 22, 2020 at 09:01:45AM +0200, plugd wrote:
>> And this is exactly what's going to happen with gemini too.
>
> I kind of hope that it won't ever be possible for a text/gemini page to
> *be* a trashpile.  It's genuinely very hard to create something
> "invalid".  There is no concept of anything coming in opening/closing
> pairs.  Whitespace is always optional so it doesn't matter whether it's
> there or not.

Definitely agree, at the moment it all seems very close to optimal.  And
sorry, I didn't mean to sound so defeatist!  I was just getting worried
that some of the ideas being proposed for closing exploitation holes in
text/gemini were of this "unenforceable" variety.

> A link line where the first thing after => can't be parsed as a URL is a
> genuine invalidity, but remember that relative links are allowed, so
> even:
>
> => one two three
>
> is fine (a relative link to a path ending in "one", with label "two
> three").  Something like:
>
> => foo://bar://baz:// Chew on this!
>
> is a real problem, but it's not going to happen by accident.

True. And _all_ clients would have to go out of their way and conspire
together to interpret this as anything but garbage.

... then again, if clients start to act defensively and ignore malformatted link
lines, it wouldn't be impossible for document authors to start
incorporating intentionally malformatted lines containing non-standard
directives:

=> foo://bar://inline-image:cat.jpg

Has anybody developed a general theory of specification creep? :-)

Tim

Simon Forman <spf (a) flowkarma.live>

On 5/22/2020 at 9:13 AM, "Luke Emmet" <luke.emmet at gmail.com> wrote:
>
>On 22-May-2020 17:07, solderpunk wrote:
>> Thanks, everybody, for your thoughts on this matter!  I've made a
>> decision
>>
>Not that you need my endorsement, but this gets a thankful thumbs up
>from me.

Ditto.  This seems very well thought out to me.  Kudos!

~Simon

solderpunk <solderpunk (a) SDF.ORG>

On Fri, May 22, 2020 at 10:24:55AM -0700, Simon Forman wrote:
> On 5/22/2020 at 9:13 AM, "Luke Emmet" <luke.emmet at gmail.com> wrote:
> >
> >On 22-May-2020 17:07, solderpunk wrote:
> >> Thanks, everybody, for your thoughts on this matter!  I've made a
> >> decision
> >>
> >Not that you need my endorsement, but this gets a thankful thumbs up
> >from me.
> 
> Ditto.  This seems very well thought out to me.  Kudos!
> 

Thank you both!

Cheers,
Solderpunk

Sean Conner <sean (a) conman.org>

It was thus said that the Great plugd once stated:
> 
> Has anybody developed a general theory of specification creep? :-)

  Yes.  There's Zawinski's Law:  Every program attempts to expand until it
can read mail. Those programs which cannot so expand are replaced by ones
which can.

  You can read some commentary about it from these links:

https://en.wikipedia.org/wiki/Jamie_Zawinski#Principles
https://softwareengineering.stackexchange.com/questions/150254/what-does-jamie-zawinskis-law-mean

  -spc (Every program has at least one bug and can be shortened by at least
	one instruction---from which, by induction, one can deduce that
	every program can be reduced to one instruction which doesn't work. [1])

[1]	This actually happened.  Not to me, but I know of such a case.

epoch <epoch (a) enzo.thebackupbox.net>

On Tue, May 19, 2020 at 09:21:13AM +0000, defdefred wrote:
> On Tuesday 19 May 2020 02:07, Sean Conner <sean at conman.org> wrote:
> 
> > I thought about autodetection---Unicode is defined in blocks, where each
> > alphabet becomes a defined block in Unicode. I then realized that there are
> > multiple languages that use the European block. Sure, detecting Greek is
> > easy since they have their own alphabet, but what about Spanish, French and
> > German? They use the same alphabet.
> 
> Autodetection is necessary for document using multiple languages.
> Browser preference is fine for hard to detect case.
> 

I was thinking, maybe add a "lang" parameter to the end of mime-types, 
like how charset and boundary work.

Then, if people want multiple languages per-"document"? I guess
per-response. They can use mime multipart, then when the mime-type for
each part is picked, they can put the language there.

That doesn't really help with "inline" different languages. Example gemini response:

20 multipart/mixed; boundary=longrandomthing
--longrandomthing
Content-Type: text/plain; lang=en-US

color. withOUT a 'u'. ha.
--longrandomthing
Content-Type: text/plain; lang=ja

[insert some japanese here]
--longrandomthing--
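
A client could pull such a parameter out with very little code.  Here is
a minimal Python sketch using the standard library's email.message
module; the function name parse_meta is made up, and "lang" is only the
parameter name floated in this message, not something any spec defines.

```python
from email.message import Message

def parse_meta(meta):
    """Return (mime_type, lang) from a value like 'text/plain; lang=en-US'.

    lang is None when the hypothetical "lang" parameter is absent.
    """
    msg = Message()
    msg["Content-Type"] = meta
    return msg.get_content_type(), msg.get_param("lang")
```

For example, parse_meta("text/plain; lang=en-US") gives the type
"text/plain" and the language tag "en-US".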


This document seems like it might come in handy:

https://www.w3.org/International/articles/language-tags/
