Ambiguity in spec regarding line endings

1. Ryan Kavanagh (rak (a) rak.ac)

I'm reading the current version of the spec, and have come across the
following ambiguous paragraph in §3.3:

    When in canonical form, media subtypes of the "text" type use CRLF
    as the text line break.  Gemini relaxes this requirement and allows
    the transport of text media with plain LF alone (but NOT a plain CR
    alone) representing a line break when it is done consistently for an
    entire response body.  Gemini clients MUST accept CRLF and bare LF
    as being representative of a line break in text media received via
    HTTP.

How do the second and third sentences interact? In particular, how does

    [...] when it is done consistently for an entire response body.

interact with

    Gemini clients MUST accept CRLF and bare LF as being representative
    of a line break in text media received via HTTP.

How should Gemini clients behave when both CRLF and LF appear in the
same text/gemini transmission? Are both to be equivalently treated as
line breaks?
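
To make the question concrete, a client or test tool could check which
conventions a given response body actually uses. This is a minimal Python
sketch and not something taken from the spec; the function name
classify_line_endings and its return labels are hypothetical.

```python
def classify_line_endings(body: bytes) -> str:
    """Report which line-break conventions occur in a response body."""
    crlf = body.count(b"\r\n")
    lf = body.count(b"\n") - crlf   # LFs that are not part of a CRLF
    cr = body.count(b"\r") - crlf   # CRs that are not part of a CRLF
    if cr:
        return "bare-cr"            # a plain CR alone is NOT allowed
    if crlf and lf:
        return "mixed"              # the case the paragraph leaves unclear
    if crlf:
        return "crlf"
    if lf:
        return "lf"
    return "none"                   # no line breaks at all
```

A body for which this returns "mixed" is exactly the situation asked about
above: each break is acceptable on its own, but the body as a whole is not
consistent.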

I've looked through the archives to see what has been said in the past
about line breaks, and the two following messages appear most relevant:

On Sat, Sep 07, 2019 at 04:30:14PM -0400, Jason McBrayer wrote:
> IMO, it makes sense to require CRLF in the plain text parts of the
> protocol (after requests, after the status line of a response), but I
> don't think that the text/gemini file format needs to have CR/LF; IMO
> clients should be prepared to accept either LF or CR/LF just as they
> would with text/plain. And maybe if we're serious about supporting old
> devices, clients should be prepared for bare CR, too (Classic MacOS).
> But it's a pain in the arse to authors to have to save text documents
> with non-native line endings, and I don't feel like servers need to be
> in the business of reformatting the content they serve.

On Sun, Sep 08, 2019 at 02:42:08PM +0000, solderpunk wrote:
> I will admit that the current liberal use of CRLF throughout the
> Gemini spec is the result of me blindly copying from Gopher and other
> RFCs (as Sean mentioned, it's ubiquitous).

Here's [0,1] some of the history of requiring CRLF in network protocols
and in requiring CRLF for text/ subtypes [2] during transmission.

TL;DR: every system has a different native line ending sequence (LF vs
CR vs CRLF). To ensure all can communicate with each other (and to
simplify parsing of communications), transmissions are required to
represent all line endings in text formats by CRLF. Line endings used in
the local storage of text files have *nothing to do* with the line
endings used in transmission, and clients are expected to convert from
CRLF to whatever local format is preferred. So indeed, servers are in
the business of reformatting text/* content that they serve, and they do
so to ensure interoperability between systems with different line ending
conventions.

I think there's a conceptual point to be made here: text/gemini files
are not binary data, but rather, *text files*. This means that their
transmission should not attempt to provide byte-for-byte identical
copies of the local data, but should instead follow well-defined and
agreed-upon representations. If your goal is to transmit a byte-for-byte
identical copy of your file, there are other mime types you can use to
accomplish this (e.g., application/octet-stream).

The FTP protocol makes a similar conceptual distinction. It allows for
text transmission (ASCII and EBCDIC types), where end-of-lines are
defined to be CRLF (ASCII type) and NL (EBCDIC type). It also allows for
a stream / binary transfer mode for transmitting text (and other data)
without any conversion. Quoting from the RFC [4, §3.4]:

    For the purpose of standardized transfer, the sending host will
    translate its internal end of line or end of record denotation into
    the representation prescribed by the transfer mode and file
    structure, and the receiving host will perform the inverse
    translation to its internal denotation.  [...]  Since these
    transformations imply extra work for some systems, identical systems
    transferring non-record structured text files might wish to use a
    binary representation and stream mode for the transfer.

However, in keeping with Postel's law, I suggest allowing clients to
accept LF as a line ending, as is done by RFC 7230 §3.5 [3]:

     Although the line terminator for the start-line and header fields
     is the sequence CRLF, a recipient MAY recognize a single LF as a
     line terminator and ignore any preceding CR.

Conclusion:

To eliminate ambiguity and to make the gemini protocol consistent with
every other text transmission protocol I know of, I propose amending the
ambiguous paragraph in the spec as follows:

    As specified in RFC 2046 §4.1.1, the canonical form of any MIME
    "text" subtype MUST always represent a line break as a CRLF
    sequence. For robustness, a recipient MAY recognize a single LF as
    a line terminator and ignore any preceding CR in text media.

Best,
Ryan

[0] https://www.rfc-editor.org/old/EOLstory.txt
[1] https://tools.ietf.org/html/rfc318
    [ page 8, "End of Line Convention" ]
[2] https://tools.ietf.org/html/rfc2046#section-4.1.1
[3] https://tools.ietf.org/html/rfc7230#section-3.5
[4] https://tools.ietf.org/html/rfc959

-- 
|)|/  Ryan Kavanagh      | GPG: 4E46 9519 ED67 7734 268F
|\|\  https://rak.ac     |      BD95 8F7B F8FC 4A11 C97A

2. Petite Abeille (petite.abeille (a) gmail.com)



> On Jun 4, 2020, at 18:08, Ryan Kavanagh <rak at rak.ac> wrote:
> 
>    As specified in RFC 2046 §4.1.1, the canonical form of any MIME
>    "text" subtype MUST always represent a line break as a CRLF
>    sequence. For robustness, a recipient MAY recognize a single LF as
>    a line terminator and ignore any preceding CR in text media.

$ delcr | gemini | addcr

https://cr.yp.to/ucspi-tcp/addcr.html

3. prisonpotato (a) tilde.team (prisonpotato (a) tilde.team)

I disagree with this idea, as it adds a significant burden to both
server implementations and client implementations running on unix
systems.


4. Ryan Kavanagh (rak (a) rak.ac)

Hi,

On Thu, Jun 04, 2020 at 12:23:26PM -0400, prisonpotato at tilde.team wrote:
> I disagree with this idea, as it adds a significant burden to both
> server implementations and client implementations running on unix
> systems.

The proposed modification *reduces* the burden for clients on all
systems. Indeed, clients are currently required to accept *both* CRLF
and bare LF as being representative of a line break. The proposed change
requires them only to accept CRLF, while giving them the option to also
accept LF if they so desire. This means: clients satisfying the spec now
will satisfy the spec after the change.

You are correct in that it increases the burden for servers (regardless
of the host system): they must convert bare LF endings to CRLF before
transmitting text. Whether or not this change imposes a "significant"
burden is subjective. For what it's worth, the gopher protocol specifies
CRLF line endings, and gophernicus manages to do this conversion with ~5
lines of code [0, 1].
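
To give a sense of scale, here is a minimal Python sketch of such a
server-side conversion. It is not the gophernicus code referenced above;
the function name to_crlf is hypothetical, and the sketch assumes the whole
body is held in memory as bytes.

```python
import re

def to_crlf(body: bytes) -> bytes:
    """Rewrite every bare LF as CRLF, leaving existing CRLF pairs alone."""
    # The negative lookbehind skips LFs that already follow a CR, so the
    # function is idempotent. A bare CR is left untouched.
    return re.sub(rb"(?<!\r)\n", b"\r\n", body)
```

A server that streams a file in chunks would need a little more care than
this, since a CR at the end of one chunk and an LF at the start of the next
must still be treated as a single CRLF.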

The question boils down to a cost-benefit analysis of:

    preserving spec compliance for existing servers and not having
    servers worry about line endings

versus

    respecting network protocol conventions that have been established
    for decades and not violating a MUST requirement of RFC 2046 §4.1.1

Best,
Ryan

[0] https://github.com/gophernicus/gophernicus/blob/master/src/file.c#L68
[1] https://github.com/gophernicus/gophernicus/blob/master/src/string.c#L122

-- 
|)|/  Ryan Kavanagh      | GPG: 4E46 9519 ED67 7734 268F
|\|\  https://rak.ac     |      BD95 8F7B F8FC 4A11 C97A


5. colecmac (a) protonmail.com (colecmac (a) protonmail.com)

> The proposed change requires them only to accept CRLF


This is a problem I think, because any clients that do this will
fail to properly display the vast majority of Gemini content. I don't
see the problem with mandating handling LF as well. Just split lines on
LF and you're done for both types.

makeworld
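
The "split lines on LF" approach can be sketched in Python as follows. This
is an illustration only, assuming the body is available as bytes; the
function name split_lines is hypothetical, and stripping one trailing CR per
line is an assumption about how a client would handle CRLF.

```python
from typing import List

def split_lines(body: bytes) -> List[bytes]:
    """Split a text body on LF, accepting CRLF and bare LF alike."""
    lines = body.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a final line break
    # A CR immediately before the LF belongs to the line break, not the line.
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]
```

With this approach a body that mixes CRLF and LF yields the same lines as a
consistent one, and a bare CR in the middle of a line is left in place
rather than treated as a break.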


6. prisonpotato (a) tilde.team (prisonpotato (a) tilde.team)


I agree.  This seems like a more sensible solution.


7. Ryan Kavanagh (rak (a) rak.ac)

After some thought and discussion, I'd like to retract my proposed
amendment. I misinterpreted that paragraph of the spec as implying
something that it doesn't.

Best,
Ryan

-- 
|)|/  Ryan Kavanagh      | GPG: 4E46 9519 ED67 7734 268F
|\|\  https://rak.ac     |      BD95 8F7B F8FC 4A11 C97A

8. Sean Conner (sean (a) conman.org)

It was thus said that the Great prisonpotato at tilde.team once stated:
> I disagree with this idea, as it adds a significant burden to both
> server implementations and client implementations running on unix
> systems.

  And I disagree with this disagreement.  Requiring only LF produces an
undue hardship on Windows systems which use both CR and LF.  And Windows is
*still* the most popular operating system out there.

  -spc (I can't believe I'm defending Windows here ... )


9. Matthew Graybosch (hello (a) matthewgraybosch.com)

On Thu, 4 Jun 2020 17:53:14 -0400
Sean Conner <sean at conman.org> wrote:

> And I disagree with this disagreement.  Requiring only LF produces
> an undue hardship on Windows systems which use both CR and LF.  And
> Windows is *still* the most popular operating system out there.

What if somebody decides to write a Gemini client for FreeDOS? IIRC,
that OS still uses CR and LF for line endings, just like MS-DOS did.
Wouldn't a LF-only requirement hamper such an effort?

-- 
Matthew Graybosch           https://www.matthewgraybosch.com
All opinions are my own.    Harrisburg, PA USA

"Out of order?! Even in the future nothing works!"


10. Sean Conner (sean (a) conman.org)

It was thus said that the Great Ryan Kavanagh once stated:
> 
> Here's [0,1] some of the history of requiring CRLF in network protocols
> and in requiring CRLF for text/ subtypes [2] during transmission.
> 
> [0] https://www.rfc-editor.org/old/EOLstory.txt
> [1] https://tools.ietf.org/html/rfc318
>     [ page 8, "End of Line Convention" ]

  Thank you for this.  I'm saving the references for future discussions on
this topic.

  -spc


11. defdefred (defdefred (a) protonmail.com)

Isn't it only a text editor issue?

WordPad works fine with Unix-style text files...


------- Original Message -------
On Friday 5 June 2020 00:21, Matthew Graybosch <hello at matthewgraybosch.com> wrote:

> On Thu, 4 Jun 2020 17:53:14 -0400
> Sean Conner sean at conman.org wrote:
>
> > And I disagree with this disagreement. Requiring only LF produces
> > an undue hardship on Windows systems which use both CR and LF. And
> > Windows is still the most popular operating system out there.
>
> What if somebody decides to write a Gemini client for FreeDOS? IIRC,
> that OS still uses CR and LF for line endings, just like MS-DOS did.
> Wouldn't a LF-only requirement hamper such an effort?
>
> --
>
> Matthew Graybosch https://www.matthewgraybosch.com
> All opinions are my own. Harrisburg, PA USA
>
> "Out of order?! Even in the future nothing works!"
