💾 Archived View for gemi.dev › gemini-mailing-list › 000107.gmi captured on 2024-05-12 at 15:57:46. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Ahoy!
The three month spec freeze announced, well, almost three months ago, will be expiring soon. Things to ponder/discuss have been piling up. So, I've been considering dealing with some of the "low hanging fruit" early (I have some time off work later this week because of a national holiday). I'm thinking in particular of fairly minor changes, where it is obvious that there is a problem in what's already specced or important functionality is missing, and where there are fairly obvious solutions.
To this end, I'm going to outline some proposals below for feedback. I *hope* that these will be pretty uncontroversial. Feedback is welcome, as always, but we have to do *something* about these issues, so if you really think what I propose below is a bad idea, a better alternative would be a very good thing to bring to the discussion!
Here we go, then...
ISSUE 1:
Problem: The current spec does not impose any limit on request header length. The status code and META field can be separated by arbitrarily many spaces and/or tabs. Malicious or buggy servers can hang or crash carelessly written clients by sending an infinite stream of whitespace. It's not clear *why* anybody would want to do this (a "reverse DOS attack" is not very useful!), but it's clearly a problem nevertheless.
Proposal: Redefine response headers from:
<STATUS><whitespace><META><CR><LF>
to:
<STATUS> <META><CR><LF>
i.e. exactly one space character between <STATUS> and <META>.
Rationale: Allowing multiple whitespace characters of different kinds makes sense in, e.g., the link syntax of text/gemini - that has to be written and read by human content authors, so it's a good idea to accommodate different editor behaviours and different personal preferences for laying things out. But response headers are written and read by software, so there's no need to be so generous. Specifying the header format more precisely actually just makes life slightly easier for client authors.
As a result of this, the maximum length of a response header becomes finite (as the lengths of <STATUS> and <META> are already well defined elsewhere). Client authors who want to follow Postel's law won't need to make any changes here. I imagine many server authors also won't actually need to. The most probable scenario is that no change is needed (the server already sends one space) or that a single s/\t/ / is needed.
ISSUE 2:
Problem: The spec makes a big fuss about how text/gemini is line-oriented, but does not clearly state what exactly constitutes a line. The definition of link lines includes a <CR><LF> at the end but it's not clear if that applies to all line types - or whether I even meant to do this or it was a careless error.
Proposal: Actually, it turns out this is decided for us. RFC2046, which defines the text/* MIME media type and the text/plain subtype, covers this very clearly:
---
4.1.1. Representation of Line Breaks
The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden.
This rule applies regardless of format or character set or sets involved.
---
Since text/gemini is, well, text/gemini, it is a "text" subtype and using anything other than CRLF means we're violating the RFCs we're supposedly building on top of.
So, CRLF everywhere it is.
I propose it be mostly the server's job to handle this. Text editors on different operating systems used by content authors will use various different line break encodings which are beyond our control, so we can't really make it the author's job. Servers can translate LF to CRLF before sending content over the network. This way clients only need to handle the "canonical" format, no matter what authors do.
Rationale: Don't break foundational RFCs.
Yeah, I know, this is tedious and no fun for server authors, but, well, see above.
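To make the "server's job" in ISSUE 2 concrete, here is a minimal sketch of the translation step in Python. The function name is hypothetical and not taken from any existing server, and it assumes the body is in an ASCII-compatible encoding such as UTF-8, so that CR and LF are single bytes.

```python
import re

# CRLF is listed first so an existing CRLF is matched as one unit
# and is not turned into two line breaks.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def to_canonical_crlf(body: bytes) -> bytes:
    """Rewrite every CRLF, bare CR, or bare LF in `body` as CRLF.

    Hypothetical helper: a server could call this on text/* content
    just before writing it to the socket.
    """
    return _LINE_BREAK.sub(b"\r\n", body)
```

Content that is already in canonical form passes through unchanged, so the step is safe to apply unconditionally to text/gemini files.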
ISSUE 3:
Problem: There's no way to specify the (human) language a text/gemini document is written in.
Proposal: Define a new parameter for the text/gemini MIME type (alongside the previously defined `charset`) to specify language. Following the example set by HTML, it seems natural to call the parameter `lang` and to allow values as per RFC1766, e.g.:
text/gemini; charset=utf-8; lang=en
text/gemini; charset=utf-8; lang=en-US
text/gemini; charset=utf-8; lang=en-GB
text/gemini; charset=utf-8; lang=es
text/gemini; charset=utf-8; lang=fr
Rationale: A protocol for a global network which targets human beings reading textual content as its first-class application shouldn't be Anglocentric! Gemini already has:
On Mon, May 18, 2020 at 08:35:44PM +0000, solderpunk wrote: Whoops, small mistake: > ISSUE 1: > > Problem: The current spec does not impose any limit on request header > length. This should say "response header", not "request header". Cheers, Solderpunk
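For client authors, the stricter response header from ISSUE 1 could be parsed roughly as follows. This is only a sketch: the two-digit status and the 1024-byte META limit are assumptions based on the Gemini spec as I understand it, not figures stated in this thread, and `parse_response_header` is a made-up name.

```python
from typing import Tuple

# Assumption: META is limited to 1024 bytes by the spec; the thread only
# says the lengths are "well defined elsewhere".
MAX_META_BYTES = 1024


def parse_response_header(line: bytes) -> Tuple[int, str]:
    """Parse b'<STATUS> <META>\\r\\n' with exactly one space separator."""
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    header = line[:-2]
    status, sep, meta = header.partition(b" ")
    # Assumption: STATUS is exactly two ASCII digits.
    if sep != b" " or len(status) != 2 or not status.isdigit():
        raise ValueError("expected a two-digit status followed by one space")
    # A second space or a tab here means the server sent extra whitespace.
    if meta.startswith((b" ", b"\t")):
        raise ValueError("more than one whitespace character after status")
    if len(meta) > MAX_META_BYTES:
        raise ValueError("META is too long")
    return int(status), meta.decode("utf-8")
```

Because every field now has a bounded length, a client can also cap how many bytes it reads while looking for the CRLF, which is what defuses the endless-whitespace problem.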
These are good additions, thanks. I take issue with, maybe predictably, number 2. I didn't realize that RFC existed, and I was surprised to read it. Are there any other formats that actually follow this? Take markdown for example, plenty of people write markdown with only a \n at the end of each line, and it doesn't affect anything. In fact, in the markdown spec, it explicitly says just \n is fine: https://spec.commonmark.org/0.29/#line-ending I don't see the point in following the RFC in this case, I think it adds needless complication to the spec. Obviously it's not such a big change to server software, but I don't think there's a good reason to add it. > Rationale: Don't break foundational RFCs. It would be hard to argue with this, except that there seems to be a precedent of not caring about this specific part of this RFC. I think Gemini would be less surprising and more simple, if it disregarded this like most other specs seem to do. I'm in favor of defining a line ending as \r\n OR \n. Thoughts? makeworld
It was thus said that the Great solderpunk once stated: > ISSUE 3: > > Problem: There's no way to specify the (human) language a text/gemini document > is written in. > > Proposal: Define a new parameter for the text/gemini MIME type > (alongside the previously defined `charset`) to specify language. > Following the example set by HTML, it seems natural to call the > parameter `lang` and to allow values as per RFC1766, e.g.: > > text/gemini; charset=utf-8; lang=en > text/gemini; charset=utf-8; lang=en-US > text/gemini; charset=utf-8; lang=en-GB > text/gemini; charset=utf-8; lang=es > text/gemini; charset=utf-8; lang=fr What's a client to do if 'lang=' isn't there? Assume English? Assume nothing? -spc
On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote: > What's a client to do if 'lang=' isn't there? Assume English? Assume > nothing? Good question. Should clients which do something with this information (like screenreaders) have a default language setting which users can set? Cheers, Solderpunk
On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote: > What's a client to do if 'lang=' isn't there? Assume English? Assume > nothing? Specifying English, or any other language, as a default value doesn't seem right to me. Realistically, the majority of clients will not do anything with this information. Those which *do* use it in some way will know best what the most sensible behaviour is in the absence of explicit information. So I propose that client behaviour in the absence of a specified language is up to the client. The search engine case seems the trickiest to me (compared to, say, screen reading). Statistical language recognition is a thing, but doing it on every document in Geminispace is likely to be burdensome... Cheers, Solderpunk
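A sketch of how a client might read the proposed `lang` parameter while leaving the "no language given" case to its own settings, as suggested above. The function names and the `user_default` argument are hypothetical, and the parameter splitting is deliberately naive (it does not handle quoted values that contain semicolons).

```python
from typing import Dict, Optional, Tuple


def parse_meta(meta: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; name=value; ...' into a type and a parameter dict."""
    parts = [part.strip() for part in meta.split(";")]
    mime_type = parts[0].lower()
    params: Dict[str, str] = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:  # ignore fragments that are not name=value pairs
            params[name.strip().lower()] = value.strip().strip('"')
    return mime_type, params


def document_language(meta: str, user_default: Optional[str] = None) -> Optional[str]:
    """Return the declared language, else whatever the client is configured with."""
    _, params = parse_meta(meta)
    return params.get("lang", user_default)
```

A screen reader could pass the user's configured language as `user_default`, while a plain text browser could ignore the result entirely.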
On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote:
> What's a client to do if 'lang=' isn't there? Assume English? Assume nothing?
I'd think only the mimetype should be mandatory, and the rest will use defaults, when not specified... of course, spec shouldn't specify what the defaults are...
it could also attempt to auto-detect and prompt user if it matters (normal text browsers will probably be indifferent, but audio browser could ask, and search engines could warn, which will incentivize users to put a language anyway), but that's a client-specific extra...
I'm not sure I see the point in the encoding part, though... practically everything can be converted to utf8 rather easily, making it a bit useless to specify...
another interesting point, what specification is the lang= tag? it should probably be encouraged to use some special use codes too, taking ISO 639-2 as example (standard specifying three-letter codes for languages):
mis, for "uncoded languages";
mul, for "multiple languages";
und, for "undetermined";
zxx, for "no linguistic content; not applicable";
where, AFAIK, "mis" would apply for languages not in the spec, probably stuff like Toki Pona
"mul" wouldn't be that useful on its own, but maybe when "mul" is specified and there are specified multiple languages in addition, it could help?
"zxx" would apply for art, because if you happen to have an ascii art gallery or so, there's no point in indexing it fully (you can just have a list of all the names on another page, to get them listed), and no point in reading it aloud...
there's probably a somewhat similar part of whatever the spec is, or similar convention
On 20/05/18 08:35PM, solderpunk wrote: > ISSUE 2: > > Problem: The spec makes a big fuss about how text/gemini is > line-oriented, but does not clearly state what exactly constitutes a > line. The definition of link lines includes a <CR><LF> at the end but > it's not clear if that applies to all line types - or whether I even > meant to do this or it was a careless error. > > Proposal: Actually, it turns out this is decided for us. RFC2046, > which defines the text/* MIME media type and the text/plain subtype > covers this very clearly: > > --- > 4.1.1. Representation of Line Breaks > > The canonical form of any MIME "text" subtype MUST always represent a > line break as a CRLF sequence. Similarly, any occurrence of CRLF in > MIME "text" MUST represent a line break. Use of CR and LF outside of > line break sequences is also forbidden. > > This rule applies regardless of format or character set or sets > involved. > --- > > Since text/gemini is, well, text/gemini, it is a "text" subtype and > using anything other than CRLF means we're violating the RFCs we're > supposedly building on top of. > > So, CRLF everywhere it is. > > I propose it be mostly the server's job to handle this. Text editors > on different operating systems used by content authors will use > various different line break encodings which are beyond our control, > so we can't really make it the author's job. Servers can translate LF > to CRLF before sending content over the network. This way clients > only need to handle the "canonical" format, no matter what authors do. > > Rationale: Don't break foundational RFCs. > > Yeah, I know, this is tedious and no fun for server authors, but, well, > see above. My only concern with this is the "server's job" part. I'd rather not have my server transform user-supplied content, even if it's something as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why should SecretShop fix invalid text/gemini? 
Seems to me this should be handled by something like the gemini vim-syntax plugin.
It also makes writing servers a bit more complicated since text/gemini has to be treated differently from other files and actually parsed versus being directly served up. Not the biggest deal (and you've already admitted it's tedious) but just something I noticed.
>
> ISSUE 4:
>
> Problem: Name-based virtual hosting is explicitly described as being
> supported in the spec, but no mention is made of SNI (Server Name
> Indication, a TLS extension which puts the desired server hostname in
> the TLS handshake). Without this, virtual hosting can't be made to
> work reliably.
>
> Proposal: Mandate use of SNI by clients.
>
SecretShop implements virtual-hosting with the assumption that clients are using SNI, so I'm in favour.
-Steve
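For reference, sending SNI from a client is usually a one-argument affair. The sketch below uses Python's standard `ssl` module, where passing `server_hostname` to `wrap_socket` puts the hostname in the TLS handshake. The function names are made up, the request format (the URL followed by CRLF) is my reading of the spec rather than something spelled out in this thread, and certificate validation is switched off purely to keep the example short; a real client would want TOFU or CA checks instead.

```python
import socket
import ssl


def build_request(url: str) -> bytes:
    """Assumption: a Gemini request is the absolute URL followed by CRLF."""
    return url.encode("utf-8") + b"\r\n"


def fetch(host: str, url: str, port: int = 1965) -> bytes:
    """Fetch `url` from `host`, sending `host` as the SNI name."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Illustration only: no certificate validation is performed here.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=10) as raw:
        # server_hostname is what makes the client send SNI.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(url))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)
```

With this in place a server hosting several domains on one address can pick the matching certificate before it ever sees the request line.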
May 19, 2020 12:51 AM, "Steve Ryan" <stryan at saintnet.tech> wrote: > My only concern with this is the "server's job" part. I'd rather not > have my server transform user-supplied content, even if it's something > as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why > should SecretShop fix invalid text/gemini? Seems to me this should be > handled by something like the gemini vim-syntax plugin. > > It also makes writing servers a bit more complicated since text/gemini > has to be treated differently from other files and actually parsed > versus being directly served up. Not the biggest deal (and you've > already admited it's tedious) but just something I noticed. well, teeeechnically it'd have to do this for all text/* files, according to the rfc, including plaintext and whatnot... although THAT's unlikely to happen... I'd also be in favor of server handling it, although that is a kinda-valid point... html doesn't count because the browsers will do their best to fix all kinds of TOTALLY BROKEN html, you can have partial tags, no end tags for some, etc, and the browser will never tell the user your site is a trashpile, it will just silently try its best to fix it up (even in inspect element, you'd see the "repaired" version)...
On Mon, May 18, 2020 at 10:00:56PM +0000, jan6 at tilde.ninja wrote: > May 19, 2020 12:51 AM, "Steve Ryan" <stryan at saintnet.tech> wrote: > > My only concern with this is the "server's job" part. I'd rather not > > have my server transform user-supplied content, even if it's something > > as minor as line breaks. Apache doesn't attempt to fix invalid HTML, why > > should SecretShop fix invalid text/gemini? Seems to me this should be > > handled by something like the gemini vim-syntax plugin. > > > > It also makes writing servers a bit more complicated since text/gemini > > has to be treated differently from other files and actually parsed > > versus being directly served up. Not the biggest deal (and you've > > already admited it's tedious) but just something I noticed. > > well, teeeechnically it'd have to do this for all text/* files, according to the rfc, including > plaintext and whatnot... although THAT's unlikely to happen... > > I'd also be in favor of server handling it, although that is a kinda-valid point... > html doesn't count because the browsers will do their best to fix all kinds of TOTALLY BROKEN html, > you can have partial tags, no end tags for some, etc, and the browser will never tell the user your > site is a trashpile, it will just silently try its best to fix it up (even in inspect element, > you'd see the "repaired" version)... https://tools.ietf.org/html/rfc5147#section-4.1 " In Internet messages, line endings in text/plain MIME entities are represented by CR+LF character sequences (see RFC 2046 [3] and RFC 3676 [6]). However, some protocols (such as HTTP) additionally allow other conventions for line endings." precedent of protocols allowing for other line endings, so the gemini spec could just say the same thing that http does about line endings. https://tools.ietf.org/html/rfc2616#section-3.7.1 " When in canonical form, media subtypes of the "text" type use CRLF as the text line break. 
HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body. HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP."
It was thus said that the Great jan6 at tilde.ninja once stated: > On Mon, May 18, 2020 at 05:03:41PM -0400, Sean Conner wrote: > > What's a client to do if 'lang=' isn't there? Assume English? Assume nothing? > > I'd think only the mimetype should be mandatory, and the rest will use defaults, when not > specified... > of course, spec shouldn't specify what the defaults are... > > it could also attempt to auto-detect and prompt user if it matters (normal text browsers will > probably be indifferent, but audio browser could ask, and search engines could warn, which will > incentivize users to put a language anyway), but that's a client-specific extra... I thought about autodetection---Unicode is defined in blocks, where each alphabet becomes a defined block in Unicode. I then realized that there are multiple languages that use the European block. Sure, detecting Greek is easy since they have their own alphabet, but what about Spanish, French and German? They use the same alphabet. Nice idea, but there are some tough issues to address. > I'm not sure I see the point in the encoding part, though... > practically everything can be converted to utf8 rather easily, making it a bit useless to > specify... Think legacy documents. And not every legacy encoding scheme can round trip through Unicode---I recall there being issues with several east Asian languages (Chinese, Japanese in particular). > another interesting point, what specification is the lang= tag? Solderpunk mentioned RFC-1766, which uses the two letter standard for languages. > it should probably encouraged to use some special use codes too, taking ISO 639-2 as example > (standard specifying three-letter codes for languages): > mis, for "uncoded languages"; > mul, for "multiple languages"; > und, for "undetermined"; > zxx, for "no linguistic content; not applicable"; I buy that. -spc
Glad to hear about SNI being on the radar. It's a must for virtual hosting.
Any thoughts about SNI interaction with the current "host in request URL is like Host header" in the spec? Since SNI does the virtual hosting part (and better) it would only be useful for proxying other hosts AFAICT.
Is proxying allowed currently in any server? Is it even desirable in the protocol? Or is it just an idea that ossified in the spec without real world use? (Genuine questions! I don't see the use but I'm sure it's been discussed and I'm just late to the party.)
Most servers (all I've tried, circumlunar.space included) fail to handle host-less requests (out of spec) and deny proxying other hosts.
And I'm pretty sure clients are adapting to this behavior. I'm afraid this will end up being the de facto standard even with SNI making it obsolete.
Cheers, kaoD
It was thus said that the Great kaoD once stated:
> Glad to hear about SNI being in the radar. It's a must for virtual hosting.
Yup. At least two servers that I am aware of implement it, GLV-1.12556 (I wrote this one) and gemserv.
> Any thoughts about SNI interaction with the current "host in request URL is
> like Host header" in the spec?
It's how GLV-1.12556 determines what set of content handlers to look through when serving up a request. The public server I have doesn't serve multiple hosts, but the code running it can.
> Since SNI does the virtual hosting part (and
> better) it would only be useful for proxying other hosts AFAICT.
>
> Is proxying allowed currently in any server?
I saw something about gemserv supporting proxying, but I don't know the details of how it works. As for proxy support in GLV-1.12556 (the only server I can speak authoritatively about), it would be easy to write a handler to support a proxy like:
gemini://gemini.conman.org/proxy/mozz.us/journal/2020-05-06.gmi
But if by "proxy" you mean you connect to gemini.conman.org and expect the request itself to be proxied:
gemini://mozz.us/journal/2020-05-06.gmi
that ... could be done, but it would require two things---1) your client would have to know to use the gemini.conman.org certificate to connect to my server and 2) my server would have to know to proxy this domain (and supporting that type of proxy in GLV-1.12556 would require some thought---the server isn't set up for that type of thing [1]).
> Is it even desirable in the
> protocol? Or is it just an idea that ossified in the spec without real
> world use? (Genuine questions! I don't see the use but in sure it's been
> discussed and I'm just late to the party.)
When the suggestion was made to use the URL as the request (which would give us multidomain support with a server), solderpunk also saw a proxy being easy to implement, without thinking about the implications.
> Most servers (all I've tried, circumlunar.space included) fail to handle > host-less requests (out of spec) and deny proxying other hosts. > > And I'm pretty sure clients are adapting to this behavior. I'm afraid this > will end up being the de facto standard even with SNI making it obsolete. Huh? I don't understand the concern here. -spc [1] Nor does it support multiple domains with a single certificate/key pair. Right now, each server requires its own certificate/key file.
It was thus said that the Great Sean Conner once stated: > But if by "proxy" you mean you connect to gemini.conman.org and expect the > request itself to be proxied: > > gemini://mozz.us/journal/2020-05-06.gmi > > that ... could be done, but it would require two things---1) your client > would have to know to use the gemini.conman.org certificate to connect to my > server and 2) my server would have to know to proxy this domain (and > supporting that type of proxy in GLV-1.12556 would require some > thought---the server isn't set up for that type of thing [1]). I just thought of a way it could be done---I could have a copy of the certificate/key pair for the domain I'm proxying for. Then it would be easy to write a handler to forward the request to the original domain and return the contents. -spc > [1] Nor does it support multiple domains with a single certificate/key > pair. Right now, each server requires its own certificate/key file.
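The path-style proxy mentioned earlier in this exchange (a `/proxy/<host>/<path>` URL on the proxying server) boils down to a small rewrite before the request is forwarded. The sketch below is hypothetical Python, not GLV-1.12556's actual handler interface; the `/proxy/` prefix is taken from the example URL and everything else is assumed.

```python
from typing import Optional

# Prefix taken from the example URL in the thread.
PROXY_PREFIX = "/proxy/"


def upstream_url(request_path: str) -> Optional[str]:
    """Map '/proxy/<host>/<path>' to 'gemini://<host>/<path>'.

    Returns None when the path is not a proxy request, so the caller
    can fall through to its normal content handlers.
    """
    if not request_path.startswith(PROXY_PREFIX):
        return None
    rest = request_path[len(PROXY_PREFIX):]
    host, _, path = rest.partition("/")
    if not host:
        return None
    return f"gemini://{host}/{path}"
```

For example, `/proxy/mozz.us/journal/2020-05-06.gmi` maps to `gemini://mozz.us/journal/2020-05-06.gmi`, which the handler would then fetch and relay. A real handler would also want an allow-list of upstream hosts.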
I was also wondering about how autodetection would work. Instead of autodetection how about having the language be on the first line in a file and have the server parse that for the information? Sorry if that was already discussed. int 80
On Mon, May 18, 2020 at 9:04 PM kaoD <elkaod at gmail.com> wrote: > Is proxying allowed currently in any server? Is it even desirable in the protocol? Or is it just an idea that ossified in the spec without real world use? (Genuine questions! I don't see the use but in sure it's been discussed and I'm just late to the party.) Gemserv has reverse proxy support. That's what's being used to proxy gemini to gopher on my site. int 80h
> > Glad to hear about SNI being in the radar. It's a must for virtual hosting. > Yup. At least two servers that I am aware of implement it, GLV-1.12556 (I wrote this one) and gemserv. Jetforce does as well, I'm not sure about others. makeworld
On Mon, May 18, 2020 at 09:45:35PM +0000, jan6 at tilde.ninja wrote:
> I'm not sure I see the point in the encoding part, though...
> practically everything can be converted to utf8 rather easily, making it a bit useless to
> specify...
That's been in the spec since early days, the only proposed change here is the language part. The charset parameter is explicitly optional and the default in its absence is UTF-8, which clients MUST support. Support for additional encodings in clients is optional, but they should fail gracefully in the face of things they can't handle. I expect 90%+ of content to be UTF-8 and for servers not to bother specifying this as it's the explicit default, but it seemed unwise to be completely incapable of handling anything else.
Cheers, Solderpunk
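One way a client might implement "UTF-8 by default, fail gracefully otherwise" is sketched below. `decode_body` is a hypothetical helper; what counts as graceful is a client decision, and here it means a clear error for an unknown charset and replacement characters for malformed bytes.

```python
def decode_body(body: bytes, charset: str = "") -> str:
    """Decode a response body, defaulting to UTF-8 when no charset is given."""
    encoding = charset or "utf-8"
    try:
        return body.decode(encoding)
    except LookupError:
        # The charset name is not one this client (Python, here) knows.
        raise ValueError(f"unsupported charset: {encoding}") from None
    except UnicodeDecodeError:
        # Malformed input: show what can be shown instead of giving up.
        return body.decode(encoding, errors="replace")
```

The caller can catch the `ValueError` and offer to save the raw bytes or hand them to an external viewer.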
On Mon, May 18, 2020 at 08:07:53PM -0400, Sean Conner wrote:
> Sure, detecting Greek is
> easy since they have their own alphabet, but what about Spanish, French and
> German? They use the same alphabet.
I don't think it's viable for interactive user clients (especially light and simple ones) to attempt this, but in the context of, say, a search engine which really wants to categorise everything (which is not to say that GUS necessarily has to shoulder this burden!), even distinguishing languages with the same alphabet is possible by looking at bigram and trigram frequencies if there's enough text. German text will have many more occurrences of "lich" and "heit" than French or Spanish, etc.
> Nice idea, but there are some tough issues to address.
Yeah, this language proposal may have been poorly categorised as "quick and easy" compared to the others.
Cheers, Solderpunk
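To illustrate the n-gram idea, here is a toy sketch. It counts character n-grams inside words and picks the language whose profile shares the most n-grams with the sample. The two training sentences are invented for this example and are far too small for real use; an actual search engine would need profiles built from a large corpus and a better similarity measure.

```python
from collections import Counter
from typing import Dict


def ngrams(text: str, n: int) -> Counter:
    """Count character n-grams per word, padding each word with spaces."""
    letters = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    counts: Counter = Counter()
    for word in letters.split():
        padded = f" {word} "
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def overlap(sample: Counter, profile: Counter) -> int:
    """Number of n-gram occurrences the sample shares with a profile."""
    return sum(min(count, profile[gram]) for gram, count in sample.items())


def guess_language(text: str, profiles: Dict[str, Counter], n: int = 3) -> str:
    """Return the key of the profile that overlaps most with `text`."""
    sample = ngrams(text, n)
    return max(profiles, key=lambda lang: overlap(sample, profiles[lang]))


# Invented, tiny training data; real profiles need far more text.
PROFILES = {
    "de": ngrams("die freiheit und die möglichkeit sind natürlich wichtig", 3),
    "fr": ngrams("la liberté et la possibilité sont naturellement importantes", 3),
}
```

Calling `guess_language("eine herrliche gelegenheit", PROFILES)` picks German here because of shared trigrams such as "lic", "ich", "hei" and "eit", which is the "lich"/"heit" effect described above in miniature.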
Sean Conner writes:
> I thought about autodetection---Unicode is defined in blocks, where each
Please, _please_ don't require autodetection of language or anything else. Gopher required autodetection of charsets and content type, which was horrible. One of the best things about the current protocol is that it makes these things explicit.
Tim
On Tue, May 19, 2020 at 03:04:25AM +0200, kaoD wrote: > Is proxying allowed currently in any server? Is it even desirable in the > protocol? Or is it just an idea that ossified in the spec without real > world use? (Genuine questions! I don't see the use but in sure it's been > discussed and I'm just late to the party.) I used the proxying capabilities of Gemini to write a Gopher-to-Gemini proxy named Agena: https://tildegit.org/solderpunk/agena It is basically a Gemini server, which when it receives requests for gopher:// URLs, downloads the content from Gopher, and, in the case of Gopher menus, converts them to text/gemini menus and replies with that, and in the case of anything else does MIME type and (for text/*) charset detection, and replies with that with the appropriate header. AV-98 lets you specify the host and port of such a proxy and if you try to follow a link into Gopherspace, it routes it through proxy. This makes it possible to seamlessly surf Geminispace and Gopherspace from within the same client with a consistent interface without requiring client authors to go through the trouble of adding support for both protocols. I think this is a super neat idea, but as far as I know AV-98 is the only client which supports it and, since I wrote Agena, multiple other clients have actually gained native Gemini and Gopher support (like Bombadillo and Castor), so I guess the incentive is reduced (which is a shame, because there are at least two public Agena proxies running, as far as I know, one of which only went up very recently). But I think it demonstrates that the proxying idea does have really interesting and useful applications. 
If you wanted to surf Geminispace from work during your lunch break but your employer's firewall disallows outgoing traffic on port 1965, you could set up a proxying Gemini server on a machine you control which listens on a port other than 1965, and restrict access to it by requiring a client certificate which is on a whitelist, and whitelist only your work machine(s), to work around this. I'm sure other examples exist, too.
Cheers, Solderpunk
On Mon, May 18, 2020 at 05:37:15PM -0500, epoch wrote: > https://tools.ietf.org/html/rfc5147#section-4.1 > > " In Internet messages, line endings in text/plain MIME entities are > represented by CR+LF character sequences (see RFC 2046 [3] and RFC > 3676 [6]). However, some protocols (such as HTTP) additionally allow > other conventions for line endings." > > precedent of protocols allowing for other line endings, so the gemini spec > could just say the same thing that http does about line endings. > > https://tools.ietf.org/html/rfc2616#section-3.7.1 > > " When in canonical form, media subtypes of the "text" type use CRLF as > the text line break. HTTP relaxes this requirement and allows the > transport of text media with plain CR or LF alone representing a line > break when it is done consistently for an entire entity-body. HTTP > applications MUST accept CRLF, bare CR, and bare LF as being > representative of a line break in text media received via HTTP." Aaah, wonderful! Thanks for bringing this to my attention. I *had* been wondering whether or not mainstream webservers were doing newline translation for text/html, which I had convinced myself they should have been doing, in principle. Okay, so we can actually spec whatever we like here for stuff that goes over the wire. CRLF is only mandatory for "canonical form" (although if that's not what is on the disk and not what is on the wire, what the heck *is* it?). So now the question is which do we prefer, forcing servers to canonicalise (canonise?) everything, with client behaviour then being strictly limited, or letting servers just send whatever the author uses and forcing clients to accept any of CRLF, CR or LF? In general I have preferred pushing burden to server authors over client authors for Gemini...but I expect the client burden is pretty small here as most languages will have some kind of functionality for breaking strings into lines which already handles this... Cheers, Solderpunk
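If the burden goes to clients instead, accepting all three conventions is a few lines in most languages. A Python sketch follows, with a made-up function name. It uses an explicit pattern rather than `str.splitlines()`, because `splitlines()` also breaks on characters such as form feed and U+2028, which is more than HTTP's rule covers.

```python
import re
from typing import List

# CRLF comes first so that it counts as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> List[str]:
    """Split on CRLF, bare CR, or bare LF; drop the empty piece after a final break."""
    lines = _LINE_BREAK.split(text)
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Blank lines inside the document are preserved, which matters for text/gemini since an empty line is meaningful there.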
On Tuesday 19 May 2020 02:07, Sean Conner <sean at conman.org> wrote:
> I thought about autodetection---Unicode is defined in blocks, where each
> alphabet becomes a defined block in Unicode. I then realized that there are
> multiple languages that use the European block. Sure, detecting Greek is
> easy since they have their own alphabet, but what about Spanish, French and
> German? They use the same alphabet.
Autodetection is necessary for documents that use multiple languages. A browser preference is fine for the hard-to-detect cases.
> > I'm not sure I see the point in the encoding part, though...
> > practically everything can be converted to utf8 rather easily, making it a bit useless to
> > specify...
>
> Think legacy documents. And not every legacy encoding scheme can round
> trip through Unicode---I recall there being issues with several east Asian
> languages (Chinese, Japanese in particular).
UTF-8 can represent all existing languages and more :-). Legacy documents that have not been converted to UTF-8 are simply out of scope for a Gemini browser. Must a Gemini browser be able to display every kind of document?
Regards, freD.
May 19, 2020 10:46 AM, "solderpunk" <solderpunk at sdf.org> wrote: > So now the question is which do we prefer, forcing servers to > canonicalise (canonise?) everything, with client behaviour then being > strictly limited, or letting servers just send whatever the author uses > and forcing clients to accept any of CRLF, CR or LF? > > In general I have preferred pushing burden to server authors over client > authors for Gemini...but I expect the client burden is pretty small here > as most languages will have some kind of functionality for breaking > strings into lines which already handles this... > > Cheers, > Solderpunk well, if it's ONLY CRLF, then it enables making basic clients in *ANY* language, including pure assembly, if someone were brave enough (well, the core part, you'd probably write the TLS part in C or something, still) maybe change the spec slightly, so, after the header, there's CRLF, immediately after which is an empty line with the line ending that applies for the rest of the document, and require at least one non-empty line after that? that wouldn't matter for more advanced languages, which can just ignore that and use builtin functions, but for more basic ones, it makes it easy to read, "is it LF, else is it CR followed by LF, else assume CR"... while it would be possible to do that with normal content lines too, it'd be harder, as you can't just "eat" a byte or two from the start...
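A sketch of the sentinel-line idea above, following the "is it LF, else is it CR followed by LF, else CR" order. This only illustrates a proposal made in this message and is not part of any spec; the function names are invented, and unlike the "else assume CR" wording the sketch rejects a body that starts with none of the three.

```python
from typing import List


def detect_line_ending(body: bytes) -> bytes:
    """Read the document's line ending from its leading empty line."""
    if body[:1] == b"\n":
        return b"\n"
    if body[:2] == b"\r\n":
        return b"\r\n"
    if body[:1] == b"\r":
        return b"\r"
    # Stricter than "else assume CR": refuse bodies without a sentinel line.
    raise ValueError("body does not start with an empty sentinel line")


def split_body(body: bytes) -> List[bytes]:
    """Strip the sentinel line, then split the rest on the detected ending."""
    ending = detect_line_ending(body)
    return body[len(ending):].split(ending)
```

The appeal for very low-level clients is that only the first one or two bytes need inspecting, after which a single fixed delimiter is used for the whole document.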
On Tuesday 19 May 2020 09:20, solderpunk <solderpunk at SDF.ORG> wrote: > I don't think it's viable for interactive user clients (especially light > and simple ones) to attempt this, but in the context of, say, a search > engine which really wants to categorise everything (which is not to say > that GUS necessarily has to shoulder this burden!), even distinguishing > languages with the same alphabet is possible by looking at bigram and > trigram frequencies if there's enough text. German text will have many > more occurences of "lich" and "heit" than French or Spanish, etc. Agree and french have ???, spanish ?? and german ? :-) Nice to have UTF-8 to display all of them in the same document...
On Tuesday 19 May 2020 09:29, plugd <plugd at thelambdalab.xyz> wrote: > Please, please don't require autodetection of language or anything > else. Gopher required autodetection of charsets and content type, which > was horrible. One of the best things about the current protocol is that > it makes these things explicit. Having multiple charsets is horrible :-) UTF-8 will rule them all, and converters will clean things up. Historical documents with ancient encodings should be displayed using an external viewer.
I'd LOVE if there was a way to indicate multiple languages in one document... if you write parts in different languages (maybe are writing articles about conlangs, or want to have some quotes from another language, or have language tutorials or comparisons or something), then it would be really useful to allow signifying the change... and as Nicole said in another thread, > FWIW, this is also a necessity for writing both Chinese and Japanese in the same document, due to the Han unification of Unicode. would need to be another block type thingy, probably... while it's complicating the spec a little, it'd be quite useful for some people...
On Tue, May 19, 2020 at 08:12:17PM +0000, jan6 at tilde.ninja wrote: > while it's complicating the spec a little, it'd be quite useful for some people... Yeah, but *everything*'s quite useful for some people, and individually each little useful thing is not *that* much more complicated, and then one day... Which is not to say nothing like this will ever happen, I actually think that really good multilingual support is a lot more important than many other things and might be worth the cost. But I've definitely changed my mind that this is a quick and easy thing which can be snuck in along with defining line breaks and SNI and stuff. We can take our time to figure out a good solution to this. Cheers, Solderpunk
I will say that "allowing someone to send a lang flag per-page" gets you the vast majority of the way there, and is a necessity. Changing languages is an addition to think about later, imo. Nicole
I wonder how a native RTL reader should format a gemini document? Should #Big Title Become: eltiT giB# or: #eltiT giB :-) Regards, freD.
On Tue, May 19, 2020 at 07:46:25AM +0000, solderpunk wrote: > So now the question is which do we prefer, forcing servers to > canonicalise (canonise?) everything, with client behaviour then being > strictly limited, or letting servers just send whatever the author uses > and forcing clients to accept any of CRLF, CR or LF? > > In general I have preferred pushing burden to server authors over client > authors for Gemini...but I expect the client burden is pretty small here > as most languages will have some kind of functionality for breaking > strings into lines which already handles this... Some quick experimentation with Python revealed that str.splitlines() is, indeed, smart enough to handle CRLF, CR and LF as linebreaks. Encouraged by this, I made a single Gemini page that mixes all three styles in a single document: gemini://gemini.circumlunar.space/users/solderpunk/linebreaks.gmi I visited it with a selection of clients, but the results weren't encouraging. It rendered just fine in AV-98, because apparently Python is awesome at this. Curiously, McCross, also in Python, could handle CR fine, split CRLF into lines but displayed some stray characters at the end of lines, and did not split LF lines at all. Perhaps this is because it's a GUI client and is relying on Tkinter to render text? Bombadillo and Castor (an old build of Castor I had lying around, admittedly) both handled CR and CRLF fine, but did not split LF lines at all. The same was true of my tiny demo Lua and Go clients. So, speccing any line ending as permissible, like HTML does, would seemingly immediately render most clients out-of-spec. Specifying CR or CRLF but not LF would require the minimum amount of rework, but it would be hard to justify this by anything other than laziness. We either go strict, and require CRLF only, or we go permissive, and allow anything, but we can't really pick and mix for the sake of convenience.
We basically need to choose between forcing server authors to normalise all endings to CRLF or forcing client authors to recognise LF (even though it'll probably never be seen in the wild). Neither really thrills me, but, well, here we are. Cheers, Solderpunk
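The Python behaviour described in the message above is easy to reproduce. The snippet below is a minimal sketch with a made-up test string, not the contents of the linebreaks.gmi page. One thing to keep in mind is that str.splitlines() also breaks on a few other characters (form feed, U+2028 and so on), so it is slightly more permissive than "CRLF, CR or LF".

```python
# A body that mixes all three line-ending styles in one string.
mixed = "first\r\nsecond\rthird\nfourth"

# str.splitlines() treats CRLF, CR and LF each as a single line break.
print(mixed.splitlines())

# By contrast, splitting on "\n" alone leaves a stray "\r" on CRLF lines
# and does not break on a lone CR at all.
print(mixed.split("\n"))
```

The second call shows the kind of result a client gets when it only splits on one separator: stray characters at the ends of lines, and some lines not split at all.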
It was thus said that the Great solderpunk once stated: > > So, speccing any line ending as permissible, like HTML does, would > seemingly immediately render most clients out-of-spec. Please, *please*, *PLEASE* do not let this dissuade you from making most clients out of spec. We have had to suffer terrible things because of this thinking (like Makefiles and tabs---"when I realized my mistake, there were already 10 people using it"). > Specifying CR or CRLF but not LF would require the minimum amount of > rework, but it would be hard to justify this by anything other than > laziness. And bizarreness. The last time I worked on any systems that used only CR was back in the 80s. Today everybody uses either CRLF (Windows) or LF (Linux, Mac OS X, whatever remaining bits of Unix are still around). > We basically need to choose between forcing server authors to normalise > all endings to CRLF or forcing client authors to recognise LF (even > though it'll probably never be seen in the wild). Would that be only for text/gemini? Or all of the text/* formats? -spc
May 21, 2020 11:47 PM, "solderpunk" <solderpunk at sdf.org> wrote: > We basically need to choose between forcing server authors to normalise > all endings to CRLF or forcing client authors to recognise LF (even > though it'll probably never be seen in the wild). > > Neither really thrill me, but, well, here we are. "it'll probably never be seen in the wild" why? just LF is the standard linux line ending, and if it's allowed, I'm fairly certain there will be several servers which would serve the files as-is, and given they'd likely be on some sort of unix machine, likely be ending with LF. CR, on the other hand, is very unlikely to be found in the wild, since it's not used by any living OS, afaik, and also should, in theory, just move cursor to the start of line, instead of putting a newline, at least on linux terminals... which would overwrite the line...
On Thu, May 21, 2020 at 08:56:57PM +0000, jan6 at tilde.ninja wrote: > May 21, 2020 11:47 PM, "solderpunk" <solderpunk at sdf.org> wrote: > > We basically need to choose between forcing server authors to normalise > > all endings to CRLF or forcing client authors to recognise LF (even > > though it'll probably never be seen in the wild). > > > > Neither really thrill me, but, well, here we are. > > "it'll probably never be seen in the wild" why? > just LF is the standard linux line ending, and if it's allowed, I'm fairly certain there will be several servers which would serve the files as-is, and given they'd likely be on some sort of unix machine, likely be ending with LF > > CR on the other hand, is very unlikely to be found in the wild, since it's not used by any living OS, afaik, and also should, in theory, just move cursor to the start of line, instead of putting a newline, at least on linux terminals... which would overwrite the line... Argh, yep, sorry. Exchange *all* instances of lone CR and lone LF in my entire last email, I mixed them up. Cheers, Poldersunk
On 21-May-2020 21:47, solderpunk wrote: > On Tue, May 19, 2020 at 07:46:25AM +0000, solderpunk wrote: >> So now the question is which do we prefer, forcing servers to >> canonicalise (canonise?) everything, with client behaviour then being >> strictly limited, or letting servers just send whatever the author uses >> and forcing clients to accept any of CRLF, CR or LF? >> >> In general I have preferred pushing burden to server authors over client >> authors for Gemini...but I expect the client burden is pretty small here >> as most languages will have some kind of functionality for breaking >> strings into lines which already handles this... > Some quick experimentation with Python revealed that str.splitlines() > is, indeed, smart enough to handle CRLF, CR and LF as linebreaks. > Encouraged by this, I made a single Gemini page that mixes all three > styles in a single document: > > gemini://gemini.circumlunar.space/users/solderpunk/linebreaks.gmi > > I visited it with a selection of clients, but the results weren't > encouraging. > > It rendered just fine in AV-98, because apparently Python is awesome at > this. As a client author, I think we have to expect a mix of line endings, but perhaps not commonly in the same document. But even this will happen at some stage when you have multiple authors contributing to the same file. Most text editors will handle this without complaint. I think it is not surprising that servers will generally just serve up the GMI file as is. End users/authors will simply use whatever text editor they like and whatever defaults their OS uses. You can normalise them very simply just before display as follows (or a similar scheme): - Replace all CRLF with LF in the document. - Replace all plain CR with LF. - Then display, splitting at LF. - Job done. This is not onerous or a big performance hit as GMI files are, in the grand scheme of things, *very small* text files. Works for me in my WIP client as a first approximation - the above link looks fine.
But if the language has a library for this, we should just use it. Best Wishes - Luke
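The normalisation steps listed in the message above translate almost directly into code. This is a minimal sketch and not the code from Luke's client; `normalise_line_endings` and `display_lines` are hypothetical names chosen for this example.

```python
def normalise_line_endings(text):
    """Rewrite CRLF and lone CR as LF.

    Order matters: CRLF is collapsed first, so that the lone-CR pass
    does not turn one "\r\n" into two line breaks.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def display_lines(text):
    """Normalise, then split at LF, ready for rendering."""
    return normalise_line_endings(text).split("\n")
```

Both passes are linear in the size of the document, which fits the point made above that this is not a meaningful performance cost for small text files.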
On Thu, May 21, 2020 at 04:54:47PM -0400, Sean Conner wrote: > Please, *please*, *PLEASE* do not let this dissuade you from making most > clients out of spec. We have had to suffer terrible things because of this > thinking (like Makefiles and tabs---"when I realized my mistake, there were > already 10 people using it"). Point taken. > > Specifying CR or CRLF but not LF would require the minimum amount of > > rework, but it would be hard to justify this by anything other than > > laziness. > > And bizarreness. The last time I worked on any systems that used only CR > was back in the 80s. Today everybody uses either CRLF (Windows) or LF > (Linux, Mac OS X, whatever remaining bits of Unix are still around). Yep, as stated, this should have been "LF or CRLF". Speccing that would require minimal changes to servers (which could just serve files verbatim and assume nobody is generating CR-only files in this day and age), and clients (most of which could handle these two with either no trouble or minimal trouble). But doesn't just picking two out of the three seem lazy? Or is that just me? Is it okay in 2020 to just write bare CR out of existence? > > We basically need to choose between forcing server authors to normalise > > all endings to CRLF or forcing client authors to recognise LF (even > > though it'll probably never be seen in the wild). > > Would that be only for text/gemini? Or all of the text/* formats? Ugh. Let me think... Cheers, Solderpunk
On Thu, 21 May 2020, solderpunk wrote: > We basically need to choose between forcing server authors to normalise > all endings to CRLF or forcing client authors to recognise LF (even > though it'll probably never be seen in the wild). Could we have a bit of a breather to allow the implications to sink in, and, critically, to allow the development of conformance testing tools? If there were a tool which could be run on a document, that confirmed that it was conformant, and a similar tool for server behaviour, and people had had some time to try to integrate these with the existing software, it'd be easier to assess the tradeoffs involved in the spec decision. Mk -- Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/
On 21 May 2020 at 22:47, solderpunk <solderpunk at sdf.org> wrote: > Bombadillo and Castor (an old build of Castor I had lying around, > admittedly) both handled CR and CRLF fine, but did not split LF lines at > all. The same was true of my tiny demo Lua and Go clients. FWIW this wouldn't be an issue to change in Castor as I had to add code to split on CRLF explicitly if I recall correctly.
In the CR, LF, and CRLF debate I find myself in favor of not worrying about the relic that is CR only. No modern systems use this and, for a new protocol, I don't see a compelling reason to support ancient systems. Particularly given that even among ancient systems just a CR as a line ending was a rarity. (The below quote has been edited to provide the character that was actually meant, per solderpunk's later post): > Bombadillo and Castor (an old build of Castor I had lying around, > admittedly) both handled LF and CRLF fine, but did not split CR lines at > all. Bombadillo should be able to handle two of the three. For dealing with lines I split on newline and remove trailing whitespace from all lines when rendering. Bombadillo will also never render a `\r` character (the line wrapping parses them out of the render for display purposes). I imagine Castor and Asuka are doing something vaguely similar. > Some quick experimentation with Python revealed that str.splitlines() > is, indeed, smart enough to handle CRLF, CR and LF as linebreaks. That _is_ pretty cool that the Python method was able to detect \r line endings and split them though. Points to Python on that one.
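The "split on newline and remove trailing whitespace" approach described above can be sketched like this. It is an illustration of the technique only, written in Python rather than taken from Bombadillo's source, and `render_lines` is a hypothetical name.

```python
def render_lines(text):
    """Split on LF only, then strip trailing whitespace from each line.

    Trailing whitespace includes "\r", so LF and CRLF documents both
    come out clean. A lone CR in the middle of a line is not a split point.
    """
    return [line.rstrip() for line in text.split("\n")]
```

This handles two of the three styles, as the message says: a CR-only document comes back as a single unsplit line.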
I think what Martin is saying is some of what I was trying to convey with my "gemini-fmt idea" proposal on another thread: that Gemini content authors run their texts through a spec-conforming tool before a text ever reaches the servers and clients of Geminispace. Whatever combination of CR, LF or CR LF is in the pre-conformed text, it will be correct when it gets on the server and correct when it gets to the client if run through such a tool. The people who are writing the server and client software are doing a huge service to the Gemini community, so please don't saddle them with the admittedly tedious work of writing code to check in the server and then check again in the client if it is an improper combination of CR, LF or CR LF and then even more code to re-conform it. That kind of thing really seems to go against the 100 lines of code, code it in a weekend Gemini spirit in my opinion. Thanks for your consideration. J
On 21-May-2020 22:17, Martin Keegan wrote: > On Thu, 21 May 2020, solderpunk wrote: > >> We basically need to choose between forcing server authors to normalise >> all endings to CRLF or forcing client authors to recognise LF (even >> though it'll probably never be seen in the wild). > > Could we have a bit of a breather to allow the implications to sink > in, and, critically, to allow the development of conformance testing > tools? > > If there were a tool which could be run on a document, that confirmed > that it was conformant, and a similar tool for server behaviour, and > people had had some time to try to integrate these with the existing > software, it'd be easier to assess the tradeoffs involved in the spec > decision. I would think Postel's rule should pragmatically apply to line endings in the response body, but the spec should definitely be very specific about line endings in any headers (as HTTP is). But those are generated by the server anyway. If we force one line ending kind on authors, it will be a deterrent to them forging ahead with writing user content if they have to use some tool just to be able to get the content onto the server. Look at XHTML, it was a resounding failure and rejected by authors even though there were some good (and ill conceived) intentions of the spec writers. Best Wishes - Luke
On Thu, May 21, 2020 at 10:28:44PM +0100, Luke Emmet wrote: > I would think Postel's rule should pragmatically apply to line endings in the > response body, but the spec should definitely be very specific about line > endings in any headers (as HTTP is). But those are generated by the server > anyway. No question, the response header ends in CRLF as per internet spec convention. This discussion is purely about the response body. Cheers, Solderpunk
On Thu, 21 May 2020, Luke Emmet wrote: > If we force one line ending kind on authors, it will be a deterrent to them > forging ahead with writing user content if they have to use some tool just to > be able to get the content onto the server. Look at XHTML, it was a > resounding failure and rejected by authors even though there were some good > (and ill conceived) intentions of the spec writers. I believe there's a difference between requiring servers only to use CRLF in the body, and content authors only to use CRLF. If it were specified that servers "MUST NOT" send text/gemini bodies that use line separators other than CRLF, then it remains up to server implementers to choose whether 1) to require content authors to develop or save their content with CRLFs, or 2) to translate the content themselves (either on the fly, or by caching a fettled version with the right line separators). This is why I keep banging on about having a gemini-check tool for file formats (though I understand someone may now have written one up in Go). Mk -- Martin Keegan, +44 7779 296469, @mk270, https://mk.ucant.org/
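A conformance check of the kind discussed above could, under a CRLF-only rule, start out as small as the sketch below. It is not the gemini-check tool mentioned in the message, nor the Go implementation; `bare_line_endings` and `to_crlf` are hypothetical names, and the second function corresponds to option 2 (the server translating content itself).

```python
import re

# Matches a CR not followed by LF, or an LF not preceded by CR.
_BARE = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def bare_line_endings(body):
    """Return the byte offsets of line separators that are not CRLF."""
    return [m.start() for m in _BARE.finditer(body)]


def to_crlf(body):
    """Rewrite every bare CR or bare LF as CRLF, leaving existing CRLF alone."""
    return _BARE.sub(b"\r\n", body)
```

An empty list from `bare_line_endings` would mean the body is conformant under this assumed rule; a server choosing option 2 could run `to_crlf` on the fly or once when caching.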
On Thu, May 21, 2020 at 04:28:19PM -0500, ??? wrote: > The people who are writing the server and client software are doing a huge > service to the Gemini community, so please don't saddle them with the > admittedly tedious work of writing code to check in the server and then > check again in the client if it is an improper combination of CR, LF or CR > LF and then even more code to re-conform it. I'd rather make life a little harder for developers - who are technical people who know what a CR and LF are and are, anyway, signing up for a bit of fiddly detail by undertaking to implement an internet protocol from scratch - than make life harder for content authors, who may have no idea what this nonsense is all about and are arguably doing an even bigger service for the community by providing something to use our present abundance of servers and clients to read. Sorry if this seems blunt! Cheers, Solderpunk
On 21-May-2020 22:50, Martin Keegan wrote: > On Thu, 21 May 2020, Luke Emmet wrote: > >> If we force one line ending kind on authors, it will be a deterrent >> to them forging ahead with writing user content if they have to use >> some tool just to be able to get the content onto the server. Look at >> XHTML, it was a resounding failure and rejected by authors even >> though there were some good (and ill conceived) intentions of the >> spec writers. > > I believe there's a difference between requiring servers only to use > CRLF in the body, and content authors only to use CRLF. If it were > specified that servers "MUST NOT" send text/gemini bodies that use > line separators other than CRLF, then it remains up to server > implementers to choose whether > > 1) to require content authors to develop or save their content with > CRLFs, or > > 2) to translate the content themselves (either on the fly, or by > caching a fettled version with the right line separators). I see better now what the alternative suggestions are, thanks. I still think a robust client has to, in all likelihood, anticipate both CRLF and LF. There will be a range of servers out there, and just getting the content of the file sent to you isn't going to be unusual. Probably plain Mac/CR is too ancient to be widely found, but see below. Another consideration that comes to me now is for any applications of Gemini that might require a binary correct transfer of a text file from A to B. For example, if we are using Gemini as a content transfer layer for a source code repository. There is already a Git over Gemini service which seems interesting. In these scenarios, as the end user, you would expect that a GMI file served to you from the repo was the same as it is on the server. So I think we should support all "normal" plain text formats as the response body and not in general have the server munging or adjusting them. Best Wishes - Luke
But is it a zero sum game like that really (either make it hard on server authors or make it hard on content authors)? The would-be Gemini content author at this point in the game will be someone who either has created an SSH key and SSHed into a pubnix, used Git to send their content to a Dome style server or used sftp to upload their hand written Gemini text. Is it truly making their life harder to enter: gemini fmt mytext.gemini<Enter> and thereby save the server and client authors from having to do all of that checking logic? There are still many other ways they can enjoy the moving target development against an evolving spec, aren't there? :) I hope my idea doesn't seem hostile somehow, because it's not intended to be in any way. I just figure save everybody effort and complexity, everybody wins: feed two birds with one seed. Thanks
On Thu, May 21, 2020 at 1:47 PM solderpunk <solderpunk at sdf.org> wrote: > Some quick experimentation with Python revealed that str.splitlines() > is, indeed, smart enough to handle CRLF, CR and LF as linebreaks. For what it's worth, the Rust standard library is designed to work with both LF ("\n") and CRLF ("\r\n") line endings, but not CR alone ("\r"): https://github.com/rust-lang/rfcs/blob/master/text/1212-line-endings.md In the discussion that led to this, there was no real support for supporting "\r" line endings, nor for other uncommon newlines like U+2028 LINE SEPARATOR: https://github.com/rust-lang/rfcs/pull/1212 I think text/gemini should allow both "\n" and "\r\n". Requiring a single style would either place a burden on content authors (on servers that don't auto-convert line endings), or make servers more complex and less efficient. (They would need to inspect every text/gemini file and possibly transform it, rather than stream it directly from disk to the network.) Meanwhile, I think client software can handle either option fairly simply. For the protocol header, I don't have a strong opinion. Any choice we make is fairly easy to implement in both server and client. However, I have a slight preference for specifying just "\n" as the header terminator. One byte is (marginally) more efficient than two, and in some languages it's slightly simpler to split on a single-byte delimiter.
I am a strong proponent of "follow the existing practice if there isn't good reason to change". I personally think CRLF should be recommended, but not required for content, as text/* is wont to do over HTTP; I think expecting clients to deal with any of {CR, LF, CRLF} is totally fine. However, the header should _absolutely_ end in CRLF, as every existing protocol works this way. Nicole
Isn't CRLF a DOS/Windows thing? Why use it at all? -- gemini://kwiecien.us/
CRLF is the canonical representation of text/* file types; see https://tools.ietf.org/html/rfc2616#section-3.7.1 Nicole -------- Original Message -------- On Thursday, May 21, 2020 7:12 PM, Ben <benulo at systemli.org> wrote: > Isn't CRLF a DOS/Windows thing? Why use it at all? > > gemini://kwiecien.us/
On Thu, May 21, 2020 at 7:02 PM Nicole Mazzuca <nicole at strega-nil.co> wrote: > However, the header should _absolutely_ end in CRLF, as every existing protocol works this way. Yeah, that's fair. I've just caught up on the parts of this thread that happened before I subscribed, and I agree that making the protocol work the same as other line-based protocols is compelling. (And Solderpunk said that this part is already decided, anyway, which is fine.)
jan6 at tilde.ninja writes: > I'd also be in favor of server handling it, although that is a kinda-valid point... > html doesn't count because the browsers will do their best to fix all kinds of TOTALLY BROKEN html, > you can have partial tags, no end tags for some, etc, and the browser will never tell the user your > site is a trashpile, it will just silently try its best to fix it up (even in inspect element, > you'd see the "repaired" version)... And this is exactly what's going to happen with gemini too. The second you have >=2 clients around, there's going to be a race to accept the widest range of content, regardless of errors. Besides places where it puts users' privacy/security at risk, clients can never be expected to police compliance of the document with the spec. (After all, the client user is almost always someone who bears no responsibility for the generation of the document.)
On Fri, May 22, 2020 at 06:42:30AM +0430, Ben wrote: > Isn't CRLF a DOS/Windows thing? Why use it at all? It's not often I'll say anything that might be perceived as "sticking up for Microsoft" in the context of storing text (I mean, really, Office
solderpunk writes: > be different just for the sake of different. DOS inherited CRLF from > CP/M, and Windows is just the "last man standing" from a long CRLF > tradition dating back to the days of physical teletypes, and which > included systems with a lot more geek street cred, like DEC's TOPS-10 > and RT-11. Which makes sense, given that you need a line feed to advance to the next line, and a carriage return to return the carriage to the first column. :-)
On Fri, May 22, 2020 at 09:01:45AM +0200, plugd wrote: > > jan6 at tilde.ninja writes: > > html doesn't count because the browsers will do their best to fix all kinds of TOTALLY BROKEN html, > > you can have partial tags, no end tags for some, etc, and the browser will never tell the user your > > site is a trashpile, it will just silently try its best to fix it up > > And this is exactly what's going to happen with gemini too. I kind of hope that it won't ever be possible for a text/gemini page to
On Fri, May 22, 2020 at 05:01:34PM +0200, plugd wrote: > Which makes sense, given that you need a line feed to advance to the > next line, and a carriage return to return the carriage to the first > column. :-) Yeah, it's not even a "DEC thing", it's a "cold, hard, mechanical reality" thing! Cheers, Solderpunk
Thanks, everybody, for your thoughts on this matter! I've made a decision. Let me start out by saying how thoroughly ridiculous it is that in 2020 people can still have spirited discussions about how the concept of "a line of text" should work! I hope that one day this is standardised once and for all across all systems, but I won't hold my breath. The line of thinking to my decision has gone something like this: While it's true that the spec currently is totally ambiguous on what our supposedly line-oriented format uses to define a line, and that's a genuine problem which needs to be fixed because specs *shouldn't* be ambiguous about this kind of thing (thanks to everybody who flagged this issue!), it's also true that as far as I'm aware this ambiguity has created precisely zero interoperability problems for anybody. In a situation where some spec detail is ambiguous but everything is working 100% fine, the obvious default course of action should be to codify whatever the current practice is. That seems to be everybody using plain LF. So, there would need to be a very compelling reason *not* to allow plain LF. I can't think of any and nobody has mentioned any, so, first conclusion: plain LF has to be allowed. As previously noted, it is already IETF-mandated that the canonical representation of anything with a text/* MIME type is CRLF. HTTP sets a precedent of protocols being able to permit additional line endings on top of that, but it seems a very different proposition for a protocol to
On 22-May-2020 17:07, solderpunk wrote: > Thanks, everybody, for your thoughts on this matter! I've made a > decision > > <snip> > > This solves the ambiguity of the spec on this matter, and should involve > very little actual work from anybody to attain/retain compliance - which > is exactly how it should be, given that everything is already working > just fine despite the ambiguity. In particular, given the complete lack > of modern systems using anything other than LF or CRLF, server munging > should not be necessary and Gemini content can be served as verbatim > binary data from the filesystem, which is a very desirable property Not that you need my endorsement, but this gets a thankful thumbs up from me. - Luke
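The practical upshot for a client, given that only LF and CRLF are expected in the wild, can be sketched roughly as below. This is an illustration rather than anything posted to the list: the function name `split_gemini_lines` is made up, and the body is assumed to be UTF-8.

```python
from typing import List


def split_gemini_lines(body: bytes) -> List[str]:
    """Split a text/gemini body into lines, accepting LF or CRLF endings.

    Hypothetical helper; assumes the body is UTF-8 encoded.
    """
    text = body.decode("utf-8")
    lines = text.split("\n")
    # A body that ends with a newline yields one empty trailing element.
    if lines and lines[-1] == "":
        lines.pop()
    # Drop the single CR left behind by a CRLF ending.
    return [line[:-1] if line.endswith("\r") else line for line in lines]


if __name__ == "__main__":
    print(split_gemini_lines(b"# Title\r\n=> /about About\nplain text\n"))
```

Splitting on LF and trimming one CR is used here instead of `str.splitlines()`, because `splitlines()` also breaks on characters such as form feed and U+2028, which the discussion above does not treat as line endings.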
solderpunk writes: > On Fri, May 22, 2020 at 09:01:45AM +0200, plugd wrote: >> And this is exactly what's going to happen with gemini too. > > I kind of hope that it won't ever be possible for a text/gemini page to > *be* a trashpile. It's genuinely very hard to create something > "invalid". There is no concept of anything coming in opening/closing > pairs. Whitespace is always optional so it doesn't matter whether it's > there or not. Definitely agree, at the moment it all seems very close to optimal. And sorry, I didn't mean to sound so defeatist! I was just getting worried that some of the ideas being proposed for closing exploitation holes in text/gemini were of this "unenforceable" variety. > A link line where the first thing after => can't be parsed as a URL is a > genuine invalidity, but remember that relative links are allowed, so > even: > > => one two three > > is fine (a relative link to a path ending in "one", with label "two > three"). Something like: > > => foo://bar://baz:// Chew on this! > > is a real problem, but it's not going to happen by accident. True. And _all_ clients would have to go out of their way and conspire together to interpret this as anything but garbage. ... then again, if clients start to act defensively and ignore malformatted link lines, it wouldn't be impossible for document authors to start incorporating intentionally malformatted lines containing non-standard directives: => foo://bar://inline-image:cat.jpg Has anybody developed a general theory of specification creep? :-) Tim
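To make the "=> one two three" reading concrete, here is a rough sketch of how a client might pull the target and label out of a link line. The name `parse_link_line` is hypothetical, and the sketch deliberately does not validate the target as a URL, so it returns the "foo://bar://baz://" case unchanged and leaves the accept-or-reject decision to the caller.

```python
from typing import Optional, Tuple


def parse_link_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (target, label) for a text/gemini link line, else None.

    Hypothetical helper: no URL validation is attempted here.
    """
    if not line.startswith("=>"):
        return None
    rest = line[2:].strip()
    if not rest:
        # "=>" with nothing after it carries no target at all.
        return None
    # The first whitespace-delimited token is the target; the rest is the label.
    parts = rest.split(None, 1)
    target = parts[0]
    label = parts[1].strip() if len(parts) > 1 else ""
    return target, label


if __name__ == "__main__":
    print(parse_link_line("=> one two three"))
    print(parse_link_line("=> foo://bar://baz:// Chew on this!"))
```

Whether a client then ignores an unparseable target or shows it as plain text is exactly the policy question raised above; the sketch takes no position on it.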
On 5/22/2020 at 9:13 AM, "Luke Emmet" <luke.emmet at gmail.com> wrote: > >On 22-May-2020 17:07, solderpunk wrote: >> Thanks, everybody, for your thoughts on this matter! I've made a >> decision >> >Not that you need my endorsement, but this gets a thankful thumbs >up >from me. Ditto. This seems very well thought out to me. Kudos! ~Simon
On Fri, May 22, 2020 at 10:24:55AM -0700, Simon Forman wrote: > On 5/22/2020 at 9:13 AM, "Luke Emmet" <luke.emmet at gmail.com> wrote: > > > >On 22-May-2020 17:07, solderpunk wrote: > >> Thanks, everybody, for your thoughts on this matter! I've made a > >> decision > >> > >Not that you need my endorsement, but this gets a thankful thumbs > >up > >from me. > > Ditto. This seems very well thought out to me. Kudos! > Thank you both! Cheers, Solderpunk
It was thus said that the Great plugd once stated: > > Has anybody developed a general theory of specification creep? :-) Yes. There's Zawinski's Law: Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can. You can read some commentary about it from these links: https://en.wikipedia.org/wiki/Jamie_Zawinski#Principles https://softwareengineering.stackexchange.com/questions/150254/what-does-jamie-zawinskis-law-mean -spc (Every program has at least one bug and can be shortened by at least one instruction---from which, by induction, one can deduce that every program can be reduced to one instruction which doesn't work. [1]) [1] This actually happened. Not to me, but I know of such a case.
On Tue, May 19, 2020 at 09:21:13AM +0000, defdefred wrote: > On Tuesday 19 May 2020 02:07, Sean Conner <sean at conman.org> wrote: > > > I thought about autodetection---Unicode is defined in blocks, where each > > alphabet becomes a defined block in Unicode. I then realized that there are > > multiple languages that use the European block. Sure, detecting Greek is > > easy since they have their own alphabet, but what about Spanish, French and > > German? They use the same alphabet. > > Autodetection is necessary for document using multiple languages. > Browser preference is fine for hard to detect case. >

I was thinking, maybe add a "lang" parameter to the end of MIME types, like how charset and boundary work. Then, if people want multiple languages per "document" (per response, I guess), they can use MIME multipart, and when the MIME type for each part is picked, they can put the language there. That doesn't really help with "inline" different languages.

Example gemini response:

20 multipart/mixed; boundary=longrandomthing
--longrandomthing
Content-Type: text/plain; lang=en-US

color. withOUT a 'u'. ha.
--longrandomthing
Content-Type: text/plain; lang=ja

[insert some japanese here]
--longrandomthing--

This document seems like it might come in handy: https://www.w3.org/International/articles/language-tags/
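For what it's worth, reading such a parameter on the client side would be cheap. The sketch below is only an illustration of the "lang" idea floated above: "lang" is the proposal from this message, not an existing MIME parameter, and `parse_meta` is a made-up name. It also assumes parameter values contain no quoted semicolons.

```python
from typing import Dict, Tuple


def parse_meta(meta: str) -> Tuple[str, Dict[str, str]]:
    """Split a META / Content-Type value into (media_type, params).

    Hypothetical helper; a simplistic parser that does not handle
    semicolons inside quoted parameter values.
    """
    media_type, *raw_params = meta.split(";")
    params: Dict[str, str] = {}
    for raw in raw_params:
        name, sep, value = raw.partition("=")
        if not sep:
            # Skip fragments that are not name=value pairs.
            continue
        params[name.strip().lower()] = value.strip().strip('"')
    return media_type.strip().lower(), params


if __name__ == "__main__":
    media_type, params = parse_meta("text/plain; lang=en-US")
    # Fall back to a user preference when no language is declared.
    print(media_type, params.get("lang", "(use browser preference)"))
```

A client could apply the same function to the response header and to each part's Content-Type line, falling back to the user's configured preference whenever "lang" is absent.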
---