Today I have published an updated version of both Gemini specification documents. The current versions are designated 0.24.1. A full description of all changes follows. A total of six open issues on the GitLab repositories can be closed as a result of these changes, reducing the number of open issues from 31 to 25.
I consider all of these changes very minor, either addressing rare edge cases or fixing ambiguities and inconsistencies. It is possible that some software or Gemtext document files may need to be updated to achieve/maintain strict compliance, but practically speaking failure to do so is extremely unlikely to cause actual real world problems. I do not think there is any risk of these changes "breaking" existing software in the same way that the 0.24.0 changes accidentally did.
The Universal Character Set includes a character (U+FEFF) known as the Byte Order Mark (hereafter BOM). This is a non-visible character which can be included at the beginning of a Unicode string as a way to simultaneously signal to handling software that the string uses a Unicode encoding, to specify the particular encoding used (e.g. UTF-8 vs UTF-16), and in the case of encodings where differences in byte ordering conventions can cause ambiguity, to disambiguate the byte ordering.
The specification now says that the BOM character MUST NOT be used in Gemini response headers. Those headers MUST be UTF-8 encoded (this is now made explicit in the text, previously it was only conveyed via the ABNF), and UTF-8's design is self-disambiguating with regard to byte ordering, so none of the purposes of the BOM are necessary here.
The specification now also says that the BOM character SHOULD NOT be used in the body of Unicode-encoded Gemini responses, with the encoding and byte-ordering instead being conveyed via the "charset" parameter in the response header.
These changes are consistent with the following recommendations from RFC3629 (see Section 6):
A protocol SHOULD forbid use of U+FEFF as a signature for those textual protocol elements that the protocol mandates to be always UTF-8, the signature function being totally useless in those cases.
and
A protocol SHOULD also forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly.
The reason that the specification says BOMs "SHOULD NOT" rather than "MUST NOT" be used in response bodies is so that server authors are not required to have their servers modify user-uploaded files (which might begin with BOMs unbeknownst to a naive user) in order to ensure their servers are compliant.
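As a rough illustration of the two BOM rules above, here is a minimal Python sketch of how a client might check for them. The function names are hypothetical, and silently dropping a leading BOM from a body is an assumption about lenient client behaviour, not something the specification prescribes.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF as a decoded character


def header_starts_with_bom(header: bytes) -> bool:
    """Return True if a response header begins with a UTF-8 encoded BOM.

    Headers are always UTF-8, so a BOM here is a violation of MUST NOT.
    """
    return header.startswith(codecs.BOM_UTF8)


def decode_body(body: bytes, charset: str = "utf-8") -> str:
    """Decode a response body using the charset from the response header.

    Assumption: a lenient client tolerates a leading BOM (SHOULD NOT,
    rather than MUST NOT) and drops it before rendering.
    """
    text = body.decode(charset)
    if text.startswith(BOM_CHAR):
        text = text[1:]
    return text
```

A server that relays user-uploaded files unmodified stays within the rules under this reading, while a client built like this still renders such files cleanly.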
This change allows the closing of GitLab issue P36, opened by Jaakko Keränen (aka skyjake). Thanks, skyjake!
The specification is now more careful when discussing the use of percent encoding in the context of user-supplied input in response to a status code beginning with 1: percent encoding is required in the request URL when user input is sent to the server, but not in the input prompt sent with the originating response header.
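The distinction can be sketched in a few lines of Python. This is only an illustration: the function name and the example URL are hypothetical, and the sketch assumes the user input becomes the whole query component of the URL that returned the 1x status.

```python
from urllib.parse import quote


def input_request_url(base_url: str, user_input: str) -> str:
    """Build the follow-up request URL carrying the user's input.

    The input is percent-encoded (spaces become %20, not "+").
    The prompt the server sent in its 1x response header is plain
    UTF-8 text and is shown to the user as-is, without decoding.
    """
    return f"{base_url}?{quote(user_input, safe='')}"


# Example: the server replied "10 Enter a search term" to this URL.
url = input_request_url("gemini://example.org/search", "hello world?")
```

Here `url` ends up as `gemini://example.org/search?hello%20world%3F`, while the prompt text "Enter a search term" was never percent-encoded in either direction.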
There was no GitLab issue about this but I was informed of the ambiguity via an email. Thank you to its author!
The specification is now a little more explicit on the matter of clients reporting errors to users, which is especially important in light of the change made in specification 0.24.0 where error messages for status codes 4x and 5x were made optional. The optional error message SHOULD be displayed when it is present, but whether it is or is not, clients SHOULD also display a localized indication of the nature of the error on the basis of the status code alone.
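One way a client might combine the two sources of information is sketched below in Python. The message strings, the table contents and the function name are all illustrative assumptions; a real client would substitute properly localized text and could cover more status codes.

```python
# Illustrative English strings; a real client would localize these.
SPECIFIC_MESSAGES = {
    44: "The server asked you to slow down",
    51: "Not found",
}
GENERIC_MESSAGES = {
    4: "Temporary failure",
    5: "Permanent failure",
}


def describe_error(status: int, message: str = "") -> str:
    """Describe a 4x/5x response from the status code alone, then
    append the server's optional error message when one was sent."""
    text = SPECIFIC_MESSAGES.get(status)
    if text is None:
        # Fall back to the first digit for codes without a specific string.
        text = GENERIC_MESSAGES.get(status // 10, "Unknown error")
    text = f"{text} ({status})"
    if message:
        text = f"{text}: {message}"
    return text
```

Because the client-side description never depends on the optional message, the user still gets a meaningful report when the server sends a bare status code.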
While I was making this clarification, I corrected the previously erroneous count of the number of permanent error codes from five to four. I also made minor changes throughout the document to use the terminology "status code" and "response header" more consistently - previously there was occasional use of "response code" as well.
This change allows the closing of GitLab issue P15, opened by Sean Conner. Thanks, Sean!
The specification previously stated that "clients MUST include hostname information when making requests for URLs where the authority section is a hostname" while remaining silent on what clients should do when requesting URLs where the authority section is an IP address instead (note that the specification says this SHOULD NOT be done, but not that it MUST NOT, so the situation may arise). The correct action in this situation was not necessarily obvious, as a request could be made without including an SNI extension in the Client Hello message at all, or with an empty extension. The specification is now explicit that in this case no SNI should be sent.
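A client could implement this decision with a small helper like the Python sketch below. The function name is hypothetical, and the handling of bracketed IPv6 literals assumes the host has been taken straight from the URL's authority section.

```python
import ipaddress
from typing import Optional


def sni_name(host: str) -> Optional[str]:
    """Return the hostname to send via SNI, or None to send no SNI.

    `host` is the host part of the URL authority, e.g. "example.org",
    "192.0.2.7" or "[2001:db8::1]".
    """
    candidate = host
    if host.startswith("[") and host.endswith("]"):
        candidate = host[1:-1]  # strip the brackets around an IPv6 literal
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return host  # not an IP address, so it is a hostname: send it
    return None  # IP address: omit the SNI extension entirely
```

With Python's `ssl` module the result can be passed as `server_hostname` to `SSLContext.wrap_socket`; passing `None` omits the extension, which requires `check_hostname` to be disabled on the context.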
This change allows the closing of GitLab issue P33, opened by nervuri and containing important information from Makeworld. Thank you both!
The specification of the `lang` parameter for the `text/gemini` media type specifies that valid values are "comma-separated lists of one or more language tags as defined in [BCP47]", and provides six examples. One example specifies a document in a mixture of French and English. The example incorrectly did not enclose the `en,fr` in quotation marks, as required by RFC2045. The example was corrected and the text before it explicitly clarifies that these quotation marks are required whenever commas are used.
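For server authors, the quoting rule amounts to something like the following Python sketch. The function name is hypothetical, and it always emits a `lang` parameter, which a real server need not do.

```python
from typing import Sequence


def gemtext_media_type(langs: Sequence[str]) -> str:
    """Build a text/gemini media type string with a lang parameter.

    A value containing a comma is not a valid bare token under
    RFC2045, so it has to be sent as a quoted string.
    """
    value = ",".join(langs)
    if "," in value:
        value = f'"{value}"'
    return f"text/gemini; lang={value}"
```

For a single language this yields `text/gemini; lang=en`, and for the mixed French and English case it yields `text/gemini; lang="en,fr"`.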
This change allows the closing of GitLab issue G01. This defect was originally reported by makeworld on the Gemini mailing list. Thanks, makeworld!
At the same time I added a single extra sentence to the end of the section describing the media type parameters clarifying that "charset" and "lang" are the only ones defined and that clients MUST ignore any others. This was not strictly necessary, as it is not and never has been true that MIME media types are big trucks you can just dump parameters on whenever you feel like it; each type has a registered set of defined parameters. But it's not uncommon for people to think otherwise and I have had this pointed out more than once as a place where Gemini can be illicitly extended, so it can't hurt to be explicit about this.
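On the client side, ignoring undefined parameters could look like the Python sketch below. It leans on the standard library's `email.message.Message` to parse the parameter list, including quoted values; the function name and the `foo` parameter in the example are made up for illustration.

```python
from email.message import Message
from typing import Dict

DEFINED_PARAMS = {"charset", "lang"}


def gemtext_params(media_type: str) -> Dict[str, str]:
    """Return only the parameters defined for text/gemini.

    Any other parameter in the response header is silently dropped.
    """
    msg = Message()
    msg["Content-Type"] = media_type
    params = msg.get_params() or []
    # The first entry is the media type itself, so skip it.
    return {
        name.lower(): value
        for name, value in params[1:]
        if name.lower() in DEFINED_PARAMS
    }


params = gemtext_params('text/gemini; charset=utf-8; lang="en,fr"; foo=bar')
```

In this example `params` contains `charset` and `lang` (with the quotation marks removed from `en,fr`) and nothing else.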
As first mentioned in an earlier news post, the first ABNF specification for gemtext had a bug in it where a "gemtext-document" was defined as consisting of either zero or one "gemtext-line"s when the intent was, of course, one or more. This has now been fixed (much later than I said it would be).
2024-04-19 news post reporting "zero or one" line bug
When I announced in that news post that it should have been "one or more", I was asked via email whether or not it ought to really be "zero or more". It seems a little odd to me to define an empty document as a valid Gemtext document. I checked the RFCs for other simple line-oriented text formats, such as Comma-Separated Values (RFC4180), and they typically seem to define their files as being necessarily non-empty. Works for me.
Relatedly, the discussion in the now closed GitLab issue G05 contained a question by John Cowan regarding the validity of a document whose final line does not end with [CR]LF. As the ABNF currently stands, such a document is technically ill-formed, but it would of course be absurd for a client to refuse to render it. This is perhaps not an ideal situation, but I didn't want to rewrite the ABNF to be less clear just to address this detail. In reading RFC4180, I noticed that the CSV ABNF defines a CSV file as follows:
file = [header CRLF] record *(CRLF record) [CRLF]
Here it is clear that every record after the first is separated from the one before it by a CRLF, but the CRLF after the final record is explicitly optional, and files meet the definition with or without it. This is actually kind of nice and is easy to read, so why not define Gemtext this way?
Because empty text lines are valid (and widely used) in Gemtext documents. If a formulation like the above were used, it would be ambiguous whether a document ending with a CRLF also included an empty text line after it, one which simply lacked the optional final newline. Since empty text lines are supposed to be rendered individually each time they occur, this ambiguity actually has consequences. Absolutely trivial consequences, it's true, but the problem of documents without final newlines being ill-formed is trivial too.
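The ambiguity is easy to see in a small Python sketch that splits the same document under both readings. Both function names are hypothetical, and accepting a missing final CRLF in the first function is an assumption about how a lenient client would behave rather than something the ABNF permits.

```python
from typing import List


def lines_terminated(doc: str) -> List[str]:
    """Every line ends with CRLF (the reading used by the Gemtext ABNF).

    Leniently, a final line with no CRLF is still accepted.
    """
    parts = doc.split("\r\n")
    if parts[-1] == "":
        parts.pop()  # the final CRLF terminated the last line
    return parts


def lines_separated(doc: str) -> List[str]:
    """CRLF only separates lines (the CSV-style reading)."""
    return doc.split("\r\n")


doc = "Hello\r\n"
terminated = lines_terminated(doc)  # ["Hello"]
separated = lines_separated(doc)    # ["Hello", ""]
```

Under the separator reading the same bytes could also be one line with the optional trailing CRLF, so a renderer cannot tell whether to draw an empty line after "Hello".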
GitLab issue G06 (opened by Felix Queißner) specifically asked for clarity on issues like this, regarding how to interpret various kinds of seemingly "empty" Gemtext documents. If I'd used a CSV-style ABNF formulation to enable giving a more satisfying response to G05, I would have then been in a position where I couldn't give a good response to G06 at all. Damned if you do, damned if you don't. I consider the solution used by the latest specification to be the slightly less damned choice, because I believe it provides clear and explicit answers to every one of Felix's edge cases. I also suspect that documents which do not end in newlines are in the minority, although I am awaiting hard data on that.
I also made minor changes to wording regarding "blank lines" (now called "empty lines") and clarified that Gemtext is specified in its "canonical form" (and therefore uses CRLF everywhere).
All these changes together allow the closing of the GitLab issue G06 mentioned above and also G14 (opened by Philip Linde and asking "what constitutes a line?"). Thanks, Felix and Philip!
Previously the ABNF specification of Gemtext was written such that link lines may only use the space character, not space and/or tab, to separate link "names" from the URL. This was in contrast with the written specification of link lines, which allowed both. Both are naturally valid and widely used, and so the ABNF was changed to reflect this.
The ABNF definition of text lines was also changed to allow the use of tabs in text.
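A link line parser that accepts spaces and tabs interchangeably might look like the Python sketch below. The regular expression is an approximation written for illustration, not a translation of the ABNF, and it assumes whitespace after "=>" is optional.

```python
import re
from typing import Optional, Tuple

# "=>", optional spaces/tabs, a URL with no spaces or tabs in it,
# then optionally spaces/tabs followed by the link name.
LINK_LINE = re.compile(r"^=>[ \t]*(?P<url>[^ \t]+)(?:[ \t]+(?P<name>.*))?$")


def parse_link_line(line: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (url, name) for a link line, or None for any other line.

    `line` must not include its CRLF terminator. `name` is None when
    the link line carries no name.
    """
    match = LINK_LINE.match(line)
    if match is None:
        return None
    return match.group("url"), match.group("name") or None
```

Both `=> gemini://example.org/ Example` and the same line with tabs in place of the spaces parse to the same URL and name.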
These changes are relevant to GitLab issue G15 (opened by nervuri) which asks about clarity and consistency on whitespace. I am leaving that issue open, because I do not consider that I have yet explicitly addressed the question of whether or not the mandatory whitespace in list item lines should be similarly loosened. I am actually seriously considering reversing the decision to make that space mandatory. Thus I mentally consider part of G15 and the entirety of G10 (opened by cage21 and asking for mandatory whitespace after the "=>" in link lines) to now constitute a single logical issue which we might title something like "just make whitespace requirements entirely consistent across all line types already!". I will close them both once I've made a decision on this, which I will do before 0.24.2 is released. I have asked Acidus, who operates both a search engine and a Wayback-style archive for Geminispace, for some hard data to help inform these choices.
As absurd as it may sound, the ABNF previously did not allow non-ASCII characters in text lines. Now it does.
I am slightly uncomfortable that the ABNF formulation is written at the level of UTF-8. That's fine for the network specification, where response headers are explicitly UTF-8, but Gemtext documents can in principle be legally encoded in whatever you like. I guess the ABNF should instead be written at the level of the Universal Character Set, rather than any specific encoding of it, but that's a non-trivial amount of work. I do not know whether or not doing this would implicitly forbid certain text encodings which cannot be mapped to UCS (or whether such encodings even exist). I am thus considering updating the specification to just make UTF-8 encoding of Gemtext mandatory (assuming media types which are subtypes of "text" are even allowed to do this, which I'm currently unclear on), as this would solve the problem with a minimum of work. Again I have requested hard data on the prevalence of non-UTF-8-encoded text/gemini content in current Geminispace to inform this decision.