The specs for Gemini and Gemtext are short and easy to read. However, there are some quirks that I think are easy to overlook. Additionally, there is what the specification says, and then there is what capsules actually do in Geminispace. This is a collection of strange things that have caused me problems as the author of a crawler, search engine, and archive.
For success responses, the MIME Type is optional. So while most responses look like this:
20 text/gemini\r\n
This is also valid:
20 \r\n
(NOTE: That is 20, followed by a space (0x20), followed by a CRLF.)
If you don't receive a mimetype, the assumed MIME type is "text/gemini".
For error responses (4x, 5x, 6x), the error message is optional. So this is a valid error response:
40 \r\n
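To make the two cases above concrete, here is a minimal sketch in Python of filling in the missing meta field. The function name `parse_header` and the constant `DEFAULT_MIME` are my own names for illustration, and the sketch assumes the status and meta are separated by a single space.

```python
DEFAULT_MIME = "text/gemini"  # assumed when a 2x response carries no MIME type


def parse_header(line: str):
    """Return (status, meta) for a response line such as '20 text/gemini\\r\\n'.

    Hypothetical helper: assumes a single space separates status and meta.
    """
    line = line.rstrip("\r\n")
    status, _, meta = line.partition(" ")
    if len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError(f"bad status in header: {line!r}")
    meta = meta.strip()
    # An empty meta on a success response means text/gemini.
    if status[0] == "2" and not meta:
        meta = DEFAULT_MIME
    # For 4x/5x/6x an empty meta is simply an error with no message.
    return int(status), meta
```

With this, `parse_header("20 \r\n")` and `parse_header("20 text/gemini\r\n")` give the same result, and `parse_header("40 \r\n")` returns an empty message instead of failing.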
3x responses are redirects. While most capsules use absolute URLs, the spec allows both relative and absolute URLs. Some capsules send protocol-relative URLs (e.g. "//example.com/foo" instead of "gemini://example.com/foo"). Be prepared to handle all sorts of URLs. Some capsules will use a redirect to send you to a non-Gemini URL. This is allowed, and you should account for it.
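Here is a rough sketch of resolving a redirect target against the URL you requested, using Python's standard library. One thing to watch: as far as I can tell, `urllib.parse.urljoin` only resolves relative references for schemes it knows about, and `gemini` is not one of them, so this sketch borrows `http` for the join and then puts the original scheme back. The function name `resolve_redirect` is made up for this example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def resolve_redirect(base: str, target: str) -> str:
    """Resolve a 3x redirect target against the URL that was requested."""
    if urlsplit(target).scheme:
        # Already absolute. It may be a non-Gemini URL; the caller decides.
        return target
    b = urlsplit(base)
    # Join under a scheme urljoin understands, then restore the real one.
    stand_in = urlunsplit(("http", b.netloc, b.path, b.query, b.fragment))
    j = urlsplit(urljoin(stand_in, target))
    return urlunsplit((b.scheme, j.netloc, j.path, j.query, j.fragment))
```

The caller still needs to check the scheme of the result before following it, since a redirect to `https://` or anything else comes back unchanged.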
Some older Gemini servers (such as bleyble.com) were built while the Gemini protocol was still in flux. These send a tab (0x09) between the status code and the meta field in a response line. Most clients like Lagrange seem to handle this OK. You should handle it as well.
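A tolerant way to split the response line is to accept any run of spaces or tabs after the status code. This is only a sketch; the regular expression and the name `split_status_meta` are mine, not something from the spec.

```python
import re

# Two-digit status, then optionally: spaces and/or tabs followed by the meta.
HEADER_RE = re.compile(r"^([1-6][0-9])(?:[ \t]+(.*))?$")


def split_status_meta(line: str):
    """Return (status, meta), accepting a tab where the spec wants a space."""
    m = HEADER_RE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"unparseable response line: {line!r}")
    return int(m.group(1)), (m.group(2) or "").strip()
```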
Some servers will throw an error message if you include the port number as part of the URL. So for those, sending a request like this
gemini://example.com/
will work, but a request like this
gemini://example.com:1965/
throws an error. I have no idea why this happens. It is probably a bug in a commonly used server. While yes, that should be fixed, if you want to access those systems you need to work around it. As such I don't send a port number as part of the URL in the Gemini request if the port is the default Gemini port of 1965.
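A sketch of that workaround in Python, dropping the port only when it is the default. The function name `strip_default_port` is hypothetical, and the sketch assumes the URL has no userinfo part.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_GEMINI_PORT = 1965


def strip_default_port(url: str) -> str:
    """Remove an explicit ':1965' from a gemini:// URL before sending it."""
    parts = urlsplit(url)
    if parts.scheme != "gemini" or parts.port != DEFAULT_GEMINI_PORT:
        return url
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets back.
        host = f"[{host}]"
    return urlunsplit(parts._replace(netloc=host))
```

You still connect to port 1965 on the socket; only the URL in the request line changes.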
The vast majority of content in Gemini space uses the MIME types text/gemini or text/plain. These are usually correct. Once you start getting into image/* MIME types, or other esoteric types, the given MIME type is often incorrect. Don't assume file format from MIME type. Use a real file format parsing library, based on magic numbers, like the `file` command.
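To illustrate the idea of trusting magic numbers over the declared type, here is a tiny Python sketch. The table covers only a handful of well-known signatures and is a stand-in for a real detection library, not a replacement for one; `sniff_mime` is a name I made up.

```python
# A few well-known file signatures. A real library knows hundreds.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(body: bytes, declared: str) -> str:
    """Prefer the type implied by the leading bytes over the declared one."""
    for signature, mime in MAGIC:
        if body.startswith(signature):
            return mime
    # Nothing recognised: fall back to what the server claimed.
    return declared
```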
The Gemini spec recommends using the "lang=" parameter on the MIME type to specify a language. Most capsules don't do this. You will find content written in French or Russian, that was not sent with a lang parameter.
How a page should specify that it contains content in multiple languages is not well defined. Some capsules send a comma-separated list of languages like "text/gemini;lang=de,en". The comma character is actually not allowed inside the value of a MIME type parameter, so this is invalid, and many MIME parsing libraries will throw an error.
The best way to know the language of the content is to use a detection algorithm like ngrams. Note that this can fail on short text, or if run against preformatted sections.
Misconfigured capsules exist that will send "lang=" parameters on content that doesn't make sense, like "image/png;lang=en". This can break naive MIME type parsing code.
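Since strict MIME libraries can choke on these values, one option is a small forgiving parser of your own. The sketch below is an assumption about what "forgiving" should mean: it accepts comma-separated lang values and quietly drops lang and charset on non-text types. The name `parse_mime` is hypothetical, and quoted parameter values containing ";" are not handled.

```python
def parse_mime(meta: str):
    """Return (mime_type, langs, charset) without raising on odd parameters."""
    pieces = [p.strip() for p in meta.split(";")]
    mime_type = pieces[0].lower() or "text/gemini"
    params = {}
    for piece in pieces[1:]:
        key, sep, value = piece.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip().strip('"')
    # "lang=de,en" is technically invalid, but it exists, so split on commas.
    langs = [tag.strip() for tag in params.get("lang", "").split(",") if tag.strip()]
    charset = params.get("charset", "").lower() or None
    if not mime_type.startswith("text/"):
        # lang and charset make no sense on e.g. image/png; ignore them.
        langs, charset = [], None
    return mime_type, langs, charset
```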
In Gemini, the default character encoding for all text/* MIME types is UTF-8. Most modern content is written in UTF-8, and most "text/gemini" content is in fact using UTF-8.
If content uses another encoding the server is supposed to specify this via the "charset=" parameter on the MIME type. Many sites fail to do this. For example, there are numerous Gemini mirrors of Textfiles.com. These files date from the 1980s and use extended ASCII or other character sets which don't render properly if assumed to be UTF-8.
Automatic charset detection is a well-researched and difficult problem. There are no silver bullets, only trade-offs. This makes it very difficult to reliably parse or index text/plain files.
Misconfigured capsules exist that will send charset parameters for content that doesn't make sense, like "image/png;charset=utf-8". This can break naive MIME type parsing code.
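For text that arrives without a usable charset, one cheap trade-off is a fixed fallback chain instead of real detection. The sketch below tries the declared charset, then UTF-8, then a single-byte legacy code page. The choice of cp437 as the last resort is purely my assumption (it suits old DOS-era text files); pick whatever matches the content you crawl. `decode_text` is a made-up name.

```python
def decode_text(body: bytes, declared=None, fallback="cp437") -> str:
    """Decode bytes using the declared charset, then UTF-8, then a fallback."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return body.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that don't fit it: try the next.
            continue
    # Last resort: never raise, even if the guess is wrong.
    return body.decode(fallback, errors="replace")
```

This never fails, but it can silently produce the wrong characters, which is exactly the trade-off described above.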
Block quote lines don't have to have a leading space:
>This is valid
> So is this
List item lines *DO* have to have a leading space:
Header lines don't have to have a space:
#Here is a valid header
## Here is a valid header as well
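Putting the quote, list, and header rules together, a line classifier might look like the sketch below. It is a simplification: it ignores preformatted toggles and link lines, and the names `classify_line` and `HEADING_RE` are mine.

```python
import re

# One to three '#', then optional spaces/tabs, then the heading text.
HEADING_RE = re.compile(r"^(#{1,3})[ \t]*(.*)$")


def classify_line(line: str):
    """Return (kind, text) for one Gemtext line outside a preformatted block."""
    m = HEADING_RE.match(line)
    if m:
        return f"heading{len(m.group(1))}", m.group(2)
    if line.startswith("* "):
        # List items need the space; "*foo" is ordinary text.
        return "list", line[2:]
    if line.startswith(">"):
        # Quote lines do not need a space after '>'.
        return "quote", line[1:].lstrip(" \t")
    return "text", line
```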
Link lines don't have to have a leading space.
=>gemini://example.com/ this is totally valid
This might be the most poorly worded part of the spec. It defines whitespace as "any non-zero number of consecutive spaces or tabs", which would make you think that at least 1 whitespace character is required between `=>` and the URL. However, in the definition of a link line the whitespace is enclosed in square brackets, about which the spec says "Square brackets indicate that the enclosed content is optional." So zero or more whitespace characters are allowed. Yet every example of a link line in the spec uses whitespace between the `=>` and the URL, which reinforces the idea that at least 1 character is required. Like I said, it's confusing! Just ensure your code can properly parse link lines with 0 to N whitespace characters before the URL, since Gemtext with no whitespace exists out there. Also, unlike other specs that allow uncommon characters like vertical tabs to count as whitespace, Gemini only allows whitespace to be a space (0x20) or a tab (0x09). So:
=>\t\t\t\t\t\t\t\tgemini://example.com/ lots of leading tabs, still works as a link
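A sketch of a link-line parser that accepts zero or more spaces or tabs before the URL. The regular expression is my own reading of the rules above, and `parse_link_line` is a hypothetical name. It expects the line ending to be stripped already.

```python
import re

# '=>', optional spaces/tabs, the URL, then optionally whitespace and a label.
LINK_RE = re.compile(r"^=>[ \t]*([^ \t]+)(?:[ \t]+(.*))?$")


def parse_link_line(line: str):
    """Return (url, label) for a link line, or None if it is not one."""
    m = LINK_RE.match(line)
    if m is None:
        return None
    return m.group(1), (m.group(2) or "").strip()
```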
Header lines are not required to be in any order. While uncommon, you will find gemtext with out-of-order header lines like this:
hello
### first header, but at a depth of 3
blah
# now a "higher" header?
yep, stuff is
##crazy
A tab is a single character, ASCII code 9. Text programs are free to decide how to render a tab. Historically, on computers, tabs were rendered as 8 whitespace characters (ASCII code 32), but more modern programs allow you to change this, or default to another number of characters like 4. As such you can't depend on how tabs will be rendered in preformatted text blocks. Something may look fine in your text editor, but be rendered differently:
Amfora using 4 characters when rendering a tab
Lagrange using 8 characters when rendering a tab
cat using 8 characters when rendering a tab
TextMate using 4 characters when rendering a tab
If you leave tabs in your preformatted text, it isn't really preformatted anymore, since you don't know how it will be rendered. Just use spaces instead.
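You can see the problem, and fix it before publishing, with Python's built-in `str.expandtabs`. The helper name `detab` and the sample row are just for illustration.

```python
def detab(line: str, tab_width: int = 8) -> str:
    """Replace tabs with spaces, padding to the next multiple of tab_width."""
    return line.expandtabs(tab_width)


row = "id\tname"
as_4 = detab(row, 4)  # how a client with 4-column tabs would show it
as_8 = detab(row, 8)  # how a client with 8-column tabs would show it
```

The two results differ in width, so any column alignment that depends on the tab breaks in one client or the other. Running your preformatted blocks through something like `detab` before publishing pins the layout down.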
Gemini specifies that only a subset of Robots.txt is valid. Specifically it does not support:
Many capsules will attempt to use these. That behavior is undefined.
Many capsules are TLS 1.3 only. Make sure you are using a TLS library that supports it.
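If you use Python, the standard `ssl` module can tell you whether the OpenSSL it was built against supports TLS 1.3. Below is a sketch of a client context. Two assumptions of mine are baked in: certificate verification is turned off because this sketch expects you to handle trust yourself (for example trust-on-first-use), and the minimum version is set to TLS 1.2. Both function names are hypothetical.

```python
import ssl


def tls13_available() -> bool:
    """True if this Python's OpenSSL build can negotiate TLS 1.3."""
    return ssl.HAS_TLSv1_3


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context suitable for talking to capsules."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Assumption: trust is handled elsewhere, so skip CA validation here.
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # No maximum is set, so TLS 1.3 is negotiated whenever both sides support it.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

If `tls13_available()` returns False, TLS 1.3-only capsules will fail at the handshake no matter what your code does.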
The spec says that Gemini servers should send the TLS close_notify alert when closing the connection, to tell the client that all data has been sent. This revealed bugs in many implementations and wrappers of various TLS libraries, not all of which have been fixed. As such, some capsules still do not properly do this. Client libraries should be tolerant.
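Here is a sketch of a tolerant read loop in Python. It keeps whatever data arrived and just records whether the connection ended cleanly. It assumes the socket was wrapped with `suppress_ragged_eofs=False`; with Python's default of True, `recv` returns an empty result on a missing close_notify and you cannot tell the two cases apart. The name `read_body` is mine.

```python
import ssl


def read_body(conn, chunk_size: int = 4096):
    """Read until EOF. Returns (data, clean_close).

    `conn` is anything with a recv(n) method, e.g. an ssl.SSLSocket.
    """
    chunks = []
    clean_close = True
    while True:
        try:
            chunk = conn.recv(chunk_size)
        except ssl.SSLEOFError:
            # Peer closed without close_notify: keep the data, note the fact.
            clean_close = False
            break
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks), clean_close
```

A crawler can use the `clean_close` flag to mark a response as possibly truncated rather than throwing it away.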
Detecting this is sometimes challenging, depending on the tools you have. @mozz's HTTP-to-Gemini gateway will show the state of close_notify in the certificate view, and can be used to test: