The specs for Gemini and Gemtext are short and easy to read. However, there are some quirks that I think are easy to overlook. Additionally, there is what the specification says, and then there is what capsules actually do in Geminispace. This is a collection of strange things that have caused me problems.
For success responses, the MIME Type is optional. So while most responses look like this:
20 text/gemini\r\n
This is also valid:
20 \r\n
(NOTE: That is 20, followed by a space (0x20), followed by a CRLF.)
If you don't receive a MIME type, the assumed MIME type is "text/gemini".
For error responses (4x, 5x, 6x), the error message is optional. So this is valid:
40 \r\n
3x responses are redirects. While most capsules use absolute URLs, the spec allows both relative and absolute URLs. Some capsules send protocol-relative URLs (e.g. "//example.com/foo" instead of "gemini://example.com/foo"). Be prepared to handle all sorts of URLs. Some capsules will use a redirect to send you to a non-Gemini URL. This is allowed, and you should account for it.
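One way to normalize redirect targets is sketched below in Python. The function name `resolve_redirect` is hypothetical. Python's `urljoin` only resolves relative references for schemes it already knows about, and `gemini` is not one of them, so this sketch joins against an `http` stand-in and then restores the original scheme. Treat it as a starting point under those assumptions, not a complete URL handler.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(request_url: str, target: str) -> str:
    """Resolve the meta field of a 3x response against the requested URL."""
    target = target.strip()
    if urlsplit(target).scheme:
        # Already absolute. It may point at a non-Gemini scheme, so the
        # caller still has to decide whether to follow it.
        return target
    base = urlsplit(request_url)
    # Join using an http:// stand-in because urljoin() leaves relative
    # references untouched for schemes it does not recognize.
    stand_in = base._replace(scheme="http").geturl()
    joined = urlsplit(urljoin(stand_in, target))
    # Put the original scheme back on the resolved URL.
    return joined._replace(scheme=base.scheme).geturl()
```

A caller can then check `urlsplit(resolved).scheme` and only follow the redirect automatically when it is still `gemini`.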
Some older Gemini servers (such as bleyble.com) were built while the protocol was still in flux. These send a tab (0x09) between the status code and the meta field. Most clients like Lagrange seem to handle this OK.
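The header quirks above (an empty meta field, a missing MIME type on success, a tab instead of a space) can be handled in one lenient parser. This is a minimal sketch; `parse_header` and `HEADER_RE` are names made up for illustration, and accepting a bare status code with no separator at all is an extra leniency I am assuming here, not something the spec grants.

```python
import re

# Two ASCII digits, then optionally a run of spaces/tabs and the meta field.
HEADER_RE = re.compile(r"([0-9]{2})(?:[ \t]+(.*))?")


def parse_header(raw: bytes) -> tuple[int, str]:
    """Return (status, meta) from a raw response header line."""
    line = raw.decode("utf-8", errors="replace")
    if line.endswith("\r\n"):
        line = line[:-2]
    match = HEADER_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed response header: {line!r}")
    status = int(match.group(1))
    meta = (match.group(2) or "").strip(" \t")
    if 20 <= status <= 29 and not meta:
        # No MIME type on a success response: assume text/gemini.
        meta = "text/gemini"
    return status, meta
```

For 4x, 5x, and 6x responses the returned meta may simply be an empty string.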
Some servers will throw an error message if you include the port number as part of the URL. So for those, sending a request like this
gemini://example.com/
will work, but a request like this
gemini://example.com:1965/
throws an error. I have no idea why this happens. As such, I don't include the port number in the request URL when the port is the default Gemini port of 1965.
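Stripping the default port before sending could look roughly like this. `request_line` is a hypothetical helper; it only touches a literal `:1965` suffix and leaves every other port alone.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 1965


def request_line(url: str) -> bytes:
    """Build the request bytes, omitting the port when it is the default."""
    parts = urlsplit(url)
    suffix = f":{DEFAULT_PORT}"
    if parts.port == DEFAULT_PORT and parts.netloc.endswith(suffix):
        # Drop ":1965" so picky servers see the plain host name.
        parts = parts._replace(netloc=parts.netloc[: -len(suffix)])
    return (parts.geturl() + "\r\n").encode("utf-8")
```

The client still connects to port 1965 as usual; only the URL written in the request changes.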
The vast majority of content in Geminispace uses the MIME types text/gemini or text/plain. These are usually correct. Once you start getting into image/* MIME types, or other esoteric types, the given MIME type is often incorrect. Don't assume the file format from the MIME type. Use a real file format detection tool based on magic numbers, like the `file` command.
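To illustrate the idea of trusting magic numbers over the declared type, here is a toy sniffer. It knows only a handful of well-known signatures and is no substitute for a real detection library; `MAGIC_NUMBERS`, `sniff_mime`, and `effective_mime` are names invented for this sketch.

```python
from typing import Optional

# A few well-known file signatures (leading bytes) and their MIME types.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(body: bytes) -> Optional[str]:
    """Return a MIME type guessed from the leading bytes, or None."""
    for magic, mime in MAGIC_NUMBERS:
        if body.startswith(magic):
            return mime
    return None


def effective_mime(declared: str, body: bytes) -> str:
    """Prefer the sniffed type for non-text content, else keep the declared one."""
    if declared.startswith("text/"):
        # Declared text types are usually correct, so leave them alone.
        return declared
    return sniff_mime(body) or declared
```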
The Gemini spec recommends using the "lang=" attribute on the MIME type to specify a language. Most capsules don't do this. You will find content written in French or Russian that was not sent with a lang attribute.
How a page should specify that it contains content in multiple languages is not well defined. Some capsules send a comma-separated list.
The best way to determine the language of the content is to use a detection algorithm such as n-gram analysis. Note that this can fail on short text, or if run against preformatted sections.
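Before handing text to a language detector, it may help to drop the parts that tend to confuse it. The sketch below keeps only prose lines and refuses to return anything for very short documents. `prose_for_detection` is a hypothetical helper, and the 200-character threshold is an arbitrary assumption rather than a measured value.

```python
from typing import Optional


def prose_for_detection(gemtext: str, min_chars: int = 200) -> Optional[str]:
    """Return prose suitable for language detection, or None if too short."""
    in_preformatted = False
    kept = []
    for line in gemtext.splitlines():
        if line.startswith("```"):
            # Toggle lines open and close preformatted blocks.
            in_preformatted = not in_preformatted
            continue
        if in_preformatted or line.startswith("=>"):
            # Skip preformatted content and link lines (mostly URLs).
            continue
        # Strip heading, quote, and list markers from the start of the line.
        kept.append(line.lstrip("#>* \t"))
    prose = " ".join(part for part in kept if part)
    return prose if len(prose) >= min_chars else None
```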
Misconfigured capsules exist that will send "lang=" attributes on content that doesn't make sense, like "image/png;lang=en". This can break naive MIME type parsing code.
In Gemini, the default character encoding for all text/* MIME types is UTF-8. Most modern content is written in UTF-8, and most "text/gemini" content is in fact using UTF-8.
If content uses another encoding, the server is supposed to specify this via the "charset=" attribute on the MIME type. Many sites fail to do this. For example, there are numerous Gemini mirrors of Textfiles.com. These files date from the 1980s and use extended ASCII or other character sets which don't render properly if assumed to be UTF-8.
Automatic character encoding detection is a well-researched and difficult problem. There are no silver bullets, only trade-offs. This makes it very difficult to reliably parse or index text/plain files.
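One of the simpler trade-offs is a fixed fallback chain: try the declared charset, then strict UTF-8, then a legacy single-byte encoding. The sketch below assumes CP437 as that last resort, which is my guess for 1980s-era text files and not something the spec or any capsule promises. `decode_text` is a hypothetical name.

```python
from typing import Optional, Tuple


def decode_text(
    body: bytes, charset: Optional[str] = None, fallback: str = "cp437"
) -> Tuple[str, str]:
    """Decode body, returning (text, name of the encoding that was used)."""
    candidates = [charset] if charset else []
    candidates.append("utf-8")
    for name in candidates:
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # Last resort: a single-byte encoding, replacing anything it cannot map.
    return body.decode(fallback, errors="replace"), fallback
```

Returning the encoding name alongside the text lets an indexer record how much guessing was involved.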
Misconfigured capsules exist that will send charset attributes for content that doesn't make sense, like "image/png;charset=utf-8". This can break naive MIME type parsing code.
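A meta parser that tolerates the lang and charset quirks above might look like the following. It is a sketch under my own assumptions: `parse_meta` is a made-up name, parameters are split naively on ";" and "=", and lang and charset are simply ignored for anything that is not text/*.

```python
from typing import List, Optional, Tuple


def parse_meta(meta: str) -> Tuple[str, Optional[str], List[str]]:
    """Return (mime_type, charset, languages) from a success meta field."""
    pieces = [piece.strip() for piece in meta.split(";")]
    mime = pieces[0].lower() or "text/gemini"
    params = {}
    for piece in pieces[1:]:
        name, sep, value = piece.partition("=")
        if sep:
            params[name.strip().lower()] = value.strip().strip('"')
    if not mime.startswith("text/"):
        # lang/charset make no sense here (e.g. image/png;lang=en): ignore them.
        return mime, None, []
    charset = params.get("charset", "utf-8").lower()
    # Some capsules send several languages as a comma-separated list.
    langs = [tag.strip() for tag in params.get("lang", "").split(",") if tag.strip()]
    return mime, charset, langs
```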
Gemini specifies that only a subset of Robots.txt is valid. Specifically it does not support:
Many capsules will attempt to use these. That behavior is undefined.
Many capsules are TLS 1.3 only. Make sure you are using a TLS library that supports it.
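With Python's standard `ssl` module, a quick way to check for TLS 1.3 support and to build a context that can negotiate it is sketched below. The helper names are hypothetical, the TLS 1.2 floor is my assumption about a sensible minimum, and certificate verification is left at the library default, which you will likely need to revisit separately.

```python
import ssl


def tls13_available() -> bool:
    """Report whether the linked TLS library supports TLS 1.3."""
    return ssl.HAS_TLSv1_3


def make_context() -> ssl.SSLContext:
    """Create a client context that allows TLS 1.2 and newer."""
    context = ssl.create_default_context()
    # Set only a floor; leaving the maximum untouched lets the handshake
    # pick TLS 1.3 whenever the library and the server both support it.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```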
Block quote lines don't have to have a leading space:
>This is valid
> So is this
List item lines *DO* have to have a leading space:
Link lines don't have to have a leading space.
=>gemini://example.com/ this is totally valid
This might be the most poorly worded part of the spec. It defines whitespace as "any non-zero number of consecutive spaces or tabs", which would make you think that at least one whitespace character is required between `=>` and the URL. However, in the definition of a link line, the whitespace is enclosed in square brackets, about which the spec says "Square brackets indicate that the enclosed content is optional." So zero or more whitespace characters are allowed. Yet every example of a link line in the spec uses whitespace between the `=>` and the URL, which reinforces the idea that at least one character is required. Like I said, it's confusing! Just ensure your code can properly parse link lines with 0...N whitespace characters before the URL, since Gemtext using no whitespace exists out there. Also, unlike other specs that allow uncommon characters like vertical tabs to count as whitespace, Gemini only allows whitespace to be a space (0x20) or a tab (0x09). So:
=>\t\t\t\t\t\t\t\tgemini://example.com/ lots of leading tabs, still works as a link
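A link-line parser that accepts zero or more spaces or tabs before the URL, and treats nothing else as whitespace, can be sketched with a small regular expression. `LINK_RE` and `parse_link` are names made up for this example.

```python
import re
from typing import Optional, Tuple

# "=>", optional spaces/tabs, the URL, then optionally spaces/tabs and a label.
LINK_RE = re.compile(r"=>[ \t]*([^ \t]+)(?:[ \t]+(.*))?")


def parse_link(line: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (url, label) for a link line, or None if it is not one."""
    match = LINK_RE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    url = match.group(1)
    # An absent or blank label is reported as None.
    label = (match.group(2) or "").strip(" \t") or None
    return url, label
```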
Header lines don't have to have a space:
#Here is a valid header
## Here is a valid header as well
Header lines are not required to be in any order. While uncommon, you will find gemtext with out-of-order header lines like this:
hello
### first header, but at a depth of 3
blah
# now a "higher" header?
yep, stuff is
##crazy
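Putting the line-level rules together, a classifier might look like the sketch below: optional whitespace after `>` and `#`, a mandatory space after `*`, and no assumption about heading order. `classify` and `outline` are hypothetical helpers, and link lines are only tagged here, not split into URL and label.

```python
import re

# One to three "#" characters, optional spaces/tabs, then the heading text.
HEADING_RE = re.compile(r"(#{1,3})[ \t]*(.*)")


def classify(line):
    """Return (kind, text) for one gemtext line outside preformatted blocks."""
    if line.startswith("=>"):
        return "link", line[2:].strip(" \t")
    heading = HEADING_RE.fullmatch(line)
    if heading:
        return "h%d" % len(heading.group(1)), heading.group(2).strip(" \t")
    if line.startswith("* "):
        # List items need the space; "*text" stays a plain text line.
        return "list", line[2:]
    if line.startswith(">"):
        return "quote", line[1:].lstrip(" \t")
    return "text", line


def outline(gemtext):
    """Collect (level, text) headings in document order, whatever that is."""
    headings = []
    in_preformatted = False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        kind, text = classify(line)
        if kind in ("h1", "h2", "h3"):
            headings.append((int(kind[1]), text))
    return headings
```

Because `outline` returns levels as found, any table-of-contents code built on top of it has to cope with a level 3 heading appearing before a level 1 heading.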