Oddities with Gemini, Gemtext, and Geminispace

The specs for Gemini and Gemtext are short and easy to read. However, there are some quirks that I think are easy to overlook. Additionally, there is what the specification says, and then there is what capsules actually do in Geminispace. This is a collection of strange things that have caused me problems.

Gemini Protocol

For success responses, the MIME Type is optional. So while most responses look like this:

20 text/gemini\r\n

This is also valid:

20 \r\n

(NOTE: That is 20, followed by a space (0x20), followed by a CRLF.)

If you don't receive a MIME type, the assumed MIME type is "text/gemini".

For error responses (4x, 5x, 6x), the error message is optional. So this is valid:

40 \r\n
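
To make the optional-meta cases above concrete, here is a minimal sketch of a tolerant response header parser. The function name parse_response_header is my own invention for illustration, and the leniency about a missing space is an assumption about what you want from a crawler, not something the spec asks for.

```python
def parse_response_header(raw: bytes) -> tuple[int, str]:
    """Split a Gemini response header into (status, meta).

    Hypothetical helper: tolerates an empty meta field for any status.
    """
    line = raw.decode("utf-8")
    if line.endswith("\r\n"):
        line = line[:-2]
    # Everything before the first space is the status, the rest is meta.
    status_text, _, meta = line.partition(" ")
    if len(status_text) != 2 or not (status_text.isascii() and status_text.isdigit()):
        raise ValueError(f"bad status code: {status_text!r}")
    status = int(status_text)
    meta = meta.strip()
    # An empty meta on a success response means the default MIME type.
    if 20 <= status <= 29 and not meta:
        meta = "text/gemini"
    return status, meta
```

With this, "20 \r\n" and "20 text/gemini\r\n" come back as the same (20, "text/gemini") pair, and "40 \r\n" comes back as (40, "") rather than raising.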

3x responses are redirects. While most capsules use absolute URLs, the spec allows both relative and absolute URLs. Some capsules send protocol-relative URLs (e.g. "//example.com/foo" instead of "gemini://example.com/foo").
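
One wrinkle if you are working in Python: urllib.parse.urljoin only resolves relative references for schemes it already knows about, and "gemini" is not one of them. The sketch below works around that by resolving against a temporary "http" copy of the request URL and then putting the original scheme back. The name resolve_redirect is hypothetical, and this is one possible workaround rather than the only one.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def resolve_redirect(request_url: str, target: str) -> str:
    """Resolve a 3x redirect target against the URL that was requested."""
    # Absolute targets (anything with its own scheme) are used as-is.
    if urlsplit(target).scheme:
        return target
    base = urlsplit(request_url)
    # urljoin ignores the base for unknown schemes, so borrow "http".
    http_base = urlunsplit(("http", base.netloc, base.path, base.query, ""))
    joined = urlsplit(urljoin(http_base, target))
    # Restore the scheme of the original request.
    return urlunsplit(
        (base.scheme, joined.netloc, joined.path, joined.query, joined.fragment)
    )
```

This handles plain relative paths, root-relative paths, and the protocol-relative "//example.com/foo" form the same way.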

MIME types

The vast majority of content in Geminispace uses the MIME types text/gemini or text/plain. These are usually correct. Once you start getting into image/* MIME types, or other esoteric types, the given MIME type is often incorrect. Don't assume the file format from the MIME type. Use a real file format detection library that is based on magic numbers, like the `file` command.
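
As a rough illustration of the magic number idea, here is a tiny sniffer for three common image formats. It is a sketch only: the function name and the table are mine, and a real detection library covers far more formats than this.

```python
# Well-known leading byte signatures for a few image formats.
MAGIC_NUMBERS = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_image_type(data: bytes):
    """Return a MIME type guessed from the leading bytes, or None."""
    for magic, mime_type in MAGIC_NUMBERS:
        if data.startswith(magic):
            return mime_type
    return None
```

If the sniffed type disagrees with what the capsule sent, I would trust the sniffed type.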

Language attributes

The Gemini spec recommends using the "lang=" attribute on the MIME type to specify a language. Most capsules don't do this. You will find content written in French or Russian that was sent without a lang attribute.

How a page should specify that it contains content in multiple languages is not well defined. Some capsules send a comma-separated list.

The best way to determine the language of content is to use a detection algorithm such as n-gram analysis. Note that this can fail on short text, or if run against preformatted sections.

Misconfigured capsules exist that will send "lang=" attributes on content that doesn't make sense, like "image/png;lang=en". This can break naive MIME type parsing code.

Charset attributes

In Gemini, the default character encoding for all text/* MIME types is UTF-8. Most modern content is written in UTF-8, and most "text/gemini" content is in fact using UTF-8.

If content uses another encoding, the server is supposed to specify this via the "charset=" attribute on the MIME type. Many sites fail to do this. For example, there are numerous Gemini mirrors of Textfiles.com. These files date from the 1980s and use extended ASCII or other character sets, which don't render properly if assumed to be UTF-8.

Automatic charset detection is a well-researched and difficult problem. There are no silver bullets, only trade-offs. This makes it very difficult to reliably parse or index text/plain files.
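
One of those trade-offs, sketched in code: try the declared charset, then UTF-8, then fall back to a single legacy encoding. The choice of cp437 as the fallback is purely my assumption (it is a common encoding for DOS-era text files), and decode_text is a hypothetical helper name. This is not real charset detection, just an ordered list of guesses.

```python
from typing import Optional


def decode_text(body: bytes, declared: Optional[str] = None,
                fallback: str = "cp437") -> tuple[str, str]:
    """Decode body, returning (text, encoding_used).

    Order of guesses: declared charset, UTF-8, then the fallback.
    The fallback is an assumption, not something a server told us.
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return body.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: move on to the next guess.
            continue
    return body.decode(fallback, errors="replace"), fallback
```

The weakness is obvious: single-byte encodings accept almost any byte sequence, so the fallback never fails, it just may produce the wrong characters.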

Gemipedia: Charset detection

Misconfigured capsules exist that will send charset attributes for content that doesn't make sense, like "image/png;charset=utf-8". This can break naive MIME type parsing code.
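
Here is a minimal sketch of a MIME type parser that survives these cases. It ignores charset and lang parameters on anything that is not text/*, and it splits a comma-separated lang value into a list. The name parse_meta and the return shape are my own choices for illustration.

```python
def parse_meta(meta: str):
    """Parse a success meta field into (mime_type, charset, langs)."""
    parts = [part.strip() for part in meta.split(";")]
    # An empty meta field means the default MIME type.
    mime_type = parts[0].lower() or "text/gemini"
    params = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if not sep:
            continue  # skip malformed parameters instead of failing
        params[key.strip().lower()] = value.strip().strip('"')
    if not mime_type.startswith("text/"):
        # charset and lang are meaningless on non-text content.
        return mime_type, None, []
    charset = params.get("charset", "utf-8")
    langs = [tag.strip() for tag in params.get("lang", "").split(",") if tag.strip()]
    return mime_type, charset, langs
```

For example, "image/png;charset=utf-8" comes back as ("image/png", None, []), so nothing downstream tries to decode a PNG as text.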

Server Weirdness

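Some servers don't handle a request URL that includes the explicit default port (1965). A request like this:

gemini://example.com/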

will work, but a request like this

gemini://example.com:1965/

throws an error.
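
A defensive workaround is to drop the default port from the URL before sending the request. The sketch below assumes that is acceptable for your client. The helper name strip_default_port is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORT_SUFFIX = ":1965"


def strip_default_port(url: str) -> str:
    """Remove an explicit :1965 from a gemini URL, leaving other URLs alone."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if parts.scheme == "gemini" and netloc.endswith(DEFAULT_PORT_SUFFIX):
        netloc = netloc[: -len(DEFAULT_PORT_SUFFIX)]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Non-default ports are left untouched, since those are required to reach the server at all.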

Robots.txt

Gemini specifies that only a subset of Robots.txt is valid. Specifically it does not support:

Many capsules will attempt to use these. That behavior is undefined.

TLS 1.3

Many capsules are TLS 1.3 only. Make sure you are using a TLS library that supports it.
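
In Python, a quick way to check is the ssl.HAS_TLSv1_3 flag. The sketch below builds a client context and refuses to continue if the underlying TLS library lacks TLS 1.3. Two assumptions of mine are baked in: certificate verification is turned off because so many capsules use self-signed certificates, and the caller is expected to do its own certificate pinning, which is not shown here.

```python
import ssl


def make_gemini_tls_context() -> ssl.SSLContext:
    """Build a client TLS context suitable for talking to capsules."""
    if not ssl.HAS_TLSv1_3:
        raise RuntimeError("this Python/OpenSSL build lacks TLS 1.3 support")
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Allow TLS 1.2 and newer; TLS 1.3 is negotiated when the server offers it.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Assumption: self-signed certificates are common, so CA verification
    # is disabled. The caller should pin certificates itself.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context
```

The context can then be passed to context.wrap_socket(sock, server_hostname=host) when connecting.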

Gemtext

Block quote lines don't have to have a leading space:

>This is valid
> So is this

List item lines *DO* have to have a leading space:

* This is a list item
*This is not a list item, just a text line


Link lines don't have to have a leading space.

=>gemini://example.com/ this is totally valid

The spec is a little unclear about this, and all the examples show whitespace after the "=>". In fact, any amount of whitespace (zero or more characters) is allowed, and that whitespace may only consist of spaces (0x20) or tabs (\t).

=>\t\t\t\t\t\t\t\tgemini://example.com/ lots of leading tabs, still works as a link
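
The whitespace rule for link lines can be sketched like this. The function name parse_link_line is mine, and returning an empty string for a missing label is a design choice, not something the spec dictates.

```python
import re


def parse_link_line(line: str):
    """Return (url, label) for a gemtext link line, or None if it isn't one."""
    if not line.startswith("=>"):
        return None
    # Zero or more spaces/tabs may follow the "=>" marker.
    rest = line[2:].strip(" \t")
    if not rest:
        return None  # "=>" with no URL is not a usable link
    # The URL ends at the first run of spaces or tabs; the rest is the label.
    parts = re.split(r"[ \t]+", rest, maxsplit=1)
    url = parts[0]
    label = parts[1] if len(parts) > 1 else ""
    return url, label
```

Both of the link lines shown above parse to the same URL, with or without the leading whitespace.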

Header lines don't have to have a space:

#Here is a valid header
## Here is a valid header as well

Header lines are not required to be in any order. While uncommon, you will find gemtext with out-of-order header lines like this:

hello
### first header, but at a depth of 3
blah
# now a "higher" header?
yep, stuff is
##crazy
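
Pulling the line rules in this section together, here is a minimal line classifier. It is only a sketch: classify_line and the kind names are hypothetical, and it does not track whether you are currently inside a preformatted block, which a real parser has to do before classifying anything.

```python
def classify_line(line: str) -> tuple[str, str]:
    """Return (kind, text) for a single gemtext line outside preformatted blocks."""
    if line.startswith("```"):
        return "preformat_toggle", line[3:].strip()
    if line.startswith("=>"):
        return "link", line[2:].strip()
    # Check the longest heading prefix first; no space is required after it.
    for prefix, kind in (("###", "heading3"), ("##", "heading2"), ("#", "heading1")):
        if line.startswith(prefix):
            return kind, line[len(prefix):].strip()
    # List items need the space; "*foo" is ordinary text.
    if line.startswith("* "):
        return "list", line[2:].strip()
    # Quote lines do not need a space after ">".
    if line.startswith(">"):
        return "quote", line[1:].strip()
    return "text", line
```

Because each line is classified on its own, out-of-order headings like the ones above need no special handling. Anything that builds an outline from the result should just not assume that a level 3 heading has a level 1 or level 2 parent.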