💾 Archived View for her.st › holy-texts › oddities.gmi captured on 2023-07-22 at 16:34:15. Gemini links have been rewritten to link to archived content
archived 2022-09-06
The specs for Gemini and Gemtext are short and easy to read. However, there are some quirks that I think are easy to overlook. Additionally, there is what the specification says, and then there is what capsules actually do in Geminispace. This is a collection of strange things that have caused me problems.
For success responses, the MIME Type is optional. So while most responses look like this:
20 text/gemini\r\n
This is also valid:
20 \r\n
(NOTE: That is 20, followed by a space (0x20), followed by a CRLF).
If you don't receive a MIME type, the assumed MIME type is "text/gemini".
For error responses (4x, 5x, 6x), the error message is optional. So this is valid:
40 \r\n
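As a rough illustration of the two rules above, here is a minimal sketch of a header parser that tolerates an empty meta field. The function name `parse_header` is made up for this example, and it assumes the header line has already been read off the socket as bytes.

```python
def parse_header(raw: bytes):
    """Split a Gemini response header into (status, meta).

    Hypothetical helper: tolerates an empty meta field and fills in
    the default MIME type for success responses.
    """
    line = raw.decode("utf-8")
    if line.endswith("\r\n"):
        line = line[:-2]
    # Everything before the first space is the two-digit status code.
    status_text, _, meta = line.partition(" ")
    if len(status_text) != 2 or not (status_text.isascii() and status_text.isdigit()):
        raise ValueError(f"bad status code: {status_text!r}")
    status = int(status_text)
    meta = meta.strip()
    # 2x with no MIME type: assume text/gemini.
    if 20 <= status <= 29 and not meta:
        meta = "text/gemini"
    return status, meta
```

For error responses the empty meta is simply passed through, so callers should be ready to show a generic message of their own.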
3x responses are redirects. While most capsules use absolute URLs, the spec allows both relative and absolute URLs. Some capsules send protocol-relative URLs (e.g. "//example.com/foo" instead of "gemini://example.com/foo").
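One wrinkle if you are writing a client in Python: `urllib.parse.urljoin` only resolves relative references for schemes it knows about, and "gemini" is not one of them. The sketch below works around that by resolving against a temporary "http" base and then restoring the original scheme. The name `resolve_redirect` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(base_url: str, target: str) -> str:
    """Resolve a redirect target against the URL that was requested."""
    base = urlsplit(base_url)
    # urljoin ignores the base for unknown schemes, so pretend it is http.
    http_base = base._replace(scheme="http").geturl()
    joined = urlsplit(urljoin(http_base, target))
    # Relative and protocol-relative targets carry no scheme of their own,
    # so they inherit the scheme of the original request.
    if not urlsplit(target).scheme:
        joined = joined._replace(scheme=base.scheme)
    return joined.geturl()
```

Absolute targets, including ones that point at a different scheme, pass through unchanged.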
The vast majority of content in Geminispace uses the MIME types text/gemini or text/plain. These are usually correct. Once you start getting into image/* MIME types, or other esoteric types, the given MIME type is often incorrect. Don't assume the file format from the MIME type. Use a real file format detection library based on magic numbers, like the `file` command.
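To show the idea of checking magic numbers, here is a deliberately tiny sketch that only knows a handful of well-known signatures. A real crawler would hand the bytes to a proper detection library instead. `SIGNATURES` and `sniff_type` are names invented for this example.

```python
# A few well-known file signatures. This list is illustrative, not complete.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_type(body: bytes, declared: str) -> str:
    """Prefer the type implied by the leading bytes over the declared one."""
    for magic, mime in SIGNATURES:
        if body.startswith(magic):
            return mime
    # Nothing recognised: fall back to what the server claimed.
    return declared
```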
The Gemini spec recommends using the "lang=" attribute on the MIME type to specify a language. Most capsules don't do this. You will find content written in French or Russian that was not sent with a lang attribute.
How a page should specify that it contains content in multiple languages is not well defined. Some capsules send a comma-separated list.
The best way to know the language of the content is to use a detection algorithm such as n-grams. Note that this can fail on short text, or if run against preformatted sections.
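One way to keep preformatted sections out of the detector is to strip them first. This is a small sketch of that step only; the detection itself is left to whatever library you use. The helper name `prose_only` is hypothetical.

```python
# Built this way only to avoid writing three literal backticks in the example.
FENCE = "`" * 3


def prose_only(gemtext: str) -> str:
    """Drop preformatted blocks so only prose reaches the language detector."""
    kept = []
    in_pre = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            # A fence line toggles preformatted mode and is never prose.
            in_pre = not in_pre
            continue
        if not in_pre:
            kept.append(line)
    return "\n".join(kept)
```

If the result is only a sentence or two, it is probably safer to report the language as unknown than to trust the detector.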
Misconfigured capsules exist that will send "lang=" attributes on content where they don't make sense, like "image/png;lang=en". This can break naive MIME type parsing code.
In Gemini, the default character encoding for all text/* MIME types is UTF-8. Most modern content is written in UTF-8, and most "text/gemini" content is in fact using UTF-8.
If content uses another encoding, the server is supposed to specify this via the "charset=" attribute on the MIME type. Many sites fail to do this. For example, there are numerous Gemini mirrors of Textfiles.com. These files date from the 1980s and use extended ASCII or other character sets, which don't render properly if assumed to be UTF-8.
Automatic charset detection is a well-researched and difficult problem. There are no silver bullets, only trade-offs. This makes it very difficult to reliably parse or index text/plain files.
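One of those trade-offs, sketched: try the declared charset, then strict UTF-8, and only then fall back to a single-byte encoding that can never fail. Using CP437 as the fallback is purely an assumption that suits old BBS-era files. It will be wrong for plenty of other content. `decode_body` is a made-up name.

```python
def decode_body(data: bytes, declared=None, fallback="cp437"):
    """Return (text, encoding_used) using a simple fallback chain."""
    for name in (declared, "utf-8"):
        if not name:
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a charset name Python does not know.
            continue
    # Assumption: a single-byte legacy encoding as the last resort.
    return data.decode(fallback, errors="replace"), fallback
```

Returning the encoding that was actually used makes it easier to log how often the guess kicks in.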
Misconfigured capsules exist that will send charset attributes on content where they don't make sense, like "image/png;charset=utf-8". This can break naive MIME type parsing code.
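A forgiving parser can sidestep most of these problems by only honoring "charset" and "lang" on text/* types, and by treating "lang" as a possibly comma-separated list. The sketch below takes that approach. `parse_mime` is a hypothetical helper, and it does not handle quoted parameter values that themselves contain a semicolon.

```python
def parse_mime(meta: str):
    """Return (media_type, charset, langs) from a success response's meta."""
    media, *rest = [part.strip() for part in meta.split(";")]
    media = media.lower() or "text/gemini"
    params = {}
    for item in rest:
        key, sep, value = item.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip().strip('"')
    if not media.startswith("text/"):
        # charset and lang are meaningless here, so ignore them.
        return media, None, []
    charset = params.get("charset", "utf-8").lower()
    langs = [tag.strip() for tag in params.get("lang", "").split(",") if tag.strip()]
    return media, charset, langs
```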
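Some capsules mishandle a request URL that explicitly includes the default port (1965). A request like this
gemini://example.com/
will work, but a request like this
gemini://example.com:1965/
throws an error.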
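A client can avoid tripping over this by dropping the port from the URL whenever it is the default. A minimal sketch, with `strip_default_port` being a name invented here. It rebuilds the authority from the host alone, which assumes the URL has no userinfo part.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 1965


def strip_default_port(url: str) -> str:
    """Remove an explicit :1965 so picky servers accept the request."""
    parts = urlsplit(url)
    if parts.port != DEFAULT_PORT:
        return url
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets back.
        host = f"[{host}]"
    return parts._replace(netloc=host).geturl()
```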
Gemini specifies that only a subset of Robots.txt is valid. Specifically it does not support:
Many capsules will attempt to use these. That behavior is undefined.
Many capsules are TLS 1.3 only. Make sure you are using a TLS library that supports it.
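In Python, for instance, you can check for TLS 1.3 support up front instead of discovering it through failed handshakes. This is only a sketch: `make_gemini_context` is a hypothetical helper, and turning off CA verification is an assumption made here because many capsules use self-signed certificates. A real client would pair it with its own certificate pinning.

```python
import ssl


def make_gemini_context() -> ssl.SSLContext:
    """Build a client context that can talk to TLS 1.3-only capsules."""
    if not ssl.HAS_TLSv1_3:
        raise RuntimeError("this Python/OpenSSL build lacks TLS 1.3 support")
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Allow 1.2 and up; the handshake negotiates 1.3 where the server wants it.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # ASSUMPTION: no CA validation, because self-signed certificates are
    # common. check_hostname must be disabled before verify_mode is changed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```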
Block quote lines don't have to have a leading space:
>This is valid
> So is this
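List item lines *DO* have to have a leading space: the line must start with "* " (an asterisk followed by a space), so a line like "*item" is ordinary text, not a list item.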
Link lines don't have to have a leading space.
=>gemini://example.com/ this is totally valid
The spec is a little obtuse about this, and all of its examples show whitespace. In fact, any amount of whitespace (zero or more characters) is allowed between "=>" and the URL, and the only whitespace characters allowed are space (0x20) and tab (\t).
=>\t\t\t\t\t\t\t\tgemini://example.com/ lots of leading tabs, still works as a link
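Here is a small sketch of a link-line parser that follows those rules, using a regular expression that accepts only spaces and tabs as separators. The name `parse_link` is made up for the example.

```python
import re

# "=>", optional spaces/tabs, the URL, then an optional label.
LINK_RE = re.compile(r"=>[ \t]*([^ \t]+)(?:[ \t]+(.*))?$")


def parse_link(line: str):
    """Return (url, label) for a link line, or None if it is not one."""
    match = LINK_RE.match(line)
    if match is None:
        return None
    url, label = match.group(1), match.group(2) or ""
    return url, label.strip(" \t")
```

An empty label comes back as an empty string, so the caller can decide whether to display the URL itself.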
Header lines don't have to have a space:
#Here is a valid header
## Here is a valid header as well
Header lines are not required to be in any order. While uncommon, you will find gemtext with out-of-order header lines like this:
hello
### first header, but at a depth of 3
blah
# now a "higher" header?
yep, stuff is
##crazy
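Pulling the whitespace rules above together, a line classifier might look roughly like this. It is a sketch, not a full gemtext parser: `classify` is a hypothetical name, and the caller is assumed to handle preformatted toggle lines before calling it.

```python
def classify(line: str):
    """Return (kind, text) for one gemtext line outside preformatted mode."""
    if line.startswith("=>"):
        # Whitespace after "=>" is optional.
        return "link", line[2:].strip(" \t")
    if line.startswith("#"):
        # Count leading "#" characters, capped at three levels.
        level = min(len(line) - len(line.lstrip("#")), 3)
        return f"heading{level}", line[level:].strip()
    if line.startswith("* "):
        # Lists are the one line type where the space is mandatory.
        return "list", line[2:]
    if line.startswith(">"):
        return "quote", line[1:].lstrip()
    return "text", line
```

Because headers can appear in any order, anything that builds an outline from these results should not assume a level 1 header comes first.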