MIME Lies: Indexing Plain Text files in Kennedy

2023-06-28 | #kennedy #search #txt | @Acidus

I recently enabled indexing of plain text files in Kennedy, my search engine. Their contents are now part of the full text index, so they show up in search results.

Given the large libraries and mirrors present in Geminispace (Textfiles.com mirrors, RFCs, Amiga and Apple II archives), this adds a huge amount of content that can now be more easily discovered. 33,000+ more files are now indexed, representing about 10% of all documents!

As an example, here is a search for 'phreak'. While there aren't a lot of gemtext files about hacking and phreaking, there are a ton of text files 😈

Kennedy Search for 'phreak'

I've wanted to do this for a while, but indexing arbitrary text documents for search isn't as easy as it may appear.

How?

You can't just take all text files, stick them in a full text search, and expect it to be a good experience. That's because the contents of a "text file" can vary a lot, and can pollute the full text index, resulting in poor results for searches. For example:

Source code files: shell scripts, C, Perl, Rust, Go, C#, Java. These have a ton of non-text symbols, and "words but not really words" like variable names or keywords.
ASCII art and formatting: With gemtext, I can just skip preformatted sections when indexing a document. With a plain text file, I have no way of knowing which parts are text and which are just symbols. Kind of a more extreme version of the source code problem.
The "Needle-in-a-haystack" problem: Text files can be quite large and a search term appearing two or three times in a 200 KiB document isn't a good result. This increases the sophistication required to score a search result.
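
The gemtext case is easy because the format marks its own preformatted regions: a line that starts with three backticks toggles preformatted mode on or off. A minimal sketch of dropping those regions before indexing could look like this. The function name is made up for illustration, and this is not Kennedy's actual code.

```python
FENCE = "`" * 3  # gemtext preformatted toggle marker


def strip_preformatted(gemtext: str) -> str:
    """Return gemtext with preformatted blocks (and their toggle lines) removed."""
    kept = []
    in_preformatted = False
    for line in gemtext.splitlines():
        if line.startswith(FENCE):
            # A toggle line flips preformatted mode. The toggle line itself,
            # including any alt text after the backticks, is dropped too.
            in_preformatted = not in_preformatted
            continue
        if not in_preformatted:
            kept.append(line)
    return "\n".join(kept)
```

A plain text file has no such marker, which is why the same trick doesn't carry over.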

However, the biggest challenge is that MIME types lie.

The MIME Lies!

Because gemtext is a very recent format, if something is served with a "text/gemini" MIME type, you can be reasonably certain that it is in fact gemtext. This sadly is not true for responses with the "text/plain" MIME type.

Many files are sent with a "text/plain" MIME type but aren't text files. This is probably due to misconfigured servers, or files that were named incorrectly. Regardless of how a non-text file ended up with a "text/plain" MIME type, treating it as text is bad. For example, taking a PDF or image or other binary file, interpreting its bytes as ASCII or UTF-8 text, and then ingesting that "text" into the search index explodes the size of the index, lowers performance, and returns terrible results. So we need to avoid that.

The MIME lies the other way too. A file may not have a "text/plain" MIME type, but is in fact a text file. Why? Because many servers fall back to picking a MIME type based on the file extension, so if your text file has an odd file extension, it is served with the wrong MIME type. On most Unix-like systems, you can see this mapping between MIME types and file extensions in the file `/etc/mime.types`.

Ever wonder why search engines or statistics about Gemini say there are 3000+ "application/x-mscardfile" files?

gemini://gemini.bortzmeyer.org/software/lupa/stats.gmi

Are there really thousands of files using an obscure early '90s Microsoft data format on Gemini?

"Cardfile" on Gemipedia

Of course not. It turns out blitter.com is hosting a mirror of the On-line Guitar Archive (OLGA), an archive of lyrics and guitar chords for thousands and thousands of songs.

21st Century Digital Boy, by Bad Religion

These files *are* text files, but because they have a .CRD or .TAB file extension, they get served with an incorrect MIME type. Some of the various Textfiles.com mirrors in Gemini have the same problem. They are text files with odd extensions, and thus are served with obscure and incorrect MIME types.
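
The extension fallback can be sketched in a few lines of Python. This is only an illustration of how a server might do the lookup, not the code of any particular Gemini server. The function names, the two-line `SAMPLE` excerpt, and the "application/octet-stream" default are all assumptions. The format parsed here is the one `/etc/mime.types` uses: a MIME type followed by zero or more extensions, with `#` starting a comment.

```python
import os


def parse_mime_types(text: str) -> dict[str, str]:
    """Parse mime.types-style text into an {extension: MIME type} mapping."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            # First entry wins if an extension is listed more than once.
            mapping.setdefault(ext.lower(), mime_type)
    return mapping


def guess_mime(path: str, mapping: dict[str, str],
               default: str = "application/octet-stream") -> str:
    """Pick a MIME type from the file extension alone, ignoring the content."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return mapping.get(ext, default)


# Illustrative excerpt only; a real mime.types file has hundreds of entries.
SAMPLE = """\
# MIME type                 extensions
text/plain                  txt asc text
application/x-mscardfile    crd
"""

mapping = parse_mime_types(SAMPLE)
print(guess_mime("olga/song.CRD", mapping))
```

The lookup never opens the file, so a guitar chord sheet named `song.CRD` gets the Cardfile type no matter what is inside it.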

In short, you can't trust MIME types.

Mislabeled Text Resources

To index all plain text files, I needed a way to find text files with the wrong MIME type. Determining the actual content type an arbitrary lump of bytes represents is a big topic. However, for now, I just need to know whether an arbitrary lump of bytes is a text document or not. Luckily, the WHATWG has an entire standard on how to determine various file types, including a great section describing an algorithm to (mostly) distinguish whether a resource is text or binary:

WHATWG MIME Sniffing Living Standard: Sniffing a mislabeled binary resource

I'm using this algorithm to help me identify and index text files.
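
For readers who haven't seen it, that section of the standard boils down to roughly this: look at the start of the resource, call it text if it begins with a UTF-16 or UTF-8 byte order mark, and otherwise call it text only if it contains none of a fixed set of control bytes. Below is a rough Python sketch written from my reading of the standard, not Kennedy's implementation. The function and constant names are invented here, and the 1445-byte header length is an assumption based on the standard's "resource header" definition, so check the current text of the standard before relying on it.

```python
UTF16_BOMS = (b"\xfe\xff", b"\xff\xfe")  # big-endian, little-endian
UTF8_BOM = b"\xef\xbb\xbf"

# Control bytes treated as a sign of binary content. Tab (0x09), LF (0x0A),
# FF (0x0C), CR (0x0D) and ESC (0x1B) are deliberately not in this set.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B] + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)

HEADER_LENGTH = 1445  # assumption: size of the sniffed "resource header"


def looks_like_text(data: bytes) -> bool:
    """Return True if the start of `data` passes the text-or-binary check."""
    header = data[:HEADER_LENGTH]
    if header.startswith(UTF16_BOMS) or header.startswith(UTF8_BOM):
        return True
    return not any(byte in BINARY_DATA_BYTES for byte in header)


print(looks_like_text(b"Am  C  G\nSome lyrics\r\n"))
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

The check only looks for control bytes in the first chunk of the file, so a binary format whose header happens to avoid them would still pass. That is the "(mostly)" part.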

Feedback wanted.

I'm always making changes to Kennedy, and much of it is based on feedback I get. Give it a try and let me know what you think.

Contact me