2023-07-06 | #kennedy #search | @Acidus
My Gemini search engine Kennedy now lets you filter a search query to a specific type of file like PDFs, MP3s, or ZIPs. This makes it really easy to find specific types of files about a subject.
To use this new feature, just add "filetype:[whatever]" to your query, as shown in the examples above.
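As a rough illustration of the syntax, here is a minimal Python sketch of pulling a `filetype:` operator out of a raw query. The `parse_query` function and the regular expression are hypothetical and written only for this example; Kennedy's actual parser is not shown in this post.

```python
import re

# Hypothetical pattern for the operator; not taken from Kennedy's code.
FILETYPE_RE = re.compile(r"\bfiletype:(\S+)", re.IGNORECASE)

def parse_query(query):
    """Split a raw query into (search terms, file type or None)."""
    match = FILETYPE_RE.search(query)
    filetype = match.group(1).lower() if match else None
    # Remove the operator so only the plain search terms remain.
    terms = FILETYPE_RE.sub(" ", query).split()
    return terms, filetype

print(parse_query("unix filetype:pdf"))  # (['unix'], 'pdf')
print(parse_query("macintosh plan"))     # (['macintosh', 'plan'], None)
```

Keeping the operator separate from the terms means the file type can be applied as a filter while the remaining words are matched as usual.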
While other Gemini search engines have had a limited ability to do this, I improved upon it in several ways:
The "More Info / Archived Copy" link on each result is especially helpful when searching for files. This gives you metadata about the file, including links to the pages which link to the file. Visiting those is a great way to get additional context about a file:
Page Info: /~anthonyg/docs/unixpowertools.pdf
As mentioned in a previous post on indexing plain text files, MIME types are notoriously unreliable. Not only can a file be served with the wrong MIME type, but several different MIME types can all mean the same "type" of file. So searching by a single MIME type won't find all the files of that type. On top of that, all of this assumes the user even knows the MIME type for a type of file to begin with.
MIME Lies: Indexing Plain Text files in Kennedy
So clearly, MIME types were out.
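To make the problem concrete, here is a small Python sketch of exact-match filtering on MIME type. The URLs and records are made up for illustration, and `search_by_mime` is a hypothetical helper, not anything from Kennedy.

```python
# Illustrative records only: (URL, MIME type reported by the capsule).
RESPONSES = [
    ("gemini://a.example/song.mp3", "audio/mpeg"),
    ("gemini://b.example/track.mp3", "audio/mp3"),
    ("gemini://c.example/tune.mp3", "application/octet-stream"),
]

def search_by_mime(responses, mime):
    """Return URLs whose reported MIME type matches exactly."""
    return [url for url, reported in responses if reported == mime]

# Only one of the three MP3 files is found.
print(search_by_mime(RESPONSES, "audio/mpeg"))
# ['gemini://a.example/song.mp3']
```

All three files are MP3s, but a user who searches for `audio/mpeg` misses two of them.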
Instead, I do what Google and other web search engines do, and determine what type of file something actually is. If you want PDFs, Kennedy should be able to find PDFs for you, regardless of whatever gross or incorrect MIME types various capsules use to serve them. Right now I'm using a pretty primitive way to determine file type, but I can always improve my file type detection code to recognize more file types, without you needing to use a different search syntax.
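The post does not say how the detection works, so the following Python sketch is only one plausible approach, stated as an assumption: check the first bytes of the response against a few well-known file signatures, then fall back to the extension in the URL path. The `detect_filetype` function, the `MAGIC` table, and `KNOWN_EXTENSIONS` are all hypothetical names.

```python
import posixpath
from urllib.parse import urlparse

# A few well-known leading byte signatures (assumed, deliberately incomplete).
# Note: the ZIP signature also matches ZIP-based formats such as EPUB.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"ID3", "mp3"),
]
KNOWN_EXTENSIONS = {"pdf", "zip", "mp3", "mp4"}

def detect_filetype(url, head=b""):
    """Guess a file type from the first bytes, then from the URL extension."""
    for signature, filetype in MAGIC:
        if head.startswith(signature):
            return filetype
    # MP4 files carry an 'ftyp' box marker at byte offset 4.
    if head[4:8] == b"ftyp":
        return "mp4"
    ext = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return ext if ext in KNOWN_EXTENSIONS else None

print(detect_filetype("gemini://example.org/docs/manual", b"%PDF-1.4\n"))  # pdf
print(detect_filetype("gemini://example.org/files/ARCHIVE.ZIP"))           # zip
```

Because the MIME type never enters into it, a PDF served as `application/octet-stream` is still classified as a PDF, and new signatures can be added without changing the search syntax.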
That gets me a better way to filter to certain file types. How about the actual searching? Indexing files, especially non-text files like ZIP files, is a great example of how you can use the hyperlink nature of Gemini to your advantage. I don't need to index the contents of a PDF to know it's the Apple Macintosh business plan. The text used in hyperlinks that point to the file provides a lot of context about the contents of the file.
Preliminary Macintosh Business Plan (1981)
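Here is a minimal Python sketch of the link text idea: an inverted index built only from the text of links pointing at a file, never from the file's contents. The function names and the example URL are hypothetical, and this is a simplification for illustration rather than Kennedy's real index.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(links):
    """links: iterable of (target_url, link_text) pairs found while crawling."""
    index = defaultdict(set)
    for target, text in links:
        for token in tokenize(text):
            index[token].add(target)
    return index

def search(index, terms):
    """Return targets whose incoming link text contains every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical URL; the link text is the example from this post.
links = [("gemini://example.org/files/plan.pdf",
          "Preliminary Macintosh Business Plan (1981)")]
index = build_index(links)
print(search(index, ["macintosh", "plan"]))
# {'gemini://example.org/files/plan.pdf'}
```

The PDF is never opened, yet a search for "macintosh plan" still finds it because a page somewhere described it in a link.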
I used the exact same strategy when I rolled out image search for Kennedy last year. To learn more about how link text and path structure can be used in search indexes, read that post:
Finding images on Gemini with Kennedy Image Search
You don't have to specify a search query. If you want all the MP4 files in Geminispace, you can just do a search with a `filetype` operator, and no search terms:
All MP4 videos in Geminispace.
Turns out someone is hosting the classic zombie movie Night of the Living Dead 🧟
I'm always making changes to Kennedy, and much of it is based on feedback I get. Give it a try and let me know what you think.
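A search with only a `filetype` operator can be pictured as a filter with an empty list of terms. The Python sketch below assumes a simple in-memory list of records with `url`, `filetype`, and `link_text` keys; that shape, the `run_search` function, and the sample data are all invented for illustration.

```python
def run_search(docs, terms, filetype=None):
    """Filter by file type first, then require every term in the link text."""
    results = []
    for doc in docs:
        if filetype and doc["filetype"] != filetype:
            continue
        text = doc["link_text"].lower()
        # all() over an empty list is True, so no terms means "match everything".
        if all(term.lower() in text for term in terms):
            results.append(doc["url"])
    return results

# Made-up sample records.
docs = [
    {"url": "gemini://a.example/notld.mp4", "filetype": "mp4",
     "link_text": "Night of the Living Dead"},
    {"url": "gemini://b.example/plan.pdf", "filetype": "pdf",
     "link_text": "Preliminary Macintosh Business Plan (1981)"},
]
print(run_search(docs, [], filetype="mp4"))
# ['gemini://a.example/notld.mp4']
```

With no terms the term check passes for every record, so the result is simply every file of the requested type.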