💾 Archived View for gemi.dev › gemlog › 2022-02-21-kennedy-cached-content.gmi captured on 2023-06-16 at 16:17:29. Gemini links have been rewritten to link to archived content
2022-02-21 | #search #kennedy #delorean | @Acidus
Today I was trying to reply to an email about how Kennedy, my Gemini search engine, handles robots.txt rules. ~Solderpunk has a companion spec for using robots.txt in Gemini space, which is basically the original robots.txt spec, with really primitive "Disallow" rules, and some guidance on how to handle user-agents.
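To make the "Disallow" prefix rules and pseudo user-agents a bit more concrete, here is a minimal Python sketch of how a crawler might apply them. It is not Kennedy's actual code. The function names are made up for illustration, and it assumes a crawler should honor the rules for every group that names it (here "indexer" plus the "*" wildcard), which is one reasonable reading rather than a quote from the spec.

```python
def parse_robots(text):
    """Map each lower-cased user-agent to its list of Disallow path prefixes."""
    rules = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a Disallow line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks, comments, and anything unrecognized
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field == "disallow" and current_agents:
            in_rules = True
            if value:  # an empty Disallow blocks nothing
                for agent in current_agents:
                    rules[agent].append(value)
    return rules


def is_allowed(rules, path, agents=("indexer", "*")):
    """Return False if any rule for any of our agents is a prefix of path."""
    for agent in agents:
        for prefix in rules.get(agent, []):
            if path.startswith(prefix):
                return False
    return True
```

A crawler would call parse_robots() once per capsule on the body of /robots.txt, then check is_allowed() for each path before fetching it. An archiver would pass agents=("archiver", "*") instead.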
Unfortunately, I couldn't link to this content, because gemini.circumlunar.space is currently offline. (The DNS resolves, but nothing is listening on port 1965.) It never occurred to me that one of the original capsules, containing the specifications for Gemini, would go offline, so I never thought to cache it. People like ~ew make a local copy of content when they reply to it on their gemlog, but that habit wouldn't have helped me here, since I had never cached the spec.
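For anyone who wants to tell "DNS is broken" apart from "DNS resolves but nothing is listening", a rough Python sketch of that check follows. The probe() function and its return strings are hypothetical names for this example. It only attempts a TCP connection and does not perform the TLS handshake or send a Gemini request.

```python
import socket


def probe(host, port=1965, timeout=5.0):
    """Classify a host as 'dns-failure', 'listening', or 'not-listening'."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"  # the name did not resolve at all
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "listening"  # something accepted the TCP connection
        except OSError:
            continue  # refused or timed out; try the next address, if any
    return "not-listening"
```

Calling probe() with a capsule's hostname and the default port 1965 returns one of the three strings above.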
Luckily, Kennedy's crawler keeps a local copy of documents in Gemini space. I do this so I can try different indexing and search strategies without having to do an entire re-crawl. This made it trivially easy to add a great new feature: View Cached Content.
So I have added "Cached copy" links to Kennedy search results, which let you view the copy of a URL as it was when Kennedy crawled it.
Screenshot of Kennedy results with "Cached copy" link.
Kennedy results for "robots.txt"
This is super helpful. Sometimes when clicking on a search result, you get an error because that capsule is offline or otherwise unavailable. With the cached copy, you can still read the content.
For example, here is the cached version of the robots.txt companion spec from gemini.circumlunar.space:
Kennedy's cached copy of the robots.txt companion spec page
While having an option to view cached content is great when looking at search results, it's not great if you want to see the cached copy of something specific. For example, I know that gemini.circumlunar.space is offline. I should be able to pull up the cached contents directly by URL. I shouldn't have to try to find it via the search results and follow the "Cached copy" link from there.
So I also built another feature I'm calling Delorean, after the DeLorean time machine from Back to the Future. Delorean allows you to provide a URL, and see its cached contents, if a cached copy exists in the search database.
🏎 DeLorean: View Cached Gemini content
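The lookup described above, where you give a URL and get back its cached contents if a copy exists, can be sketched in a few lines of Python. This is purely illustrative and says nothing about how Kennedy stores documents. The Cache class, CachedPage record, and normalize() helper are invented for the example, and the normalization rules (lower-case host, drop the default port 1965, treat an empty path as "/") are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass
class CachedPage:
    url: str
    body: str
    crawled_at: datetime


def normalize(url):
    """Reduce equivalent spellings of a URL to one cache key."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    port = "" if parts.port in (None, 1965) else f":{parts.port}"
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{host}{port}{path}{query}"


class Cache:
    """In-memory stand-in for a crawler's store; keeps only the latest copy."""

    def __init__(self):
        self._pages = {}

    def store(self, url, body, crawled_at=None):
        key = normalize(url)
        when = crawled_at or datetime.now(timezone.utc)
        self._pages[key] = CachedPage(key, body, when)  # overwrites older copy

    def lookup(self, url):
        """Return the CachedPage for url, or None if it was never crawled."""
        return self._pages.get(normalize(url))
```

Because store() overwrites by key, a second crawl of the same URL replaces the first. That matches a "latest copy only" cache, not a Wayback-style history.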
These new features are not the same as the Internet Archive's Wayback Machine. I only keep a local cached copy for content that would appear in Kennedy search results. This means:
Building Delorean into something more like the Wayback Machine would be a bit involved. Funnily enough, the robots.txt piece I could not link to talks about different pseudo user-agents for a search engine (the "indexer") versus an "archiver", and it specifically mentions the Wayback Machine as an example of an archiver. Right now, I'm just using the Kennedy search database, and I'm only keeping the latest copy, so I feel that following the "indexer" rules is probably OK. If I do more, I will need to start following the "archiver" rules, which could be different. Other challenges:
For now, I have a lot on my plate, so creating a Wayback-style archive isn't a priority at all. However, I'm very open to feedback about this. What would you want to see in a Gemini archive?