2022-08-15 | #mirrors | @Acidus
As mentioned on Station, Drew DeVault's capsule has been offline for a month or so:
"Seems drewdevault finally pulled the plug on their capsule. Sad to see it go"
I wrote to Drew DeVault asking if he can restore his capsule... Will he ever reply to me?
Luckily, I was able to reconstruct most of Drew's capsule using saved content from older crawls from Kennedy, my Gemini search engine. I rewrote the internal hyperlinks to be relative links, so you can read the capsule online or off.
[Update: Drew's capsule, and any others, are now available via Delorean Time Machine]
Archive of Drew DeVault's capsule
The capsule had some CGIs which obviously won't function. Also, my captured data predated Kennedy's image search feature, so I was only storing responses with a "text/*" MIME type. There are about 30 images on his capsule that I don't have a copy of. All in all, I salvaged 110 pages.
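To make the "text/*" filter concrete, here is a minimal sketch of that kind of check. It is not Kennedy's actual code: the function name `should_store` is hypothetical, and it assumes the crawler has the raw Gemini response header line (status code, a space, then the meta field) on hand.

```python
def should_store(header: str) -> bool:
    """Decide whether to keep a response body, given its Gemini header line.

    Hypothetical helper: keeps only status 20 responses whose MIME type
    is text/*, which is why images would be skipped.
    """
    status, _, meta = header.strip().partition(" ")
    if status != "20":
        return False
    # Drop parameters such as "; charset=utf-8". An empty meta on a
    # success response is treated as text/gemini.
    mime = meta.split(";", 1)[0].strip().lower() or "text/gemini"
    return mime.startswith("text/")


if __name__ == "__main__":
    print(should_store("20 text/gemini; charset=utf-8\r\n"))
    print(should_store("20 image/png\r\n"))
```

With a filter like this, a PNG served with "20 image/png" is visited but its body is never saved, which matches the gap in the archive.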
Usually search engines have a centralized database of results, which the crawler uses to determine what content should be visited, and it refreshes those results over and over again, continuously. I tend to make a lot of changes to Kennedy: how the crawler works, the data it collects, and how that data is stored. This is true today and was certainly true in the first few months of building Kennedy, as I was organically figuring it all out. So from the very beginning, I wrote the Kennedy crawler to always start fresh. Each time the crawler runs, it produces a new search database and a data store that contains saved copies of all the responses.
This self-contained approach turned out to be super helpful:
A side benefit of this approach is that I tend to have older copies of the search database and data store scattered around, including a copy from mid-June that had ~140 files from Drew's capsule.
Oh hot damn! Last week @freezr posted about trying to get Drew DeVault's capsule back online. I went looking at data from old Kennedy crawls and found I had visited 124 URLs on his capsule in mid June. Back then I only cached text content, which returned a status of 20. So I have 104 gemtext pages from Drew's Capsule. I need to write some code to export that (maybe make it a gempub as well) and then I'll post it back on line! Saving full bodies, FTW!
I wrote code that pulled all this content out and saved it to files. I've done similar work with website data in the past. Usually there are problems when characters in the URL are not allowed in a file name. Things like query strings are especially annoying, and file systems often have limits on the maximum length of a path, which makes it difficult to have clear URL-to-file mappings. Luckily, most capsules tend not to use query strings, and the URLs are fairly simple.
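A rough sketch of one possible URL-to-file mapping follows. It is an illustration under my own assumptions, not the code used for the archive: the name `url_to_relpath`, the "index.gmi" default for directory URLs, and the choice to fold query strings into a short hash are all hypothetical. It also does not deal with path length limits.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_to_relpath(url: str, default_name: str = "index.gmi") -> str:
    """Map a capsule URL to a relative file path (hypothetical scheme)."""
    parts = urlsplit(url)
    # Split first, then percent-decode each segment, and drop anything
    # that could climb out of the export directory.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    segments = [s for s in segments if s not in (".", "..")]
    # Directory-style URLs (and the capsule root) get a default file name.
    if not segments or parts.path.endswith("/"):
        segments.append(default_name)
    # Replace characters that are awkward in file names.
    safe = [re.sub(r"[^A-Za-z0-9._-]", "_", s) for s in segments]
    if parts.query:
        # Keep query variants apart without putting the raw query in the name.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        name = PurePosixPath(safe[-1])
        safe[-1] = f"{name.stem}_{digest}{name.suffix}"
    return "/".join(safe)


if __name__ == "__main__":
    print(url_to_relpath("gemini://example.org/gemlog/post.gmi"))
    print(url_to_relpath("gemini://example.org/gemlog/"))
```

Hashing the query keeps file names short and legal, at the cost of no longer being able to read the original query back from the file name.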
The surprisingly hard part of this project was writing code that would rewrite the links in Drew's gemtext to be relative links. This was critical to allowing a reader to navigate around the extracted pages. I'll still consider creating a gempub of the content at some point. Besides Lagrange, I don't know of any clients that support it. The work seems to have stalled on the spec:
Does anyone use a client that supports it?
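Coming back to the link rewriting mentioned above, here is a minimal sketch of how it could be done. This is not the code I used for the archive, just an illustration: the function names `to_relative` and `rewrite_links` are hypothetical, and it assumes that only same-host gemini links and absolute paths need rewriting, while links to other capsules, other schemes, and already-relative links are left alone. The path math is done by hand with posixpath, because Python's `urljoin` does not resolve relative references for the gemini scheme out of the box.

```python
import posixpath
import re
from urllib.parse import urlsplit

# A gemtext link line: "=>", optional whitespace, a URL, then an optional label.
LINK_RE = re.compile(r"^(=>[ \t]*)(\S+)(.*)$")


def to_relative(target: str, page_url: str) -> str:
    """Return target as a path relative to page_url, when it is same-capsule."""
    page = urlsplit(page_url)
    link = urlsplit(target)
    if link.scheme and link.scheme != "gemini":
        return target  # http, mailto, and so on: leave untouched
    if link.netloc and link.hostname != page.hostname:
        return target  # a different capsule
    if not link.netloc and not link.path.startswith("/"):
        return target  # already relative
    abs_path = link.path or "/"
    page_dir = posixpath.dirname(page.path) or "/"
    rel = posixpath.relpath(abs_path, start=page_dir)
    if abs_path.endswith("/"):
        # relpath drops the trailing slash; put it back for directory links.
        rel = "./" if rel == "." else rel + "/"
    if link.query:
        rel += "?" + link.query
    if link.fragment:
        rel += "#" + link.fragment
    return rel


def rewrite_links(gemtext: str, page_url: str) -> str:
    """Rewrite link lines in a gemtext page, skipping preformatted blocks."""
    out = []
    preformatted = False
    for line in gemtext.split("\n"):
        if line.startswith("```"):
            preformatted = not preformatted
        elif not preformatted:
            match = LINK_RE.match(line)
            if match:
                prefix, url, rest = match.groups()
                line = prefix + to_relative(url, page_url) + rest
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    page = "gemini://example.org/gemlog/post.gmi"
    print(rewrite_links("=> gemini://example.org/about.gmi About", page))
```

The preformatted-block toggle matters more than it looks: a line starting with "=>" inside a code listing is just text, and rewriting it would quietly corrupt the page.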