2023-12-09 | #mailinglist #mbox | @Acidus
I recently created a gemtext version of the Gemini Mailing List:
🛰🦊❤️ Orbital Fox Redux: Complete Mirror of Gemini Mailing List
Complete, threaded archive of the Gemini Mailing List
Orbital Fox (🦊🛰), the server that had hosted the mailing list before it died, also provided an HTML archive. People posting to the mailing list would often include hyperlinks to this HTML archive to point to previous discussions or decisions. Links to previous mailing list messages took the form of:
https://lists.orbitalfox.eu/archives/gemini/[YEAR]/[6 DIGIT NUMBER].html
I really wanted to be able to use the same numbering convention in my gemtext archive, so that I could rewrite the hyperlinks to point to the correct message in my archive. As I mentioned in a previous post, the six-digit number was *mostly* increasing, but there were some odd jumps. Message #125 would appear chronologically *before* message #124, and some numbers wouldn't be used at all.
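To make the link format concrete, here is a minimal Python sketch that pulls the year and message number out of one of those old archive links. The regex and the `parse_archive_url` name are my own illustration, not anything from Orbital Fox or Mailman, and it assumes the year is always four digits.

```python
import re

# Pattern for the old HTML archive links described above.
# Assumption: the year is four digits and the message number is exactly six.
ARCHIVE_URL = re.compile(
    r"https://lists\.orbitalfox\.eu/archives/gemini/(\d{4})/(\d{6})\.html"
)


def parse_archive_url(url):
    """Return (year, message_number) for an archive link, or None if it doesn't match."""
    match = ARCHIVE_URL.fullmatch(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

With the year and number in hand, rewriting a link is just a lookup, but only if the archive uses the same numbers as the original.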
There were enough oddities that, over the entire 7700+ messages, the numbers would be way off by the end. I couldn't figure out this crazy logic that seemed to create the number series, so I couldn't use the same numbers for my archive. Super bummer. I wrote about my frustrations:
Help wanted: Recovering the actual message numbers from the Mailing List archive
This generated some ideas from the community, but mostly it was stuff that I had already tried (don't order with timezones, try UTC, etc.), which didn't work. Besides, no one had any idea that would explain the missing numbers at all. So I was stuck.
And much as I tried to move on, I kept coming back, trying to figure out what was going on.
While hacking on this problem, I noticed something odd. Orbital Fox's 2019.mbox file, which you can download from the Wayback Machine, has 294 messages in it. But the saved HTML archive page only has 289 messages for all of 2019...
Archive 2019 mbox with 294 messages
Orbital Fox's HTML page for 2019 showing only 289 messages
Turns out I made 2 wrong assumptions. The first was that an email message would appear only once in an mbox file. It's not true! Looking at the 2019 mbox file, I found that it actually contains duplicate emails: entire emails which appear twice, verbatim, including headers like Message-ID.
How many duplicate messages? 5. And 294 messages - 5 duplicates = 289 messages. So the HTML view is showing the unique messages, which makes sense. So which messages appear multiple times in the mbox file? It turns out they are the same messages that show up out of order or next to gaps in the numbering!
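A count like this can be reproduced with Python's standard `mailbox` module. This is only a sketch of one way to do it, not the tooling I actually used. The function names are made up for illustration, and it assumes duplicates can be recognized by their Message-ID header alone.

```python
import mailbox
from collections import Counter


def message_ids(mbox_path):
    """Return the Message-ID of every message in the mbox, in file order."""
    box = mailbox.mbox(mbox_path)
    try:
        # A message without the header yields None here.
        return [msg["Message-ID"] for msg in box]
    finally:
        box.close()


def duplicate_summary(ids):
    """Return (total, unique, extra_copies) for a list of Message-IDs."""
    counts = Counter(ids)
    total = len(ids)
    unique = len(counts)
    # Assumption: two messages with the same Message-ID are the same email.
    return total, unique, total - unique
```

If the numbers above hold, running `duplicate_summary(message_ids("2019.mbox"))` on the original file should give 294 total, 289 unique, and 5 extra copies.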
Here is an example. The email from Jason McBrayer, sent on 2019-09-07 at 21:38:43 UTC, with Message-ID "878sqzrgdo.fsf@cassilda.carcosa.net" appears twice in the mbox file. The first time it appears, it is given the message number #121. However the same email appears again in the mbox file, as message number #125. According to the algorithm, we drop the first email and just use the message number (#125) from the second copy. This is why #121 is not used in the Orbital Fox archive, and message #125 appears immediately after #120.
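The numbering behavior in that example can be sketched in a few lines of Python. To be clear, this is my reconstruction of the observed behavior, not Mailman's actual code. The `assign_numbers` name and the `start` parameter are hypothetical, and listing messages in order of first appearance is an assumption.

```python
def assign_numbers(ids, start=1):
    """Map each Message-ID to the number of its LAST copy in the mbox.

    `ids` is the list of Message-IDs in file order. Every message in the
    file consumes a number, duplicates included.
    """
    numbers = {}
    for number, message_id in enumerate(ids, start=start):
        # Re-assigning an existing key keeps its original position in the
        # dict but overwrites the number with the later one, so the first
        # copy's number is never used.
        numbers[message_id] = number
    return numbers
```

Given six messages numbered from 120 where the second and sixth are the same email, the result lists that email right after #120 with the number 125, and 121 never appears.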
In retrospect, that seems pretty obvious. Why didn't I see these duplicate messages and their effect earlier? That was my second mistaken assumption. I assumed all the different mbox files that people had saved or made available contained the same messages!
Some of the mbox files you can find are not the original mbox files, or even the original mbox files all concatenated together into a single "complete" mbox file. Some of them were created by importing the original mbox files into some mail client and then exporting the messages out as a new mbox. Depending on the mail client, this process de-dups the messages. So, depending on the mbox file I was working with, I would have different message counts and order.
I was mostly working with an "all.mbox" file, which I assumed was the same as the individual mboxes. It was only when I switched to the original mbox files from Orbital Fox, while trying to troubleshoot the numbering, that the duplicates showed up.
In hindsight I probably could have avoided all of this by just looking more at the software Orbital Fox used to manage the mailing list and generate the HTML archive. It used GNU Mailman 2.x. By looking at the code, or even running the source mboxes through Mailman, I probably could have avoided all this work. But that would not have been as much fun.
Regardless of how I got here: