I'm having a lot of fun writing the email indexing program [1], despite having to code around a few broken mbox [2] files. I've also been surprised at what I've found so far (not in the “oh, I forgot about that email!” way but more in the “What the—?” way).
At first, I assumed that no email header would be longer than 64K (kilobytes) [3], but no, that isn't big enough. It turns out I have an email with a header that is 81,162 bytes in size, and it has enough email addresses (in the Cc: header) to populate a small mass-mailing list (and yes, it's spam).
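One way to avoid a fixed-size limit altogether is to collect header lines until the blank line that ends them. The following is only a minimal sketch of that idea, not the indexing program's actual code; the function name `read_header_block` and the sample addresses are made up for illustration.

```python
import io

def read_header_block(stream):
    """Collect raw header lines up to the first blank line, with no size cap."""
    lines = []
    for line in stream:
        if line in (b"\n", b"\r\n"):  # the blank line ends the header block
            break
        lines.append(line)
    return b"".join(lines)

# Build a folded Cc: header that is well past 64K (65,536 bytes).
addresses = [b"user%d@example.com" % i for i in range(5000)]
big_cc = b"Cc: " + b",\n\t".join(addresses) + b"\n"

msg = io.BytesIO(b"Subject: test\n" + big_cc + b"\nbody text\n")
header = read_header_block(msg)
rest = msg.read()  # whatever follows the blank line is the body
```

The list grows as needed, so an 81,162-byte header is handled the same way as a 300-byte one.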
I'm also tracking unique sets of headers and unique message bodies (via the SHA1 [4] hashing function). There are 118 messages with the same body but with different headers, and the amusing bit is that the emails in question weren't spam! They're from a mailing list I used to run years ago where one of the members apparently changed his email address, and for a period of time each message that went out caused his automated system to send an update to the list.
And of course, he didn't unsubscribe his old email address.
Heh.
The tracking was done to keep from indexing duplicate emails (my testing corpus is 1,600 mbox files, some of which may be backups; I don't know which ones, which is part of the reason I'm writing this program), so in the end I should end up with a set of unique headers.
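A rough sketch of hash-based duplicate tracking follows. It assumes a message counts as a duplicate only when both its header hash and its body hash have been seen together before; that pairing rule, and the names `sha1_hex` and `unique_messages`, are assumptions for illustration rather than the program's real logic.

```python
import hashlib

def sha1_hex(data):
    """Uppercase hex SHA1 digest, matching the style of the X-SHA1-* values."""
    return hashlib.sha1(data).hexdigest().upper()

def unique_messages(messages):
    """Keep the first copy of each (header, body) pair, judged by SHA1 digests."""
    seen = set()
    kept = []
    for header, body in messages:
        key = (sha1_hex(header), sha1_hex(body))
        if key in seen:
            continue  # exact duplicate, e.g. the same mbox backed up twice
        seen.add(key)
        kept.append((header, body))
    return kept

sample = [
    (b"H1\n", b"B1"),
    (b"H1\n", b"B1"),  # true duplicate, dropped
    (b"H1\n", b"B2"),  # same header, different body, kept
    (b"H2\n", b"B1"),  # same body, different header, kept
]
kept = unique_messages(sample)
```

Keeping the two digests separate is what makes it possible to notice cases where only one half of a message repeats.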
I got down to 16 emails with duplicate headers, but unique bodies.
That scared me.
A small digression: at this point, the program pulls each email out of the mbox file and writes the headers into one file (the originals, plus a few I add during processing, like the SHA1 hash results) and the body of the email into another file (my dad likes to send me photos and videos in email, so the bodies of those messages tend to be rather large, and I'm concentrating on the headers at the moment). I currently end up with about 50M (megabytes) of headers and almost a gigabyte's worth of email bodies. Now, continuing on …
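The header/body split described above could look something like the sketch below. The directory names `header_raw` and `body` and the nine-digit file names come from the transcripts in this post; everything else (the function name `store_message`, hashing the header text including its final newline, passing in a message without its mbox "From " line) is an assumption for the sake of the example.

```python
import hashlib
import os
import tempfile

def store_message(raw, number, root):
    """Split one message at the first blank line and write the two halves."""
    header, _, body = raw.partition(b"\n\n")
    header += b"\n"  # restore the newline that ended the last header line
    name = "%09d" % number
    extra = (
        "X-SHA1-Header: %s\nX-SHA1-Body: %s\n"
        % (
            hashlib.sha1(header).hexdigest().upper(),
            hashlib.sha1(body).hexdigest().upper(),
        )
    ).encode("ascii")
    # Headers (plus the added hash lines) go in one tree, bodies in another.
    for subdir, data in (("header_raw", header + extra), ("body", body)):
        os.makedirs(os.path.join(root, subdir), exist_ok=True)
        with open(os.path.join(root, subdir, name), "wb") as fh:
            fh.write(data)
    return name

with tempfile.TemporaryDirectory() as root:
    raw = b"Subject: hi\nFrom: a@example.com\n\nStatus: RO\n\nhello\n"
    name = store_message(raw, 8069, root)
    with open(os.path.join(root, "header_raw", name), "rb") as fh:
        stored_header = fh.read()
    with open(os.path.join(root, "body", name), "rb") as fh:
        stored_body = fh.read()
```

Anything after the first blank line, header-looking or not, lands in the body file.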
I pick one of the duplicate hashes, scan for it, and then check the messages:
```
>find header_raw/ | xargs grep FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
>grep X-SHA1 header_raw/000008069 header_raw/000026823
header_raw/000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000008069:X-SHA1-Body: 5C823DD92D3DCDC5AD43953D72B1D60017A134D6
header_raw/000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000026823:X-SHA1-Body: 85584F0167666BAA506E41A3D9ED927227F0FEF0
>
```
(Note: I can't just grep PATTERN * because there are simply too many files (over 45,000), and the expanded file list exceeds the command line length limit. That's why I use find and xargs: xargs splits the list across as many grep invocations as needed.)
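The same search can be done without building a command line at all, by walking the directory from a script. This is just a sketch of that alternative; `files_containing` is a hypothetical helper, and the demo builds its own throwaway directory rather than touching real data.

```python
import os
import tempfile

def files_containing(directory, needle):
    """Return sorted names of regular files in `directory` whose bytes contain `needle`."""
    hits = []
    with os.scandir(directory) as entries:  # streams entries, no argument list involved
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path, "rb") as fh:
                if needle in fh.read():  # header files are small enough to read whole
                    hits.append(entry.name)
    return sorted(hits)

def demo():
    digest = b"FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D"
    samples = {
        "000008069": b"X-SHA1-Header: " + digest + b"\n",
        "000026823": b"X-SHA1-Header: " + digest + b"\n",
        "000000001": b"X-SHA1-Header: 00\n",
    }
    with tempfile.TemporaryDirectory() as header_dir:
        for name, data in samples.items():
            with open(os.path.join(header_dir, name), "wb") as fh:
                fh.write(data)
        return files_containing(header_dir, digest)

matches = demo()
```

Because the directory is read one entry at a time, the number of files never runs into an argument-length limit.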
Okay, same headers, different body. Just what is going on here? I check the bodies:
```
>more body/000008069
Status: RO
Accept All Major Credit Cards!!!
Don't be fooled by the copycats. We are one of the original company's
offering merchant credit card services for all kinds of business's. [sic]
```
This isn't looking good—it looks like my header parsing code is missing a header. What about the other email?
```
>more body/000026823
Status: RO
Content-Length: 2815
Lines: 104
Accept All Major Credit Cards!!!
Don't be fooled by the copycats. We are one of the original company's
offering merchant credit card services for all kinds of business's. [sic]
```
Okay, check the mbox files to see what's messing up the header parsing. What I find actually reassures me:
```
From cherylg1582@msn.com Wed Dec 12 14:13:00 2001
Return-Path: <cherylg1582@msn.com>
Received: from gig.armigeron.com ([204.29.162.10])
        by conman.org (8.8.7/8.8.7) with ESMTP id OAA06543
        for <spc.wopr@conman.org>; Wed, 12 Dec 2001 14:12:59 -0500
Received: from mercury.aibusiness.net (emi.net [208.10.128.2]
        (may be forged))
        by gig.armigeron.com (8.11.0/8.11.0) with ESMTP id fBCJ8Aa31356
        for <spc@armigeron.com>; Wed, 12 Dec 2001 14:08:10 -0500
Received: from domainmail.ionet.net (domainmail.ionet.net [206.41.128.18])
        by mercury.aibusiness.net (8.9.3/8.9.3) with ESMTP id NAA19835
        for <spc@emi.net>; Wed, 12 Dec 2001 13:52:26 -0500
Received: from kqyfqkpby.motor.com (r145h250.afo.net [209.149.145.250]
        (may be forged))
        by domainmail.ionet.net (8.9.1a/8.7.3) with SMTP id MAA02841;
        Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Date: Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Message-Id: <200112121838.MAA02841@domainmail.ionet.net>
From: "griffin" <griffinfpzwrhlllngc@aol.com>
Subject: No fee! Accept Credit Cards for the Holidays! (bbjlm)
Reply-To: elicasabona1787@mailexcel.com
MIME-Version: 1.0
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD NSCPCD47 (Win98; I)
Content-Type: text/plain

Status: RO
Accept All Major Credit Cards!!!
```
It wasn't my code (thank God! The parsing code is getting a bit convoluted at this point), but some clueless spammer trying to add additional headers in the body of the message (the other one was the same). So I'll assume the other 14 “duplicates” are similar in nature—spammers trying to be clever.
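Checking the remaining "duplicates" by hand gets tedious, so a quick heuristic could flag bodies that open with header-looking lines. The sketch below is one guess at such a check, not something the indexing program does; `leading_pseudo_headers` is a hypothetical name, and the pattern (printable ASCII without a colon, then a colon and whitespace) is a simplification of real header syntax.

```python
import re

# A field name is printable ASCII other than ':'; require ": " or ":\t" after it.
HEADER_LINE = re.compile(rb"[!-9;-~]+:[ \t]")

def leading_pseudo_headers(body):
    """Return the header-looking lines found at the very start of a message body."""
    found = []
    for line in body.splitlines():
        if not HEADER_LINE.match(line):
            break  # first ordinary (or blank) line ends the run
        found.append(line.decode("ascii", "replace"))
    return found

first = leading_pseudo_headers(
    b"Status: RO\n\nAccept All Major Credit Cards!!!\n"
)
second = leading_pseudo_headers(
    b"Status: RO\nContent-Length: 2815\nLines: 104\n\nAccept All Major Credit Cards!!!\n"
)
plain = leading_pseudo_headers(b"Accept All Major Credit Cards!!!\n")
```

A non-empty result for both members of a pair would suggest the same stray-header pattern seen in the two messages above.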
And now, back to coding …
[2] http://en.wikipedia.org/wiki/Mbox
[3] http://imranontech.com/2007/02/20/did-bill-gates-say-the-640k-line/