Last month I switched email clients from Thunderbird [1] to mutt [2] (I found Thunderbird to be too sluggish but that's a story for another entry) and configured our primary email server to forward my mail directly to my workstation, where procmail [3] can then filter it.
So now I can burn through mail in about half the time it used to take me.
I get a ton of email, most of it from the various servers (from root mostly) and most of that is generated by the mail system itself, informing me that it's found, yet again, another email infected with a virus (oh, easily 500 a day) or it couldn't deliver a message (another 500 a day easy) or the multi-thousand line output of logwatch [4] (each easily 15,000 lines of summary per day).
So it was a simple matter to set up procmail to filter the messages (and, say, automatically delete the virus warnings; I tried turning that off on the servers themselves, but … well … control panels and hidden configuration files, and I'm stuck getting them even though I don't care for them). Now, since our mail goes through a dedicated spam filtering system that can mark emails as spam, I thought it would be a good idea to simply delete those upon receipt as well.
Only I kept receiving emails marked as spam.
```
31 N Dec 01 trespassers@gre ( 306) [SPAM] Breaking News
```
Puzzled, I moved the procmail recipe that deletes such marked spam:
```
:0:
* ^Subject: .*SPAM.*
in-TRASH
```
to the start of my .procmailrc, and yet I still got the emails. I bumped up the logging verbosity, and yes, some of it was actually being caught and trashed, but not all of it.
What the heck?
In mutt I see:
> **From:** <trespassers@greenoblivion.com [5]>
> **To:** <apache@XXXXXXXXXXX>
> **Subject:** [SPAM] Breaking News
> **Date:** Thu, 1 Dec 2005 22:49:10 +0200
But when I checked the actual raw email message …
> **From:** <trespassers@greenoblivion.com [6]>
> **To:** <apache@XXXXXXXXXXX>
> **Subject:** =?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?=
> **Date:** Thu, 1 Dec 2005 22:49:10 +0200
That funky subject line? A form of MIME (Multipurpose Internet Mail Extensions) encoding for email headers. In this case, the subject line uses the US-ASCII character set and is encoded as base-64 [7]. procmail knows nothing about MIME encodings. It's looking for “SPAM” in the subject line and not finding it.
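For illustration, here is a minimal Python sketch of what a MIME-aware filter would have to do before matching on the subject. It uses the standard library's `email.header.decode_header`; the function names `decode_subject` and `is_marked_spam` are made up for this example, and none of this is something procmail does on its own.

```python
from email.header import decode_header


def decode_subject(raw: str) -> str:
    """Decode MIME encoded-words (=?charset?B?...?=) in a raw header value."""
    parts = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset label: fall back to a byte-for-byte mapping.
                text = text.decode("latin-1")
        parts.append(text)
    return "".join(parts)


def is_marked_spam(raw_subject: str) -> bool:
    """True if the decoded subject carries the spam filter's [SPAM] tag."""
    return "[SPAM]" in decode_subject(raw_subject)


# The raw subject from the message above.
print(decode_subject("=?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?="))
print(is_marked_spam("=?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?="))
```

Run against the raw header above, this recovers the same `[SPAM]`-tagged subject that mutt displays, so a check on the decoded text catches both the plain and the encoded form.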
Well now …
Obviously, I can add
```
# Match the base-64 form of "[SPAM]" at the start of an encoded Subject
:0:
* ^Subject: =\?[^?]*\?B\?W1NQQU1d
in-TRASH
```
(“[SPAM]” encoded as base-64) to my .procmailrc file, but is there a better way?
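One reason to want something better: base-64 works on groups of three bytes, so the encoded form of "[SPAM]" depends on where the tag sits in the subject. A small Python sketch (the sample subjects and the helper name `contains_encoded_tag` are invented for this example) shows that the pattern only holds when the tag lands on a three-byte boundary, such as the very start of the encoded text:

```python
import base64

TAG = b"[SPAM]"

# Six bytes encode to exactly eight base-64 characters with no padding.
encoded_tag = base64.b64encode(TAG).decode("ascii")


def contains_encoded_tag(subject: bytes) -> bool:
    """Check whether the base-64 form of a subject contains the encoded tag."""
    return encoded_tag in base64.b64encode(subject).decode("ascii")


print(encoded_tag)
print(contains_encoded_tag(b"[SPAM]  Breaking News"))      # tag at byte offset 0
print(contains_encoded_tag(b"Re: [SPAM]  Breaking News"))  # tag at byte offset 4
```

The second subject is still tagged, but the four bytes of "Re: " shift the tag off the three-byte boundary and the encoded text no longer contains the same eight characters.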
Sure, Bayesian filtering [8] is pretty cool, but I still think that a few simple heuristics in place would help just as much.
One idea: check the character encoding of the incoming email. In my case, if it isn't US-ASCII, ISO-8859-1 or UTF-8 (oh, might as well include WINDOWS-1252 for those unfortunate friends that are abused by Microsoft), then discard it. It doesn't matter if it's legitimate email if I don't understand the language it's written in.
Now, with ISO-8859-1, UTF-8 or WINDOWS-1252, I still might not be able to read the message (since ISO-8859-1 and WINDOWS-1252 cover western European languages like French and German, and UTF-8 covers just about all written languages), but my second idea should take care of that.
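A rough sketch of the charset check in Python, using the standard `email` package. The allowed set is my assumption for the example (US-ASCII, ISO-8859-1, UTF-8, plus Windows-1252, the Microsoft Western European code page), and `has_acceptable_charsets` is a hypothetical name, not part of any existing filter.

```python
from email import message_from_string

# Assumed allow-list; charset labels are compared in lowercase.
ALLOWED_CHARSETS = {"us-ascii", "ascii", "iso-8859-1", "utf-8", "windows-1252"}


def has_acceptable_charsets(raw_message: str) -> bool:
    """True if every text part declares a charset on the allow-list."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text parts
        # A text part with no charset parameter defaults to US-ASCII.
        charset = part.get_content_charset("us-ascii")
        if charset not in ALLOWED_CHARSETS:
            return False
    return True
```

This only looks at what the message declares about itself; a message that lies about its charset would pass, which is where the next idea comes in.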
Second idea: spell check the incoming email.
No, seriously.
Take this bit of spam I received today:
**lt** is really hard to recollect a company: the market is full of **sugqestions** and the information is overwhelming; but A GOOD CATCHY LOGO, STYLISH STATlONERY and OUTSTANDING **WEBSIT E wilI** make the task much easier.
We do not promise that having ordered a **loqo** your company **wiIl automaticaIly** become a **worId Ieader**: it is quite clear that without good products **,effective** business **orqanization** and **practicable** aim it will be hot at nowadays market; but we do promise that your marketing efforts will become much more effective.
Twelve spelling errors (and one punctuation error, which I marked but am not counting in the following statistic) for a 14% spelling error rate. And if the email is in a different language, the spelling error rate will easily go past 95%. So, if the proportion of misspelled words exceeds, say, 70%, delete it, and if it's above, say, 5% (hey, we all make mistakes sometimes), mark it as possible spam.
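The scoring itself is simple enough to sketch in Python. The word list is assumed to come from somewhere else (a system dictionary file or a real spell checker), so here it is just a set passed in; `misspelling_rate` and `classify` are names invented for the example, and the 70% and 5% thresholds are the guesses from the paragraph above.

```python
import re

# Runs of letters, optionally with one internal apostrophe ("don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def misspelling_rate(text: str, known_words: set) -> float:
    """Fraction of words in text that are not in known_words."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word.lower() not in known_words)
    return unknown / len(words)


def classify(text: str, known_words: set,
             delete_above: float = 0.70, mark_above: float = 0.05) -> str:
    """Return 'delete', 'mark' (possible spam) or 'keep'."""
    rate = misspelling_rate(text, known_words)
    if rate > delete_above:
        return "delete"
    if rate > mark_above:
        return "mark"
    return "keep"


# Tiny stand-in dictionary, just for the demonstration.
known = {"we", "do", "not", "promise", "your", "company",
         "will", "become", "a", "world", "leader"}
print(classify("We do not promise your company wiIl become a worId Ieader", known))
```

With a real dictionary, the sample above would land in the "mark" band at 14%, and a message in a language the dictionary doesn't cover would fall straight into "delete".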
This would definitely piss off the V1@gr@ pushers.
Third idea: unless the sender is whitelisted, delete any email that contains an attachment of any type (well, for me at least).
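As a sketch, again in Python with the standard `email` package: treat a part as an attachment if it says so in `Content-Disposition` or carries a filename, and skip the check for whitelisted senders. The function name and the idea of a whitelist as a plain set of lowercase addresses are assumptions for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def should_delete_for_attachment(raw_message: str, whitelist: set) -> bool:
    """True if a non-whitelisted sender's message carries any attachment."""
    msg = message_from_string(raw_message)
    # Pull the bare address out of "Display Name <user@host>".
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in whitelist:
        return False
    return any(
        part.get_content_disposition() == "attachment" or part.get_filename()
        for part in msg.walk()
    )
```

The `From` header is trivially forged, so a whitelist like this is a convenience rather than a security measure.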
And this is before explicit filtering, Bayesian or otherwise.
I wonder just how hard something like that would be to write …
[1] http://www.mozilla.org/products/thunderbird/
[5] mailto:trespassers@greenoblivion.com
[6] mailto:trespassers@greenoblivion.com
[7] http://en.wikipedia.org/wiki/Base64