2004-11-14 Software

I’m unhappy with SpamStat... Maybe my training was wrong. I think I trained on some groups as “ham” where the occasional spam message had in fact gone unnoticed (high-traffic mailing lists I didn’t bother to check). I had hoped that spam-stat.el would be more effective. But I guess not. Then I tried using only a select few groups for “ham” training. A lot of messages from other, legitimate mailing lists ended up classified as spam, however. I’ve now decided to finally implement what PaulGraham suggests:

PaulGraham

To anyone who has worked on spam filters, this [ignoring message headers] will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn’t know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don’t ignore data. You’d think this lesson would be too obvious to mention, but I’ve had to learn it several times. ¹

So the version I had returned tokens such as “foo” (“foo” occurred in the body) and (“Subject” . “foo”) (“foo” occurred in the Subject header)... Paul Graham used “Subject*foo”. It turns out that Paul Graham’s method is much faster. I guess the consing slows it down a lot. OK, ready to test some more before checking in.
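For illustration, here is a rough sketch of the tagged-token idea, written in Python rather than Emacs Lisp. It is not the code in spam-stat.el: the function name `tokenize`, the word regex, and the default choice of headers are all assumptions made for this example. The point is only that a header word becomes a single string like “Subject*foo” instead of a pair, so it can be counted in the same table as body words.

```python
import re
from email import message_from_string

# Hypothetical word pattern; a real filter would likely be pickier.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(raw_message, headers=("Subject",)):
    """Return body words as-is and header words as 'Header*word' strings."""
    msg = message_from_string(raw_message)
    tokens = []
    for name in headers:
        # get_all returns every occurrence of the header, or [] if absent.
        for value in msg.get_all(name, []):
            tokens.extend(f"{name}*{word}" for word in TOKEN_RE.findall(value))
    body = msg.get_payload()
    # Only plain (non-multipart) bodies are handled in this sketch.
    if isinstance(body, str):
        tokens.extend(TOKEN_RE.findall(body))
    return tokens
```

With this, “foo” in the Subject line and “foo” in the body stay distinct tokens, but both are plain strings, so no extra pair has to be allocated for each header word.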

#Software