My stats program for The Boston Diaries [1] basically consists of a shell script that calls a custom program (in C) to print out only certain fields from the website logfile, which is then fed into a pipeline of some twenty invocations of grep. It looks like this:
```
cat logfile | \
escanlog -status 200 -host -command -agent | \
grep Mozilla | \
grep -v 'Slurp/cat' | \
grep -v 'ZyBorg' | \
grep -v 'bdstyle.css' | \
grep -v 'screen.css' | \
grep -v '^12.148.209.196'| \
grep -v '^4.64.202.64' | \
grep -v '^213.60.99.73' | \
grep -v 'Ask Jeeves' | \
grep -v 'rfp@gmx.net' | \
grep -v '"Mozilla"' | \
grep -v 'Mozilla/4.5' | \
grep -v '.gif ' | \
grep -v '.png ' | \
grep -v '.jpg ' | \
grep -v 'bostondiaries.rss' | \
grep -v 'bd.rss' | \
grep -v 'favicon.ico' | \
grep -v 'robots.txt' | \
grep -v $HOMEIP
```
It's serviceable, but it does filter out Lynx [2] and possibly Opera [3] users, since I filter for Mozilla and then reject what I don't want. Twenty greps is pretty harsh, especially on my server [4]. And given that more and more robots seem to be hiding themselves [5], the list of exclusions will only get longer and longer.
I think that at this point, a custom program would be much better.
So I wrote one. In C. Why not Perl? Well, I don't know Perl, and I have all the code I really need in C already; there's even a regex library installed on the system that I can call. That, mixed with the code I already have to parse a website log file and an extensive library of C code to handle higher-level data structures, meant it wouldn't take me all that long to write the program I wanted.
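The regex library in question is just the standard POSIX one (regex.h), so the heart of the matching comes down to regcomp() and regexec(). Here's a minimal sketch of the idea; the rule structure and the sample patterns are only placeholders for illustration, not the actual program:

```
#include <stdio.h>
#include <regex.h>

/* one filter rule: reject any record whose field matches the pattern */
struct rule
{
  regex_t pattern;
};

/* compile the pattern once, up front */
int rule_init(struct rule *r,const char *pattern)
{
  return regcomp(&r->pattern,pattern,REG_EXTENDED | REG_NOSUB);
}

/* return 1 if the field matches, meaning the record should be rejected */
int rule_match(struct rule *r,const char *field)
{
  return regexec(&r->pattern,field,0,NULL,0) == 0;
}

int main(void)
{
  struct rule robots;

  if (rule_init(&robots,"Slurp/cat|ZyBorg|Ask Jeeves") != 0)
  {
    fprintf(stderr,"bad pattern\n");
    return 1;
  }

  if (rule_match(&robots,"Mozilla/5.0 (Slurp/cat; slurp@inktomi.com)"))
    printf("rejected\n");

  regfree(&robots.pattern);
  return 0;
}
```

Compile each pattern once when the configuration file is read, then run every log record through the compiled rules; that's far cheaper than spawning twenty greps per run.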
First, start out with a list of rules:
```
# configuration file for filtering web log files
reject host '^XXXXXXXXXXXXX