Compression fever

It's one thing to trick someone on April Fools' Day [1], but to trick yourself really takes talent.

I have that talent.

Sigh.

I started this last week and didn't suspect a thing.

I came across The Hutter Prize [2], which will pay up to 50,000€ to compress a 100,000,000 byte file to less than 16,481,655 bytes (which includes the decompression program). The rules [3] are straightforward, and to me, it certainly seems easier than The Netflix Prize [4].

At the very least, it couldn't hurt to play around a bit.

So last Thursday I downloaded the file to be compressed and played around with it. The file itself is nothing more than Wikipedia entries wrapped in XML (eXtensible Markup Language). The XML only makes up 2.4% of the file—the rest is text (13.5% is nothing but the space character; 8% is the letter “e”).

Friday I started coding. Nothing difficult and it was pretty straightforward (although while the algorithm for Huffman encoding [5] is pretty simple, writing it was surprisingly difficult; it took about five attempts before I was happy with the code). I did a few benchmarks against gzip and bzip2 and was surprised with the results:

Table: My Sooperseekrit compression algorithm vs. gzip and bzip2
Program	Size of archive
------------------------------
orginal file	100,000,000
gzip	36,445,248
bzip2	29,008,736
my attempt	19,774,963

While not less than the required 16,481,655 bytes, it is in the ballpark, and I was quite surprised to beat the snot out of bzip2.

Not bad for a few hours of work.

So today, I'm hacking around with my program when I notice something odd—some of the Huffman encodings are the same.

That's not a good sign—each encoding should be unique. It takes a while (it takes almost 7½ minutes to compress 100,000,000 bytes with my program) but I find the problem—a bug in the encoding routine. Building the Huffman encoding table was fine—it was reading the encoded values that was buggy (and dropping bits).

How buggy?

Table: My fixed Sooperseekrit compression algorithm vs. gzip and bzip2
Program	Size of archive
------------------------------
orginal file	100,000,000
gzip	36,445,248
bzip2	29,008,736
my attempt	35,947,593

Um … yeah.

It's not the first time my hopes for vast fortunes were dashed because of a bug fix [6].

Sigh.

[1] http://en.wikipedia.org/wiki/April_Fools'_Day

[2] http://prize.hutter1.net/

[3] http://prize.hutter1.net/hrules.htm

[4] http://www.netflixprize.com/index

[5] http://en.wikipedia.org/wiki/Huffman_encoding

[6] /boston/2005/06/15.1

Gemini Mention this post

Contact the author