dbzip2 vincit

I’ve managed to bang my dbzip2 prototype into a pretty decent state now, rewriting some of the lower-level bitstream code as a C module while keeping the high-level bits in Python.
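To give a flavor of what that low-level code has to do: bzip2 blocks aren't byte-aligned, so splicing one block after another means re-shifting every byte of the payload to the current bit offset. Here's a rough pure-Python sketch of that kind of bit-shifted append; the BitWriter name and interface are illustrative, not dbzip2's actual API.

    class BitWriter:
        """Accumulates a bit stream, allowing appends at any bit offset."""

        def __init__(self):
            self.buf = bytearray()
            self.bitpos = 0  # bits used in the final byte (0..7)

        def write_bits(self, value, nbits):
            """Append the low nbits of value, most significant bit first."""
            for i in range(nbits - 1, -1, -1):
                if self.bitpos == 0:
                    self.buf.append(0)
                self.buf[-1] |= ((value >> i) & 1) << (7 - self.bitpos)
                self.bitpos = (self.bitpos + 1) % 8

        def write_bytes(self, data):
            """Append whole bytes, honoring the current (possibly odd) offset."""
            for byte in data:
                self.write_bits(byte, 8)

        def getvalue(self):
            return bytes(self.buf)

Doing this byte-at-a-time in pure Python is exactly the sort of inner loop that's worth pushing down into C.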

It divides the input into properly sized blocks and combines the compressed output blocks into a single stream, achieving bit-for-bit compatibility with single-threaded standard bzip2. While it's still slower than bzip2smp for local threads, I was quite pleased to find that it scales well enough across multiple remote threads to really look worth it.
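Here's a rough sketch of the high-level shape of such a pipeline, using Python's bz2 and concurrent.futures as stand-ins (a simplification, not dbzip2's actual code). Note the shortcut it takes: simply concatenating whole .bz2 streams produces a valid multi-stream file that bunzip2 will happily decompress, but it is not bit-for-bit identical to single-threaded output. For that, the combiner has to splice the block bitstreams together with the bit-level code above and recompute the combined stream CRC (and the chunk boundaries have to match what single-threaded bzip2 would pick, which bzip2's initial run-length encoding can shift).

    import bz2
    import sys
    from concurrent.futures import ThreadPoolExecutor

    BLOCK_SIZE = 900_000  # bzip2's maximum block size (compression level 9)

    def compress_block(chunk):
        # CPython's bz2 releases the GIL while compressing, so plain
        # threads get real parallelism here.
        return bz2.compress(chunk, 9)

    def parallel_bzip2(infile, outfile, workers=4):
        def chunks():
            while True:
                chunk = infile.read(BLOCK_SIZE)
                if not chunk:
                    break
                yield chunk
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() keeps results in input order, so blocks stay in sequence.
            # (It also reads ahead eagerly; a real tool would throttle input.)
            for compressed in pool.map(compress_block, chunks()):
                outfile.write(compressed)

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f, open(sys.argv[2], "wb") as out:
            parallel_bzip2(f, out)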

These tests used Wikimedia’s database servers: beefy Opteron boxes with gigabit Ethernet and usually a lot of idle CPU cycles while they wait on local disk I/O.

The peak throughput in my initial multiple-server tests was about 24 megabytes per second with 10 remote threads, and I was able to get over 19 MB/s on my full one-gigabyte test file, compressing it in under a minute. With some further work and better stability, this could be really helpful in getting the big data dumps out faster.

Next step: parallel decompression…?

One thought on “dbzip2 vincit”

  1. Currently, http://svn.wikimedia.org/svnroot/mediawiki/trunk/dbzip2/README states that decompression in dbzip2 is experimental and only works on single-block files. Knowing full well that it’s not very tasteful to advocate one’s own software in a comment on somebody else’s blog, I’ll still recommend lbzip2, which tries hard to decompress “traditional” bz2 files read from non-seekable input (a pipe, SOCK_STREAM, etc.) with an I/O-bound splitter (sketched below). See my homepage or http://freshmeat.net/projects/lbzip2/ if you like.
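The “splitter” mentioned above boils down to scanning the compressed stream for bzip2's 48-bit block-header magic number (0x314159265359) at every bit offset, since block boundaries aren't byte-aligned. Here's an illustrative sketch of that scan in Python (not lbzip2's actual code, which is C and considerably more careful):

    BLOCK_MAGIC = 0x314159265359  # 48-bit block-header magic (BCD digits of pi)

    def find_block_starts(data):
        """Yield bit offsets in data where a block header may begin."""
        window = 0
        mask = (1 << 48) - 1
        bits_seen = 0
        for byte in data:
            for i in range(7, -1, -1):  # MSB-first, matching bzip2's bit order
                window = ((window << 1) | ((byte >> i) & 1)) & mask
                bits_seen += 1
                if bits_seen >= 48 and window == BLOCK_MAGIC:
                    # Candidate only: the pattern can also occur by chance
                    # inside compressed data, so a real splitter must verify
                    # each hit (e.g. by decoding the block).
                    yield bits_seen - 48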
