dbzip2 continues

Still fiddling with distributed bzip2 compression in the hopes of speeding up data dump generation.

My first proof of concept worked similarly to the pbzip2 I came across: input was broken into blocks, each block sent out to local (and remote) threads to be separately compressed as its own little bzip2 stream. The streams were then concatenated together in order.
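For illustration, the multi-stream approach boils down to something like this toy sketch, with Python's bz2 module and a local process pool standing in for the real local/remote worker plumbing (the block size here is just a placeholder):

```python
import bz2
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 900 * 1000  # placeholder; not how real bzip2 sizes its blocks


def compress_block(block):
    # Each input block becomes a complete, self-contained bzip2 stream.
    return bz2.compress(block, 9)


def multistream_compress(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ProcessPoolExecutor() as pool:
        streams = pool.map(compress_block, blocks)
    # Concatenating the streams in order yields a file the bzip2 utility
    # will happily decompress end to end.
    return b"".join(streams)
```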

This works surprisingly well, but has serious compatibility problems. While the standard bzip2 utility happily decompresses all the streams you care to place into one input file, other users of the library functions don’t expect this: the tools I needed to work with the dumps simply stopped after the first stream.

A more compatible implementation should produce a single stream with multiple data blocks in the expected sequence, but the file format doesn’t seem to be well documented.
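For reference, here's the stream layout as I currently understand it; these are working notes rather than a spec, so take the details with a grain of salt:

```python
# Rough layout of one bzip2 stream, as I currently understand it:
#
#   "BZh" + a digit '1'..'9'          stream header; digit * 100k = block size
#   repeated for each block:
#     0x314159265359  (48 bits)       block magic
#     32-bit CRC for the block's data
#     ...the compressed block itself, an arbitrary number of bits...
#   0x177245385090  (48 bits)         end-of-stream magic
#   32-bit combined CRC over all the block CRCs
#
STREAM_HEADER_PREFIX = b"BZh"
BLOCK_MAGIC = 0x314159265359
EOS_MAGIC = 0x177245385090
```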

In my research I came across another parallel bzip2 implementation, bzip2smp. Its author had also found pbzip2 and rejected it, preferring instead to hack the core bzip2 library enough to parallelize the slowest part of it, the Burrows-Wheeler transform. The author claims the output is bit-for-bit identical to the output of regular bzip2, which is obviously attractive.

I’m not 100% sure how easy or practical it would be to extend that to distributed use; the library code is scary and lots of parts are mixed together. Before diving in, I decided to start poking around at the file format and the compression steps to get a better idea of what happened where.

I’ve been putting together some notes on the format. The basic stream itself is slightly complicated by being a true bitstream — blocks in output are not aligned to byte boundaries!
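As a toy illustration of that, here's a naive (and slow) bit reader that hunts for the 48-bit block magic at arbitrary bit offsets rather than byte offsets; in a real stream only the first block tends to land on a byte boundary:

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit block header magic


def read_bits(data, bit_pos, nbits):
    """Read nbits from data starting at absolute bit offset bit_pos, MSB first."""
    value = 0
    for i in range(nbits):
        byte = data[(bit_pos + i) // 8]
        value = (value << 1) | ((byte >> (7 - (bit_pos + i) % 8)) & 1)
    return value


def find_block_starts(data):
    # Scan every bit offset for the block magic. Slow, and in principle the
    # pattern could also show up by chance inside compressed payload bits.
    total_bits = len(data) * 8
    for pos in range(total_bits - 47):
        if read_bits(data, pos, 48) == BLOCK_MAGIC:
            yield pos
```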

As a further proof of concept I’ve hacked up dbzip2 to combine the separately compressed blocks into a single stream, complete with a combined CRC at the end so bzip2 doesn’t barf and cut off the end of the data. This should be more compatible, but it’s not suitable for use right now: the bit-shifting is done in Python and way too slow. Additionally the input block cutting is probably not totally reliable, and certainly doesn’t produce the same size blocks as real bzip2.
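The combined CRC itself is the easy part: as far as I can tell, the stream-level value is just the per-block CRCs (already sitting in each block header) folded together with a rotate-and-XOR, so stitching separately compressed blocks into one stream means recomputing it along these lines:

```python
def combine_crcs(block_crcs):
    # Fold the per-block CRCs into the stream-level CRC, as I understand
    # the format: rotate the running value left by one bit, then XOR.
    combined = 0
    for crc in block_crcs:
        combined = (((combined << 1) | (combined >> 31)) & 0xFFFFFFFF) ^ crc
    return combined
```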

Replicating the RLE performed on input could get the blocks the same size, but at that point it might make more sense to use the actual library (or a serious hack of it) instead of the scary Python wrapper I’ve got now.
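As I understand it, that first-stage RLE turns any run of four or more identical bytes into the first four bytes followed by a count of the extras (runs cap out at 255 bytes), and the nominal block size limit applies to the RLE'd data, which is why naive byte-count splitting doesn't line up with real bzip2. A rough sketch of that step:

```python
def rle1(data):
    # bzip2's first-stage RLE, as I understand it: runs of 4..255 identical
    # bytes become the byte repeated four times plus a count byte (length - 4).
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        run = 1
        while i + run < n and run < 255 and data[i + run] == b:
            run += 1
        if run >= 4:
            out += bytes([b]) * 4
            out.append(run - 4)
        else:
            out += bytes([b]) * run
        i += run
    return bytes(out)
```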