dbzip2 vincit

I’ve managed to bang my dbzip2 prototype into a pretty decent state now, rewriting some of the lower-level bitstream code as a C module while keeping the high-level bits in Python.

It divides up input into proper-sized blocks and combines output blocks into a single output stream, achieving bit-for-bit compatibility with single-threaded standard bzip2. While still slower than bzip2smp for local threads, I was quite pleased to find it scales to multiple remote threads well enough to really look worth it:

This was using Wikimedia’s database servers; beefy Opteron boxes with gigabit ethernet and usually a lot of idle CPU cycles while they wait on local disk I/O.

The peak throughput on my initial multiple-server tests was about 24 megabytes per second with 10 remote threads, and I was able to get over 19 megs/sec on my full gigabyte test file, compressing it in under a minute. With some further work and better stability, this could be really helpful in getting the big data dumps going faster.

Next step: parallel decompression…?

dbzip2 continues

Still fiddling with distributed bzip2 compression in the hopes of speeding up data dump generation.

My first proof of concept worked similarly to the pbzip2 I came across: input was broken into blocks, each block sent out to local (and remote) threads to be separately compressed as its own little bzip2 stream. The streams were then concatenated together in order.

This works surprisingly well, but has serious compatibility problems. While the standard bzip2 utility happily decompresses all the streams you care to place into one input file, other users of the library functions don’t expect this: the tools I needed to work with the dumps would end with the first stream.

A more compatible implementation should produce a single stream with multiple data blocks in the expected sequence, but the file format doesn’t seem to be well documented.

In my research I came across another parallel bzip2 implementation, bzip2smp. Its author had also found pbzip2 and rejected it, preferring instead to hack the core bzip2 library enough to parallelize the slowest part of it, the Burrows-Wheeler transform. The author claims the output is bit-for-bit identical to the output of regular bzip2, which is obviously attractive.

I’m not 100% sure how easy or practical it would be to extend that to distributed use; the library code is scary and lots of parts are mixed together. Before diving in, I decided to start poking around at the file format and the compression steps to get a better idea of what happened where.

I’ve been putting together some notes on the format. The basic stream itself is slightly complicated by being a true bitstream — blocks in output are not aligned to byte boundaries!

As a further proof of concept I’ve hacked up dbzip2 to combine the separately compressed blocks into a single stream, complete with a combined CRC at the end so bzip2 doesn’t barf and cut off the end of the data. This should be more compatible, but it’s not suitable for use right now: the bit-shifting is done in Python and way too slow. Additionally the input block cutting is probably not totally reliable, and certainly doesn’t produce the same size blocks as real bzip2.

Replicating the RLE performed on input could get the blocks the same size, but at that point it might start to make sense to start using the actual library (or a serious hack of it) instead of the scary Python wrapper I’ve got now.

dbzip2

bzip2 is hideously slow; while looking around for ways to possibly speed it up I stumbled on pbzip2, which exploits the block-based nature of bzip2 compression to parallelize it for potentially linear speedups with an increased number of CPUs.

There are two downsides: first, there are some compatibility issues with third-party decompressor software (probably resolvable), and second it only scales until you run out of local CPUs.

For kicks I'm prototyping a distributed version, dbzip2. In theory, a lot of fast machines on a LAN should be able to speed things up even further.

I've been testing with a Wikipedia dump file that's about a gigabyte, compressing down to about a quarter that.

On my dual-core G5 Power Mac, regular single-threaded bzip2 compresses the file in about 356 seconds (3.122 MB/s input processing).

pbzip2 set for two threads runs about 50% faster; I clocked it at 234 seconds (4.750 MB/s).

My dbzip2 prototype does similar with two local threads, though a touch less efficiently:

Wrote 1295 blocks in 250.9 seconds (4.430 MB/s in, 0.988 MB/s out)

And I can squeeze a few more percentage points out by using remote compressors on the other machines I had lying around:

Wrote 1295 blocks in 188.8 seconds (5.887 MB/s in, 1.313 MB/s out)

The breakdown

Most of the data went to the two local threads on my Power Mac:
local thread: processed 447 blocks in 188.7 seconds (2.033 MB/s in, 0.458 MB/s out)
local thread: processed 444 blocks in 188.7 seconds (2.019 MB/s in, 0.451 MB/s out)

My old Linux box, an Athlon XP 2400+ on 100 Mbit ethernet, took a respectable share:
rdaneel:12345: processed 237 blocks in 188.7 seconds (1.078 MB/s in, 0.238 MB/s out)

CPU usage was pretty high though not maxed out (80-90%), likely due to the tenth of a second spent transferring each 900 KB input block and its compressed output over a 100 Mbit network.

Running the whole file locally it can process 1.344 MB/s.

My old Powerbook (1 GHz G4) and newer Mini (1.5 GHz Core Solo) sit on wireless, which is a lot slower:
verda-majo.local:12345: processed 61 blocks in 188.7 seconds (0.277 MB/s in, 0.060 MB/s out)
philo.local:12345: processed 106 blocks in 188.7 seconds (0.482 MB/s in, 0.106 MB/s out)

These two were clearly limited by the slower network, with CPU usage around just 20-30%. Still, every bit helps!

Conclusions

This shows promise for quickly compressing large files when fast machines are available on a fast network. Slow networks strongly reduce the benefits, but on switched gigabit ethernet things should be nicely CPU-limited even with several fast machines.

The main remaining issues:

  • Whether the bzip2 streams can be stitched together for improved compatibility with decompressor apps
  • Whether similar distribution can be applied to 7zip