dbzip2 production testing

The English Wikipedia full-history data dump (my arch-nemesis) died again while building due to a database disconnection. I’ve taken the opportunity to clean up dbzip2 a little more and restart the dump build using it.

The client now handles server connections dropping out, and can even reconnect when they come back, so it should be relatively safe for a long-running process. The remote daemon also daemonizes properly, instead of leaving zombies and breaking your terminal.
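For the curious, the reconnect behaviour boils down to something like the sketch below. This is not the actual dbzip2 client code: `compress_with_retry`, the `connect` callback and the `FlakyServer` stand-in are all hypothetical names made up for illustration, and a real client would also want to hand the block to another server rather than just wait.

```python
import bz2
import time

def compress_with_retry(block, connect, attempts=5, delay=0.5):
    """Send one block to a remote compressor, reconnecting if the link drops.

    `connect` is a hypothetical callable returning an object with a
    compress(block) method; it is called again after every failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()
            return conn.compress(block)
        except OSError as err:  # ConnectionError is a subclass of OSError
            last_error = err
            time.sleep(delay * (attempt + 1))  # wait a bit longer each time
    raise last_error

class FlakyServer:
    """Stand-in for a remote daemon: drops the first two requests."""
    def __init__(self):
        self.calls = 0

    def compress(self, block):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionResetError("server went away")
        return bz2.compress(block)

server = FlakyServer()
result = compress_with_retry(b"hello", lambda: server, delay=0)
```

The third attempt succeeds here, so `result` holds a normal bzip2 stream for `b"hello"`.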

Using six remote dbzip2d threads, and the faster 7zip decompression for the data prefetch, I’m averaging about 6.5 megabytes per second of (pre-compression XML) throughput, peaking around 11 MB/sec. That’s a big improvement over what I was measuring with the local threads, by a factor of 5 or so. If this holds up, it should actually complete in “just” two or three days…

Of course that’s assuming the database connection doesn’t drop again! Another thing to improve…

Video crap

A while ago I picked up Motion 2 on a lark to replace the ancient copy of After Effects I occasionally used to do little animation bits. I finally escaped the wiki for a couple of hours and got a chance to play with it some more:

Oh my goodness!
1.2MB Ogg Theora (640×360, no sound) (download)

The particle effects are yummy… 720p also fits nicely on my screen while editing. Did a bit in Blender as well; there’s a nice tutorial on Wikibooks (damn, that brings me right back to wiki!)

Stumbled on this while searching for Theora transcoding recommendations.

dbzip2 vincit

I’ve managed to bang my dbzip2 prototype into a pretty decent state now, rewriting some of the lower-level bitstream code as a C module while keeping the high-level bits in Python.

It divides up input into proper-sized blocks and combines output blocks into a single output stream, achieving bit-for-bit compatibility with single-threaded standard bzip2. While it’s still slower than bzip2smp for local threads, I was quite pleased to find that it scales to multiple remote threads well enough to look really worthwhile.

This was using Wikimedia’s database servers: beefy Opteron boxes with gigabit ethernet and usually a lot of idle CPU cycles while they wait on local disk I/O.

The peak throughput on my initial multiple-server tests was about 24 megabytes per second with 10 remote threads, and I was able to get over 19 megs/sec on my full gigabyte test file, compressing it in under a minute. With some further work and better stability, this could be really helpful in getting the big data dumps going faster.

Next step: parallel decompression…?

dbzip2 continues

Still fiddling with distributed bzip2 compression in the hopes of speeding up data dump generation.

My first proof of concept worked similarly to the pbzip2 I came across: input was broken into blocks, each block sent out to local (and remote) threads to be separately compressed as its own little bzip2 stream. The streams were then concatenated together in order.
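A minimal local-only sketch of that chunk-and-concatenate approach, using Python’s standard `bz2` module and a thread pool. The 900,000-byte chunk size and the names `compress_chunked` and `BLOCK_SIZE` are assumptions for illustration; the real prototype also farmed chunks out over the network, which is left out here.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 900_000  # assumed chunk size, roughly one bzip2 -9 block

def compress_chunked(data, workers=2):
    """Compress each chunk as its own bzip2 stream and join them in order."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order they finish in
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)

# About 3 MB of made-up sample text, so the input spans four chunks
sample = b"".join(b"revision %d\n" % i for i in range(200_000))
packed = compress_chunked(sample)
restored = bz2.decompress(packed)  # this helper reads every stream in turn
```

Threads are enough for a sketch like this because the `bz2` module releases the interpreter lock while compressing.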

This works surprisingly well, but has serious compatibility problems. While the standard bzip2 utility happily decompresses all the streams you care to place into one input file, other users of the library functions don’t expect this: the tools I needed to work with the dumps would stop after the first stream.
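The problem is easy to reproduce with Python’s `bz2` module, which I’m using here only as a convenient stand-in for “other users of the library functions”. The module-level `bz2.decompress()` follows every stream, while a single `BZ2Decompressor` object stops at the end of the first one:

```python
import bz2

first = b"first stream\n"
second = b"second stream\n"
joined = bz2.compress(first) + bz2.compress(second)

# The convenience function keeps going across stream boundaries
everything = bz2.decompress(joined)

# A lone decompressor object stops at the first end-of-stream marker
single = bz2.BZ2Decompressor()
partial = single.decompress(joined)
leftover = single.unused_data  # the whole second stream, still compressed
```

After this, `partial` holds only the first chunk and `single.eof` is true; a tool that never looks at `unused_data` silently drops everything else.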

A more compatible implementation should produce a single stream with multiple data blocks in the expected sequence, but the file format doesn’t seem to be well documented.

In my research I came across another parallel bzip2 implementation, bzip2smp. Its author had also found pbzip2 and rejected it, preferring instead to hack the core bzip2 library enough to parallelize the slowest part of it, the Burrows-Wheeler transform. The author claims the output is bit-for-bit identical to the output of regular bzip2, which is obviously attractive.

I’m not 100% sure how easy or practical it would be to extend that to distributed use; the library code is scary and lots of parts are mixed together. Before diving in, I decided to start poking around at the file format and the compression steps to get a better idea of what happened where.

I’ve been putting together some notes on the format. The basic stream itself is slightly complicated by being a true bitstream — blocks in output are not aligned to byte boundaries!
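To see the misalignment for yourself, the sketch below scans a compressed stream for the 48-bit block marker (0x314159265359, as far as I can tell from the reference implementation) at every bit position. The helper name and the sample data are made up for illustration, and a marker could in principle turn up by chance inside compressed data, though that is wildly unlikely at this size.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that starts every block

def block_bit_offsets(stream):
    """Return the bit offset of each block marker in a bzip2 stream."""
    # One character per bit, most significant bit first, as bzip2 writes them
    bits = bin(int.from_bytes(stream, "big"))[2:].zfill(len(stream) * 8)
    needle = format(BLOCK_MAGIC, "048b")
    offsets = []
    pos = bits.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = bits.find(needle, pos + 1)
    return offsets

# Level 1 means 100 KB blocks, so about half a megabyte gives several blocks
sample = b"".join(b"line %d\n" % i for i in range(50_000))
stream = bz2.compress(sample, compresslevel=1)
offsets = block_bit_offsets(stream)
print([(off, off % 8) for off in offsets])  # (bit offset, bits past a byte boundary)
```

The first block always lands at bit 32, right after the four-byte "BZh1" header; the later ones fall wherever the previous block happened to end.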

As a further proof of concept I’ve hacked up dbzip2 to combine the separately compressed blocks into a single stream, complete with a combined CRC at the end so bzip2 doesn’t barf and cut off the end of the data. This should be more compatible, but it’s not suitable for use right now: the bit-shifting is done in Python and way too slow. Additionally the input block cutting is probably not totally reliable, and certainly doesn’t produce the same size blocks as real bzip2.
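The combined CRC is the easy part. As I read the reference implementation, the stream CRC is built by rotating the running value left one bit and XORing in each block’s CRC. The sketch below folds the per-block CRCs that way and checks the result against a stream produced by the real library. The helper names are mine, and the marker scan is a shortcut rather than a proper parser.

```python
import bz2

BLOCK_MAGIC = format(0x314159265359, "048b")  # start-of-block marker
END_MAGIC = format(0x177245385090, "048b")    # end-of-stream marker

def combine_crcs(block_crcs):
    """Fold per-block CRCs into the stream CRC: rotate left by one, then XOR."""
    combined = 0
    for crc in block_crcs:
        combined = ((combined << 1) | (combined >> 31)) & 0xFFFFFFFF
        combined ^= crc
    return combined

def read_crcs(stream):
    """Return (block CRCs, stored stream CRC) by scanning for the markers."""
    bits = bin(int.from_bytes(stream, "big"))[2:].zfill(len(stream) * 8)
    block_crcs = []
    pos = bits.find(BLOCK_MAGIC)
    while pos != -1:
        # The 32 bits after each block marker are that block's CRC
        block_crcs.append(int(bits[pos + 48:pos + 80], 2))
        pos = bits.find(BLOCK_MAGIC, pos + 1)
    end = bits.rfind(END_MAGIC)
    stored = int(bits[end + 48:end + 80], 2)
    return block_crcs, stored

sample = b"".join(b"line %d\n" % i for i in range(50_000))
stream = bz2.compress(sample, compresslevel=1)  # 100 KB blocks, several of them
block_crcs, stored = read_crcs(stream)
recomputed = combine_crcs(block_crcs)
```

If the folding rule is right, `recomputed` equals `stored`; that is the value a stitched-together stream needs at its end.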

Replicating the RLE performed on input could get the blocks the same size, but at that point it might start to make sense to start using the actual library (or a serious hack of it) instead of the scary Python wrapper I’ve got now.
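For reference, here is my understanding of that first RLE pass as a sketch: any run of 4 to 255 identical bytes becomes the first four bytes plus one count byte holding the number of extra repeats (0 to 251). The function name is made up, and this only covers the run-length step, not how the library decides a block is full.

```python
def rle1(data):
    """Sketch of bzip2's initial run-length pass over raw input bytes."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        # Measure the run starting at i, capped at 255 bytes
        run = 1
        while i + run < n and run < 255 and data[i + run] == data[i]:
            run += 1
        if run >= 4:
            out += data[i:i + 4]   # four literal copies...
            out.append(run - 4)    # ...then the count of extra repeats
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)

short_run = rle1(b"aaaa")      # b"aaaa\x00": a run of exactly 4 grows by a byte
long_run = rle1(b"a" * 300)    # split into runs of 255 and 45
```

Because a block is cut by its size after this pass, two inputs of equal length can fill a block at different points, which is why naive fixed-size cutting doesn’t match real bzip2.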


bzip2 is hideously slow; while looking around for ways to possibly speed it up I stumbled on pbzip2, which exploits the block-based nature of bzip2 compression to parallelize it for potentially linear speedups with an increased number of CPUs.

There are two downsides: first, there are some compatibility issues with third-party decompressor software (probably resolvable), and second, it only scales until you run out of local CPUs.

For kicks I'm prototyping a distributed version, dbzip2. In theory, a lot of fast machines on a LAN should be able to speed things up even further.

I've been testing with a Wikipedia dump file that's about a gigabyte, compressing down to about a quarter that.

On my dual-core G5 Power Mac, regular single-threaded bzip2 compresses the file in about 356 seconds (3.122 MB/s input processing).

pbzip2 set for two threads runs about 50% faster; I clocked it at 234 seconds (4.750 MB/s).

My dbzip2 prototype does about as well with two local threads, though a touch less efficiently:

Wrote 1295 blocks in 250.9 seconds (4.430 MB/s in, 0.988 MB/s out)

And I can get about a third more throughput by adding remote compressors on the other machines I had lying around:

Wrote 1295 blocks in 188.8 seconds (5.887 MB/s in, 1.313 MB/s out)
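The input rates in those log lines can be reproduced from the block counts, which makes a handy sanity check. This assumes every block is a full 900,000 bytes (the last one is probably short) and that "MB" means 1024 × 1024 bytes; both are guesses that happen to match the printed figures.

```python
BLOCK_BYTES = 900_000   # assumed input block size
MIB = 1024 * 1024       # assumed meaning of "MB" in the log output

def input_rate(blocks, seconds):
    """Input throughput in MB/s for a run of full-sized blocks."""
    return blocks * BLOCK_BYTES / MIB / seconds

two_local = input_rate(1295, 250.9)    # two local threads
with_remote = input_rate(1295, 188.8)  # local threads plus remote helpers
speedup = with_remote / two_local

print(f"{two_local:.3f} MB/s -> {with_remote:.3f} MB/s ({speedup:.2f}x)")
```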

The breakdown

Most of the data went to the two local threads on my Power Mac:
local thread: processed 447 blocks in 188.7 seconds (2.033 MB/s in, 0.458 MB/s out)
local thread: processed 444 blocks in 188.7 seconds (2.019 MB/s in, 0.451 MB/s out)

My old Linux box, an Athlon XP 2400+ on 100 Mbit ethernet, took a respectable share:
rdaneel:12345: processed 237 blocks in 188.7 seconds (1.078 MB/s in, 0.238 MB/s out)

CPU usage was pretty high though not maxed out (80-90%), likely due to the tenth of a second spent transferring each 900 KB input block and its compressed output over a 100 Mbit network.

Running the whole file locally, that box can process 1.344 MB/s.
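A back-of-the-envelope check on the transfer cost, using the figures from the rdaneel log line. It assumes a nominal 100 Mbit/s with no protocol overhead, and that the worker sits idle while data is in flight; neither is measured, so treat the result as a rough estimate.

```python
BLOCK_BYTES = 900_000        # input block size
LINK_BITS_PER_SEC = 100e6    # nominal 100 Mbit ethernet, overhead ignored

blocks, seconds = 237, 188.7       # from the rdaneel log line
rate_in, rate_out = 1.078, 0.238   # MB/s in and out, same line
local_rate = 1.344                 # MB/s when the box compresses locally

# Average compressed size of a block, from the in/out ratio
compressed_bytes = BLOCK_BYTES * rate_out / rate_in
transfer_s = (BLOCK_BYTES + compressed_bytes) * 8 / LINK_BITS_PER_SEC
per_block_s = seconds / blocks
network_share = transfer_s / per_block_s
vs_local = rate_in / local_rate

print(f"transfer {transfer_s:.3f}s of {per_block_s:.3f}s per block "
      f"({network_share:.0%}); {vs_local:.0%} of the local rate")
```

That comes to a bit under a tenth of a second per block, around a ninth of the wall time, which fits the 80-90% CPU usage reasonably well.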

My old Powerbook (1 GHz G4) and newer Mini (1.5 GHz Core Solo) sit on wireless, which is a lot slower:
verda-majo.local:12345: processed 61 blocks in 188.7 seconds (0.277 MB/s in, 0.060 MB/s out)
philo.local:12345: processed 106 blocks in 188.7 seconds (0.482 MB/s in, 0.106 MB/s out)

These two were clearly limited by the slower network, with CPU usage around just 20-30%. Still, every bit helps!


This shows promise for quickly compressing large files when fast machines are available on a fast network. Slow networks strongly reduce the benefits, but on switched gigabit ethernet things should be nicely CPU-limited even with several fast machines.

The main remaining issues:

  • Whether the bzip2 streams can be stitched together for improved compatibility with decompressor apps
  • Whether similar distribution can be applied to 7zip

Mindless, safe inline XML generation in PHP

Well, almost…

<?php
// Wraps a string that is already safe XML, so xx() will not escape it again.
class XmlFragment {
	public $text;
	function __construct($text) {
		$this->text = $text;
	}
	function toString() {
		return $this->text;
	}
}

// Render an associative array as escaped key="value" pairs.
function squash_attribs($attribs) {
	return implode(' ', array_map('squash_attrib',
		array_keys($attribs), array_values($attribs)));
}

function squash_attrib($key, $value) {
	return htmlspecialchars($key) . '="' . htmlspecialchars($value) . '"';
}

// xx(name, [attribs,] child, ...): build one element.
// Plain strings are escaped; XmlFragment children are inserted as-is.
function xx() {
	$args = func_get_args();
	$name = array_shift($args);
	$next = array_shift($args);
	if (is_array($next)) {
		$attribs = ' ' . squash_attribs($next);
		$next = array_shift($args);
	} else {
		$attribs = '';
	}
	$contents = null;
	while (null !== $next) {
		if (is_object($next)) {
			$contents .= $next->toString();
		} else {
			$contents .= htmlspecialchars( $next );
		}
		$next = array_shift($args);
	}
	if ($contents === null) {
		// No children at all: emit a self-closing tag
		return new XmlFragment("<$name$attribs />");
	} else {
		return new XmlFragment("<$name$attribs>$contents</$name>");
	}
}

$page = xx("html", array("lang" => "en"),
		xx("h1", "Some thingy"),
		xx("p", "This & that. Nothing important..."),
		xx("h2", "Log in"),
		xx("form", array("method" => "post", "action" => "/submit.php"),
			xx("label", array("for" => "user"), "Username:"),
			" ",
			xx("input", array("name" => "user")),
			xx("label", array("for" => "password"), "Password:"),
			" ",
			xx("input", array("name" => "password", "type" => "password")),
			xx("input", array("type" => "submit"))),
		xx("p", "That's , folks!"));
echo $page->toString();