dbzip2

bzip2 is hideously slow; while looking around for ways to possibly speed it up I stumbled on pbzip2, which exploits the block-based nature of bzip2 compression to parallelize it for potentially linear speedups with an increased number of CPUs.

There are two downsides: first, there are some compatibility issues with third-party decompressor software (probably resolvable), and second it only scales until you run out of local CPUs.

For kicks I'm prototyping a distributed version, dbzip2. In theory, a lot of fast machines on a LAN should be able to speed things up even further.

I've been testing with a Wikipedia dump file that's about a gigabyte, compressing down to about a quarter that.

On my dual-core G5 Power Mac, regular single-threaded bzip2 compresses the file in about 356 seconds (3.122 MB/s input processing).

pbzip2 set for two threads runs about 50% faster; I clocked it at 234 seconds (4.750 MB/s).

My dbzip2 prototype does similar with two local threads, though a touch less efficiently:

Wrote 1295 blocks in 250.9 seconds (4.430 MB/s in, 0.988 MB/s out)

And I can squeeze a few more percentage points out by using remote compressors on the other machines I had lying around:

Wrote 1295 blocks in 188.8 seconds (5.887 MB/s in, 1.313 MB/s out)

The breakdown

Most of the data went to the two local threads on my Power Mac:
local thread: processed 447 blocks in 188.7 seconds (2.033 MB/s in, 0.458 MB/s out)
local thread: processed 444 blocks in 188.7 seconds (2.019 MB/s in, 0.451 MB/s out)

My old Linux box, an Athlon XP 2400+ on 100 Mbit ethernet, took a respectable share:
rdaneel:12345: processed 237 blocks in 188.7 seconds (1.078 MB/s in, 0.238 MB/s out)

CPU usage was pretty high though not maxed out (80-90%), likely due to the tenth of a second spent transferring each 900 KB input block and its compressed output over a 100 Mbit network.

Running the whole file locally it can process 1.344 MB/s.

My old Powerbook (1 GHz G4) and newer Mini (1.5 GHz Core Solo) sit on wireless, which is a lot slower:
verda-majo.local:12345: processed 61 blocks in 188.7 seconds (0.277 MB/s in, 0.060 MB/s out)
philo.local:12345: processed 106 blocks in 188.7 seconds (0.482 MB/s in, 0.106 MB/s out)

These two were clearly limited by the slower network, with CPU usage around just 20-30%. Still, every bit helps!

Conclusions

This shows promise for quickly compressing large files when fast machines are available on a fast network. Slow networks strongly reduce the benefits, but on switched gigabit ethernet things should be nicely CPU-limited even with several fast machines.

The main remaining issues:

  • Whether the bzip2 streams can be stitched together for improved compatibility with decompressor apps
  • Whether similar distribution can be applied to 7zip

Mindless, safe inline XML generation in PHP

Well, almost…

text = $text;
	}
	function toString() {
		return $this->text;
	}
}

function squash_attribs($attribs) {
	return implode(' ', array_map('squash_attrib',
		array_keys($attribs), array_values($attribs)));
}

function squash_attrib($key, $value) {
	return htmlspecialchars($key) . '="' . htmlspecialchars($value) . '"';
}

function xx() {
	$args = func_get_args();
	$name = array_shift($args);
	$next = array_shift($args);
	if (is_array($next)) {
		$attribs = ' ' . squash_attribs($next);
		$next = array_shift($args);
	} else {
		$attribs = '';
	}
	$contents = null;
	while (null !== $next) {
		if (is_object($next)) {
			$contents .= $next->toString();
		} else {
			$contents .= htmlspecialchars( $next );
		}
		$next = array_shift($args);
	}
	if ($contents === null) {
		return new XmlFragment("<$name$attribs />");
	} else {
		return new XmlFragment("<$name$attribs>$contents");
	}
}

$page = xx("html", array("lang" => "en"),
	xx("body",
		xx("h1", "Some thingy"),
		xx("p", "This & that. Nothing important..."),
		xx("hr"),
		xx("h2", "Log in"),
		xx("form", array("method" => "post", "action" => "/submit.php"),
			xx("label", array("for" => "user"), "Username:"),
			" ",
			xx("input", array("name" => "user")),
			xx("br"),
			xx("label", array("for" => "password"), "Password:"),
			" ",
			xx("input", array("name" => "password", "type" => "password")),
			xx("br"),
			xx("input", array("type" => "submit"))),
		xx("hr"),
		xx("p", "That's , folks!")));
echo $page->toString();

?>