dbzip2 production testing

The English Wikipedia full-history data dump (my arch-nemesis) died again mid-build due to a database disconnection. I’ve taken the opportunity to clean up dbzip2 a little more and restart the dump build using it.

The client now handles server connections dropping out, and can even reconnect when they come back, so it should be relatively safe for a long-running process. The remote daemon also daemonizes properly, instead of leaving zombies and breaking your terminal.
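For reference, the usual way to get this right on Unix is the classic double-fork pattern; here’s a minimal sketch of it in Python. It’s illustrative only, and assumes nothing about dbzip2d’s actual internals:

```python
import os
import sys

def daemonize():
    # First fork: the shell's child exits immediately, so the shell
    # is never left waiting on a long-running foreground process.
    if os.fork() > 0:
        sys.exit(0)

    # Start a new session with no controlling terminal, so closing
    # the terminal can no longer SIGHUP the daemon.
    os.setsid()

    # Second fork: the session leader exits, so the survivor can never
    # reacquire a controlling terminal. The orphan is reparented to
    # init, which reaps it, leaving no zombies behind.
    if os.fork() > 0:
        sys.exit(0)

    # Point stdio at /dev/null so stray output can't garble whatever
    # terminal the daemon was launched from.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
```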

Using six remote dbzip2d threads, and the faster 7zip decompression for the data prefetch, I’m averaging about 6.5 megabytes per second of pre-compression XML throughput, peaking around 11 MB/sec. That’s an improvement over what I was measuring with the local threads by a factor of 5 or so. If this holds up, it should actually complete in “just” two or three days…
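As a sanity check on that estimate, here’s the back-of-the-envelope arithmetic, using only the average rate quoted above:

```python
# Quick sanity check on the "two or three days" estimate, using only
# the 6.5 MB/sec average figure quoted above.
avg_mb_per_sec = 6.5
gb_per_day = avg_mb_per_sec * 86400 / 1024.0  # about 548 GB of raw XML per day
print("%.0f GB/day" % gb_per_day)
# Two to three days at this rate covers roughly 1.1 to 1.6 TB of
# uncompressed XML, which is what the estimate implies about the
# remaining size of the dump.
```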

Of course that’s assuming the database connection doesn’t drop again! Another thing to improve…

One thought on “dbzip2 production testing”

  1. Hi! I’m looking for ways to speed up the bzip2 process. Is there anything new happening on the dbzip2 front? I’d like to test the software. I have tons of data to compress and a few servers sitting idle most of the time, so remote parallel execution with dbzip2 would be a real time saver! Let me know what the status is. Thanks
