Wikimedia data dump update

Quick update on data dump status:

Dumps are back up and running on srv31, the old dump batch host.

Please note that unlike the wiki sites themselves, dump activity is not considered time-critical; there is no emergency requirement to get the dumps running again as soon as possible.

Getting dumps running again after a few days is nearly as good as getting them running again immediately. Yes, it sucks when it takes longer than we’d like. No, it’s not the end of the world.

Dump runner redesign is in progress.

I’ve chatted a bit with Tim in the past about rearranging the architecture of the dump system to allow for horizontal scaling, which should make the big history dumps much, much faster by distributing the work across multiple CPUs or hosts; right now each wiki’s dump is limited to a single thread.
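For the curious, here’s a minimal sketch of the kind of horizontal scaling we have in mind. Everything in it (the function names, the chunking scheme, the wiki name) is made up for illustration and isn’t the actual design: carve a wiki’s page ID space into ranges, dump each range in its own worker process, and recombine the pieces once they all finish.

```python
"""Rough sketch of a horizontally scaled history-dump runner.

Splits a wiki's page ID space into chunks and dumps each chunk in a
separate worker process instead of walking the whole wiki in one
single-threaded pass. Illustrative only, not the real runner.
"""
from multiprocessing import Pool


def dump_page_range(task):
    """Dump one contiguous range of page IDs into its own output piece.

    A real runner would drive the existing dump tooling restricted to
    the given page range; here we just return the name of the piece
    it would produce.
    """
    wiki, start, end = task
    outfile = f"{wiki}-history-{start:09d}-{end:09d}.xml.bz2"
    # ... invoke the per-range dump here ...
    return outfile


def run_parallel_dump(wiki, max_page_id, chunk_size=100_000, workers=8):
    """Fan the page ID space out across a pool of worker processes."""
    tasks = [
        (wiki, start, min(start + chunk_size - 1, max_page_id))
        for start in range(1, max_page_id + 1, chunk_size)
    ]
    with Pool(workers) as pool:
        # Each chunk becomes an independent piece of the full dump;
        # pieces can be recombined (or retried) once the pool drains.
        return pool.map(dump_page_range, tasks)


if __name__ == "__main__":
    pieces = run_parallel_dump("examplewiki", max_page_id=250_000)
    print(f"{len(pieces)} pieces, e.g. {pieces[0]}")
```

A side benefit of chunking like this: a failed or slow piece can be retried or shuffled to another host on its own, instead of restarting the whole dump.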

We seem to be in agreement on the basic architecture, and Tomasz is now in charge of making this happen; he’ll be poking at infrastructure for this over the next few days, using his past experience with distributed index build systems at Amazon to guide his research, and will report to y’all later this week with some more concrete details.

Dump format changes are in progress.

Robert Rohde’s proof-of-concept code for diff-based dumps is in our SVN and available for testing.

We’ll be looking at the possibility of integrating this and at what effect it has on dump performance; currently performance and reliability are our primary concerns rather than output file size, but the two can intersect, since bzip2 compression is a real chunk of the dump run time.

This will be pushed back if we don’t see an immediate generation-speed improvement, but it’s very much a desired project, since it will make the full-history dump files much smaller.
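To show what “diff-based” buys us, here’s a tiny self-contained sketch, with made-up revision data and not Robert’s actual format, comparing a conventional full-text revision stream against a delta stream for the same page history:

```python
"""Toy illustration of the diff-based dump idea: keep the first
revision of a page in full, then store each later revision as a
unified diff against its parent. Full-history dumps carry every
revision of every page, so storing deltas shrinks the stream long
before bzip2 ever sees it. The revision data below is invented."""
import bz2
import difflib

# Fake history: a longish article where each revision edits one
# paragraph, which is roughly how most wiki edits behave.
lines = [f"Paragraph {i}: some article text that rarely changes." for i in range(200)]
revisions = ["\n".join(lines)]
for n in range(1, 21):
    lines = list(lines)
    lines[(n * 37) % len(lines)] = f"Paragraph rewritten in revision {n}."
    revisions.append("\n".join(lines))

# Conventional full-history stream: every revision stored in full.
full_stream = "\n\n".join(revisions)

# Diff-based stream: first revision in full, then one diff per revision.
pieces = [revisions[0]]
for prev, curr in zip(revisions, revisions[1:]):
    diff = difflib.unified_diff(prev.splitlines(), curr.splitlines(), lineterm="")
    pieces.append("\n".join(diff))
delta_stream = "\n\n".join(pieces)

for name, stream in (("full text", full_stream), ("deltas", delta_stream)):
    raw = stream.encode("utf-8")
    print(f"{name:9s}: {len(raw):8d} bytes raw, "
          f"{len(bz2.compress(raw)):6d} bytes after bzip2")
```

Since bzip2 time grows with the amount of data fed to it, a smaller pre-compression stream can help run time as well as file size, which is why this intersects with the performance work above.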

2 thoughts on “Wikimedia data dump update”

  1. I tried running ConverToEditSyntax on the simple english wikipedia. It got caught in an infinite loop after 266MB of output.
