Wiki dumps… in-dump revision diffs?

In breaks between fundraiser stuff I’m investigating patching up the dumps to behave nicer. The biggest problem to date has been how to get full-history dumps generated in a reasonable amount of time and with greater reliability.

As previously explored, the compression of the files is itself a pretty big part of the burden; cleaning up the bottleneck here could allow improvements in the other processing to shine. Effective compression takes a lot of CPU, though, especially the 7-Zip LZMA that does so well on the history dumps.

An idea that gets tossed around from time to time is storing diffs of text revision-to-revision; most edits only change a paragraph or two, so only storing the change can save a lot of space. Any differential system introduces complexity and potentially could be fragile, but it ain’t an awful idea.

Our own internal storage has a frightening amalgam of external database shards, batch compression, and character encoding conversion, which is something we try to hide by doing the dumps as version-independent XML. :)

I’m experimenting a bit with hacking something that looks more or less like a standard unified diff into the exporter; a re-patcher for that format would be fairly easy to implement on the import side.
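To make the export/re-patch round trip concrete, here's a minimal Python sketch. It is not the exporter patch itself: it leans on Python's standard `difflib` for the unified diffs, and the helper names (`make_diff`, `apply_diff`, `encode_history`, `decode_history`) are made up for illustration. It assumes zero lines of context and keeps each diff as a list of lines rather than settling on any in-dump formatting.

```python
import difflib
import re

# Hunk header as emitted by difflib.unified_diff, e.g. "@@ -3,2 +3 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@$")


def make_diff(old_text, new_text):
    """Unified diff (no context lines) from old_text to new_text, as a list of lines."""
    return list(difflib.unified_diff(
        old_text.split("\n"), new_text.split("\n"), n=0, lineterm=""))


def apply_diff(old_text, diff_lines):
    """Rebuild the new text from old_text plus a diff produced by make_diff."""
    if not diff_lines:
        return old_text  # identical revisions produce an empty diff
    old_lines = old_text.split("\n")
    new_lines = []
    old_pos = 0
    i = 2  # skip the "---" / "+++" file header lines
    while i < len(diff_lines):
        match = HUNK_HEADER.match(diff_lines[i])
        if match is None:
            raise ValueError("expected hunk header, got %r" % diff_lines[i])
        old_start = int(match.group(1))
        old_len = 1 if match.group(2) is None else int(match.group(2))
        new_len = 1 if match.group(4) is None else int(match.group(4))
        # A zero-length old range names the line *before* the insertion point.
        hunk_at = old_start if old_len == 0 else old_start - 1
        new_lines.extend(old_lines[old_pos:hunk_at])  # copy untouched lines
        old_pos = hunk_at
        i += 1
        seen_old = seen_new = 0
        # Consume the hunk body by count, so text lines that merely look
        # like headers ("--- foo", "@@ ...") can't confuse the parser.
        while seen_old < old_len or seen_new < new_len:
            tag, body = diff_lines[i][:1], diff_lines[i][1:]
            if tag in ("-", " "):
                if old_lines[old_pos] != body:
                    raise ValueError("diff does not match base text")
                old_pos += 1
                seen_old += 1
            if tag in ("+", " "):
                new_lines.append(body)
                seen_new += 1
            if tag not in ("-", "+", " "):
                raise ValueError("bad diff line %r" % diff_lines[i])
            i += 1
    new_lines.extend(old_lines[old_pos:])
    return "\n".join(new_lines)


def encode_history(revisions):
    """Keep the first revision whole; store every later one as a diff."""
    diffs = [make_diff(a, b) for a, b in zip(revisions, revisions[1:])]
    return revisions[0], diffs


def decode_history(base, diffs):
    """Re-patch: replay the diffs in order to get every revision back."""
    texts = [base]
    for diff in diffs:
        texts.append(apply_diff(texts[-1], diff))
    return texts
```

Calling `decode_history(*encode_history(revisions))` should hand back the original list of revision texts. The fragility shows up right in the sketch: each revision depends on every diff before it, so one damaged diff breaks the rest of the chain.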

Testing with a tiny chunk of the English Wikipedia which contains a few thousand revisions of [[Wikipedia:Anarchism|Anarchism]]… the diff-laden version is about 18M for the 3687-revision file, versus 194M for the fully expanded version.

Not bad. :)

7-Zip compresses them both down to about 408K… but the smaller file takes a tenth the time to do so. Even gzip and bzip2 do an order of magnitude better compressing the smaller files.
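For anyone who wants to poke at this kind of comparison themselves, here's a rough Python sketch. Everything in it is an assumption for illustration: the page history is synthetic (one random paragraph rewritten per revision), the function names are made up, and Python's `zlib`, `bz2`, and `lzma` modules stand in for gzip, bzip2, and 7-Zip's LZMA. The numbers it prints won't match the Anarchism figures above; it only shows the shape of the measurement.

```python
import bz2
import difflib
import lzma
import random
import time
import zlib


def synthetic_history(n_revisions=200, n_paragraphs=40, seed=0):
    """Made-up page history: each revision rewrites one paragraph of the last."""
    rng = random.Random(seed)
    words = ["wiki", "dump", "history", "revision", "text", "edit", "page", "diff"]

    def paragraph():
        return " ".join(rng.choice(words) for _ in range(60))

    current = [paragraph() for _ in range(n_paragraphs)]
    revisions = ["\n".join(current)]
    for _ in range(n_revisions - 1):
        current = list(current)
        current[rng.randrange(n_paragraphs)] = paragraph()
        revisions.append("\n".join(current))
    return revisions


def as_full_dump(revisions):
    """Every revision stored in full, one after another."""
    return "\n\n".join(revisions).encode("utf-8")


def as_diff_dump(revisions):
    """First revision in full, then one zero-context unified diff per edit."""
    parts = [revisions[0]]
    for prev, cur in zip(revisions, revisions[1:]):
        delta = difflib.unified_diff(
            prev.split("\n"), cur.split("\n"), n=0, lineterm="")
        parts.append("\n".join(delta))
    return "\n\n".join(parts).encode("utf-8")


def measure(data):
    """Return {codec: (compressed size in bytes, seconds taken)} for data."""
    codecs = (("zlib", zlib.compress), ("bz2", bz2.compress), ("lzma", lzma.compress))
    results = {}
    for name, compress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        results[name] = (len(packed), time.perf_counter() - start)
    return results


if __name__ == "__main__":
    revs = synthetic_history()
    for label, blob in (("full", as_full_dump(revs)), ("diffs", as_diff_dump(revs))):
        print(label, len(blob), "bytes raw")
        for name, (size, secs) in measure(blob).items():
            print("  %s: %d bytes in %.3fs" % (name, size, secs))
```

The compressors have far fewer input bytes to chew through when most of the redundancy has already been removed by the diffs, which is where any time savings would come from.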

My first pass adapted the PHP diff class we use for in-wiki diffs… It’s a bit sluggish, but combined with bzip2 compression it beats the diffless version by some margin. With a faster C++ diff, and the output fixed up to be actually usable, this might save a lot of time…

Of course all software using the dumps would have to be updated to understand the diff bits, and I’ll have to decide between in-text diff formatting or light XML markup… :)