Wiki dumps… in-dump revision diffs?

In breaks between fundraiser stuff I’m investigating patching up the dumps to behave nicer. The biggest problem to date has been how to get full-history dumps generated in a reasonable amount of time and with greater reliability.

As previously explored, the compression of the files is itself a pretty big part of the burden; cleaning up the bottleneck here could allow improvements in the other processing to shine. Effective compression takes a lot of CPU, though, especially the 7-zip LZMA that does so well on the history dumps.

An idea that gets tossed around from time to time is storing diffs of text revision-to-revision; most edits only change a paragraph or two, so storing just the change can save a lot of space. Any differential system introduces complexity and could potentially be fragile, but it ain’t an awful idea.

Our own internal storage has a frightening amalgam of external database shards, batch compression, and character encoding conversion, which is something we try to hide by doing the dumps as version-independent XML. :)

I’m experimenting a bit with hacking something that looks more or less like a standard unified diff into the exporter, which would be fairly easy to implement a re-patcher for on import.
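The real code lives in the PHP exporter, but the shape of it is easy to sketch in Python: difflib emits a unified diff between two revisions, and a small hand-rolled re-patcher rebuilds the new revision on import. (Function names are mine; this is a toy, not the actual implementation.)

```python
import difflib

def make_diff(old: str, new: str) -> str:
    """Render the change between two revisions as a unified diff."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        n=0,  # no context lines, for the smallest possible diff
    ))

def apply_diff(old: str, diff: str) -> str:
    """Re-patch: rebuild the new revision from the old text plus its diff."""
    old_lines = old.splitlines(keepends=True)
    result, pos = [], 0
    for line in diff.splitlines(keepends=True):
        if line.startswith("@@"):
            # Hunk header, e.g. "@@ -3,2 +3,2 @@"; only the old range matters.
            old_range = line.split()[1].lstrip("-")
            if "," in old_range:
                start, count = map(int, old_range.split(","))
            else:
                start, count = int(old_range), 1
            # A pure insertion (count 0) points at the line it follows.
            hunk_start = start - 1 if count else start
            result.extend(old_lines[pos:hunk_start])  # copy unchanged lines
            pos = hunk_start
        elif line.startswith("---") or line.startswith("+++"):
            continue  # file headers emitted by difflib
        elif line.startswith("-"):
            pos += 1  # line removed from the old revision
        elif line.startswith("+"):
            result.append(line[1:])  # line added in the new revision
    result.extend(old_lines[pos:])
    return "".join(result)

old = "one\ntwo\nthree\n"
new = "one\n2\nthree\nfour\n"
assert apply_diff(old, make_diff(old, new)) == new
```

The diff stream stores only changed lines plus hunk headers, which is where the space savings come from.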

Testing with a tiny chunk of the English Wikipedia containing a few thousand revisions of [[Wikipedia:Anarchism|Anarchism]]… the diff-laden version comes to about 18M for the 3687-revision file, versus 194M for the fully expanded version.

Not bad. :)

7-Zip compresses them both down to about 408K… but the smaller file takes a tenth the time to do so. Even gzip and bzip2 do an order of magnitude better compressing the smaller files.

My first pass adapted the PHP diff class we use for in-wiki diffs… It’s a bit sluggish, but combined with bzip2 compression it beats the diffless version by some margin. Using a faster C++ diff and fixing up the output to be actually usable, this might save a lot of time…

Of course all software using the dumps would have to be updated to understand the diff bits, and I’ll have to decide between in-text diff formatting or light XML markup… :)

I love you, internet

I’ve been grumbling about how hard it is to use apartment rental search sites when location is so important, but you have to click five times for each listing before you get to a map. Wouldn’t it be nice if you could do a search and have it just show the results on a map?

Somebody’s done just that: plots out Craigslist rental listings via Google Maps. SWEET!

I guess that’s Web 2.0 or something? Who knew it would be useful…

Greener Wikimedia Foundation?

Been reading Philip José Farmer’s Dayworld series and found myself thinking about that dang ol’ environment. In the Dayworld universe, a future society “solves” pollution and resource shortages by keeping 1/6 of the world’s population in suspended animation at any given time. Perhaps an extreme solution… :)

Someone once suggested Wikimedia could get into the carbon credits market… which sounds like a lovely scam we should avoid at all costs. ;) But what can we really do to be greener? What is WMF’s actual footprint…?

  • The servers
    We run a few hundred computers 24/7; they and their air conditioning suck up electricity. Could we be more efficient about our power usage? Do our newer 2x quad-core boxes pump more page views per kilowatt than our older machines, and if so should we retire the older ones? Should we investigate blade servers or Sun’s “CoolThreads” systems again?

  • The office
    The main office houses a handful of employees; it’s no BigCo but every bit helps, right? There’s lights, computers, air conditioning, and of course the impact of a few people commuting every day. Moving to San Francisco will make most of those commutes practical by public transportation instead of automobiles, and the more moderate climate could save on electricity spent running the AC.

  • Jet-setting
    We run an international conference every year, as well as smaller meetings and individual speaking engagements. What’s the impact of several hundred people taking transcontinental or intercontinental flights? Can we or should we reduce the amount of travel?

Ubuntu Gutsy vs Parallels

There’s some awful problem with video mode detection in Ubuntu Gutsy on a Parallels 3 virtual machine… I finally got the installer up at 800×600, only to discover that the buttons in the wizard don’t fit on screen:

Further, you can’t resize the window vertically.

Luckily I can sort of get it to fit by moving the desktop panels to the sides… :D


Incremental dumps

A follow-up to my previous notes on dumps

As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes:

  1. dumpBackup.php --skeleton pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
  2. dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.
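The copy-through logic of the second pass can be sketched in a few lines of Python (a toy stand-in for dumpTextPass.php; all names here are made up):

```python
def text_pass(skeleton_rev_ids, previous_dump, fetch_from_db):
    """previous_dump: dict mapping rev_id -> text from the last full dump.
    fetch_from_db: callback hit only for revisions the old dump lacks."""
    for rev_id in skeleton_rev_ids:
        if rev_id in previous_dump:
            yield rev_id, previous_dump[rev_id]    # cheap copy-through
        else:
            yield rev_id, fetch_from_db(rev_id)    # new since last dump

# Usage: only revision 3 is new, so only revision 3 touches the database.
prev = {1: "first text", 2: "second text"}
calls = []
def fetch(rev_id):
    calls.append(rev_id)
    return f"text of {rev_id}"

full = dict(text_pass([1, 2, 3], prev, fetch))
```

The point of the two passes is exactly this: the skeleton fixes the set of revisions up front, and the database only gets hit for the delta.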

It should be relatively easy to modify this technique to create an incremental dump file, which instead of listing out every page and revision in the entire system would list only those which have changed.

The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:

  <page action="create">
    <!-- Creating a new page -->
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <comment>Renamed from "Old title" to "New title"</comment>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <revision action="delete"/>
  </page>

Perhaps those could be pushed down to finer granularity (for instance, to indicate whether a page title actually changed) to avoid unnecessary updates, but I’m not sure how much it’d really matter.
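A consumer could dispatch on those action attributes with any stock XML parser; a rough Python sketch (the fragment is invented to match the example schema above):

```python
import xml.etree.ElementTree as ET

# Hypothetical incremental-dump fragment using the action attribute.
fragment = """
<mediawiki>
  <page action="update">
    <title>New title</title>
    <revision action="create"><id>42</id></revision>
  </page>
  <page action="delete"><title>Gone</title></page>
</mediawiki>
"""

# Walk the pages and collect (action, subject) operations to apply.
ops = []
for page in ET.fromstring(fragment):
    ops.append((page.get("action"), page.findtext("title")))
    for rev in page.findall("revision"):
        ops.append((rev.get("action"), rev.findtext("id")))

print(ops)
# [('update', 'New title'), ('create', '42'), ('delete', 'Gone')]
```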

There are a few scenarios to take into account as far as interaction with unique keys:

  • Page titles (page_namespace,page_title): a page rename can cause a temporary conflict between two pages between the application of one record and the next.
  • Revision IDs (rev_id): History merges could cause a revision to be ‘added’ to one page, and ‘removed’ from another which appears later in the data set. The insertion would trigger a key conflict.

We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or we could perhaps use REPLACE INTO instead of INSERT INTO in all cases… that could leave entries deleted during the application, but they should come back later on so the final result is consistent.
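To show why REPLACE sidesteps the title conflict, here’s the rename scenario in Python’s sqlite3 (SQLite standing in for MySQL; its REPLACE has the same delete-then-insert semantics, and the schema is pared down to the relevant columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (
    page_id        INTEGER PRIMARY KEY,
    page_namespace INTEGER NOT NULL,
    page_title     TEXT NOT NULL,
    UNIQUE (page_namespace, page_title)
);
-- State from the previous dump: page 1 sits at "Old title".
INSERT INTO page VALUES (1, 0, 'Old title');
""")

# The incremental dump creates page 2 at "Old title" *before* it gets
# around to renaming page 1.  A plain INSERT hits the unique key:
try:
    con.execute("INSERT INTO page VALUES (2, 0, 'Old title')")
except sqlite3.IntegrityError as e:
    print("INSERT fails:", e)

# REPLACE kicks out the conflicting row instead...
con.execute("REPLACE INTO page VALUES (2, 0, 'Old title')")
# ...and page 1 "comes back" when its own update record is applied:
con.execute("REPLACE INTO page VALUES (1, 0, 'New title')")

rows = sorted(con.execute("SELECT * FROM page"))
print(rows)  # [(1, 0, 'New title'), (2, 0, 'Old title')]
```

Between those two REPLACEs page 1 is briefly gone, which is exactly the “entries deleted during the application” caveat: the final state is consistent, the intermediate state isn’t.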

In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and not _insanely_ bad even when there are (about 80% slower in my unscientific benchmark), so when conflicts are rare that’s probably just fine. At least for MySQL targets. :D

Test imports of full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time for each run:

$ time mysql -u root working < insert.sql 

real    0m20.819s
user    0m5.537s
sys     0m0.648s

Modified to use REPLACE instead of INSERT, on a fresh empty database:

$ time mysql -u root working < replace.sql 

real    0m20.557s
user    0m5.530s
sys     0m0.643s

Importing completely over a full database:

$ time mysql -u root working < replace.sql 

real    0m34.109s
user    0m5.533s
sys     0m0.641s

So that's probably feasible. :)

In theory an incremental dump could be made against a previous skeleton dump as well as against full dumps, which would make it possible to create additional incremental dumps even if full-text dumps fail or are skipped.


Is this dialog showing success or failure?


Look closer…


Wha? I’m still not sure what’s going on, and I still don’t seem to have a search index in the KDE Help Center. Sigh.

Update: htdig wasn’t installed, which it didn’t report very well. After installing I can apparently build the index, but search still fails.

Again, the error doesn’t get reported well in the UI — it just echoes the command line instead of explaining what went wrong:

$ --docbook --indexdir=/home/brion/.kde/share/apps/khelpcenter/index/ --config=kde_application_manuals --words=multi-file+search --method=and --maxnum=5 --lang=en
Can't execute htsearch at '/srv/www/cgi-bin/htsearch'.


Wiki data dumps

There’s a few things we can do to fix up the data dump process again…

  • New machines: Moving the dump runners from the old boxes they’re on to a couple of our newer quad-core boxes should improve things.
  • Improve parallelism: When generating bzip2 files, ensuring that the dbzip2 configuration is set up properly may help.

    For .7z files, not sure… There’s a note in the changelog for p7zip that LZMA compression is multithreaded as of 4.52; if that gives a good speedup on 2x and 4x boxes, that could be very attractive.

    Figuring out more ways to split across machines could be beneficial as well.

  • Improve read speed: 7z is a *lot* faster to decompress than bzip2. Using the prior .7z dump instead of the prior .bz2 could help speed things up, but last time I tried that I had problems with the pipes not closing properly, leading to hangs at the end.
  • More robust dumpTextPass: the longer it takes, the more likely it is to die due to a database burp. If the actual database-reading part is pushed out to a subprocess, that can be easily restarted after an error while the parent process, which is based around reading stub and previous XML files, keeps on going.
  • Splitting files? There’s some thought that dividing the dumps into smaller pieces might be more reliable, as each piece can be re-run separately if it breaks — as well as potentially run in parallel on separate boxes.

It may also be worth dropping the .bz2 files in favor of .7z, especially if we can speed up the compression.

Also have to check if J7Zip can decompress the files and if it’d be possible to integrate it into mwdumper; I don’t like having to rely on an external tool to decompress the files. I hadn’t had any luck with the older java_lzma.

Update: Just for fun, I tried compressing & decompressing a 100 meg SQL file chunk with various proggies on my iMac (x86_64 Linux, 2.16 GHz Core 2 Duo). Parallelism is measured here as (user time / (real time – system time)) as an approximation of multi-core CPU utilization; left out where only one CPU is getting used.

Program    Comp time   Comp parallelism   Decomp time
gzip       10.354 s    n/a                1.616 s
bzip2      17.018 s    n/a                5.542 s
dbzip2     10.136 s    1.95x              n/a
7zr 4.43   81.603 s    1.47x              3.771 s
7za 4.55   98.201 s    1.46x              3.523 s

Nothing seems to be doing parallelism on decompression.
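That utilization number is just the ratio described above; as a trivial sketch (the timing figures plugged in here are hypothetical, not rows from the table):

```python
def parallelism(real: float, user: float, sys: float) -> float:
    """Approximate multi-core utilization from `time` output: user CPU
    seconds divided by the wall-clock time not spent in the kernel."""
    return user / (real - sys)

# Hypothetical figures for a fully-parallel run on a two-core box:
print(round(parallelism(real=10.0, user=19.0, sys=0.5), 2))  # → 2.0
```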

p7zip seems to be using the second CPU for something during compression, but it’s not fully utilized, which leads me to suspect it won’t get any faster on a four-way or eight-way box. dbzip2 should scale up a couple more levels, and can use additional nodes over the network, but you’re still stuck with slow decompression.

Update: I noticed that 7za also can work with .bz2 archives — and it does parallelize decompression! Nice.

Program         Comp time   Comp parallel   Decomp time   Decomp parallel
7za 4.55 .bz2   17.668 s    1.99x           2.999 s       1.91x

The compression speed is worse than regular bzip2, but the decompression is faster. Innnnteresting…

Update: Some notes on incremental dump generation

Changing AAA memberships when moving?

For those not familiar with the custom, here in the US of A it’s not uncommon for people to get a membership with the AAA (American Automobile Association). For your modest annual dues you get access to basic roadside assistance if you have car trouble (jumpstarting, towing, etc) as well as various discounted insurance and travel services.

I’ve had a membership for years and only used a few services, but it gives me a warm fuzzy feeling to know I have it if I need it.

The one funny thing is that AAA isn’t a single organization, but a federation of many different regional clubs. My membership is with the Automobile Club of Southern California; when I moved to Florida I just renewed it rather than trying to switch to the Auto Club South, in part because I wasn’t sure how long I’d be here.

And sure enough, it looks like I’ll be moving back to California next year — but to San Francisco, which is in the California State Automobile Association’s territory (Utah, Nevada, and Northern California).

The question that I’ve had surprisingly little luck googling or searching FAQs is — is there any benefit to switching my membership to the local club? Would I lose access to insurance I already have through AAA if I did? Would I have trouble getting new insurance if I’m living outside their main territory?

I guess I should call them and ask. Bah. I hate the telephone.

Mac v Linux

I first switched to the Mac in ’03 after a few years of being a mostly Linux/BSD guy. Aside from the ability to test Wikipedia in Mac browsers, I was drawn by the oh-so-cute factor of the 12″ aluminum PowerBook and more importantly the way it actually was able to detect its included hardware and attached monitors. ;)

Four years later, desktop Linux is better than ever but still tends to fall down and wet itself when doing things like setting up a multimonitor configuration or installing Flash and Java plugins in 64-bit mode. I’d be afraid to even try it on a laptop without knowing that sleep/wake and external monitor hookup work properly on that exact model.

But when I switched I promised myself I would retain my freedom to switch back. Today I’m using a Mac laptop and a Linux desktop together in the office; if I wanted to switch 100% to Linux, what would I need to change?…

  • Firefox → Firefox
    Ahh, open source. :)

  • NeoOffice → OpenOffice

  • TextMate / BBEdit → gedit? jEdit? Eclipse?
    I haven’t really been happy with *nix GUI editors. Emacs is not an acceptable option. ;) I need a good project-wide regex search/replace, good charset support, the ability to open & save files over SFTP, and syntax highlighting/sensitivity that doesn’t interfere with my indenting. Being easy to load files from a terminal and not sucking are pluses.

  • Yojimbo → Tomboy?
    I use Yojimbo constantly for notes, scratch space, web receipts, chat snippets, todo lists, reference cheat sheets, anything and everything. Simple as it is, I love this app! The closest thing I’ve used on *nix is Tomboy, but it doesn’t feel as smooth to me. I’ll just have to fiddle with it more… figuring out how to import all my existing data would be another issue.

  • QuickSilver → GNOME desktop launcher?
    I’ve found QuickSilver invaluable for launching various apps… I used to switch to Terminal and run ‘open -a Firefox’ and such. ;) I think the new launcher included with Ubuntu Gutsy will serve okay here, though I haven’t tried it.

  • Keynote → OpenOffice Impress
    I wonder if it’s got the nice preview-on-second-screen that Keynote does.

  • Parallels → VMware Workstation
    Already use this on my office Linux box.

  • iChat → Pidgin
    Been using Pidgin a bit on my Linux box in the office; it’s pretty decent these days.

  • Colloquy → XChat-GNOME
    Kind of awkward, but I haven’t found an IRC client I’m happier with on *nix.

  • Google Earth → Google Maps
    I haven’t had any luck getting the Linux version of Google Earth to run on my office box, but the web version is usually fine.

  • iTunes → Rhythmbox
    I’d have to strip DRM from my iTMS tracks, but that’s certainly doable. Don’t know whether it’ll be able to sync with an iPhone, though. ;)

  • iPhoto → F-Spot
    I took a quick peek at the F-Spot web site and was surprised to find nothing about importing from iPhoto. Should be doable; the photos are all just JPEGs, and the metadata’s in some kind of XML last I looked.

  • NetNewsWire → ?
    I haven’t found a good RSS reader on *nix yet.

  • iCal → Evolution calendar? Sunbird?
    I guess I could use Google Calendar, but it’s kind of nice to have something that works locally.

The biggest gap if I switch at home would be on the video editing / multimedia end of things, which I dabble in sometimes and keep meaning to get back into more. I’m pretty happy with the Apple pro apps (Final Cut, Motion, etc), and there’s not really much touching that in Linux-land.