Ubuntu Gutsy vs Parallels

There’s some awful problem with video mode detection in Ubuntu Gutsy on a Parallels 3 virtual machine… I finally got the installer up at 800×600, only to discover that the buttons in the wizard don’t fit on screen:

[Screenshot: installer-boo.png]
Further, you can’t resize the window vertically.

Luckily I can sort of get it to fit by moving the desktop panels to the sides… :D

[Screenshot: installer-yay.png]

Incremental dumps

A follow-up to my previous notes on dumps

As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes:

  1. dumpBackup.php --skeleton pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
  2. dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.

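Roughly, the two passes look something like this from the command line (just a sketch; the dumpTextPass.php option names here are from memory and may not match the script exactly):

# pass 1: page/revision metadata only, no text
php dumpBackup.php --skeleton > skeleton.xml

# pass 2: fill in text, copying unchanged revisions from the previous full dump
# (option names illustrative)
php dumpTextPass.php --stub=skeleton.xml --prefetch=previous-full.xml > new-full.xml
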
It should be relatively easy to modify this technique to create an incremental dump file, which instead of listing out every page and revision in the entire system would list only those which have changed.

The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:

<mediawiki>
  <page action="create">
    <!-- Creating a new page -->
    <id>10</id>
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>100</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <text>...</text>
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <id>11</id>
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>110</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <comment>Renamed from "Old title" to "New title"</comment>
      <text>...</text>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <id>12</id>
    <revision action="delete">
      <id>120</id>
    </revision>
  </page>
</mediawiki>

Perhaps those could be pushed down to a finer granularity, for instance to indicate whether a page title was changed or not and so avoid unnecessary updates, but I’m not sure how much it’d really matter.

There are a few scenarios to take into account as far as interaction with unique keys:

  • Page titles (page_namespace, page_title): a page rename can cause a temporary conflict between two pages while the dump is being applied, between one record and the next.
  • Revision IDs (rev_id): history merges could cause a revision to be ‘added’ to one page and ‘removed’ from another page which appears later in the data set. The insertion would trigger a key conflict.

We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or perhaps use REPLACE INTO instead of INSERT INTO in all cases… that could leave some entries temporarily deleted partway through the application, but they should come back later on, so the final result is consistent.

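As a purely illustrative sketch, the SQL an importer might emit for the <page action="update"> record above could use REPLACE like this; the values are made up, and the point is that MySQL’s REPLACE matches on both the primary key (page_id) and the unique (page_namespace, page_title) index, so a temporary title collision simply overwrites the stale row:

mysql -u root working <<'SQL'
-- hypothetical values for page 11 after its rename; any existing row holding
-- either page_id=11 or the same (namespace, title) pair gets replaced
REPLACE INTO page (page_id, page_namespace, page_title, page_restrictions,
                   page_counter, page_is_redirect, page_is_new, page_random,
                   page_touched, page_latest, page_len)
VALUES (11, 0, 'New_title', '', 0, 0, 0, RAND(), '20010115140300', 110, 1234);
SQL
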
In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and isn’t _insanely_ bad even when there are (about 80% slower in my unscientific benchmark), so when conflicts are rare that’s probably just fine. At least for MySQL targets. :D

Test imports of ia.wikipedia.org full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time for each run:

$ time mysql -u root working < insert.sql 

real    0m20.819s
user    0m5.537s
sys     0m0.648s

Modified to use REPLACE instead of INSERT, on a fresh empty database:

$ time mysql -u root working < replace.sql 

real    0m20.557s
user    0m5.530s
sys     0m0.643s

Importing completely over a full database:

$ time mysql -u root working < replace.sql 

real    0m34.109s
user    0m5.533s
sys     0m0.641s

So that's probably feasible. :)

In theory an incremental dump could be made against a previous skeleton dump as well as against full dumps, which would make it possible to create additional incremental dumps even if full-text dumps fail or are skipped.

Wiki data dumps

There are a few things we can do to fix up the data dump process again…

  • New machines: Moving the dump runners from the old boxes they’re on to a couple of our newer quad-core boxes should improve things.
  • Improve parallelism: When generating bzip2 files, ensuring that the dbzip2 configuration is set up properly may help.

    For .7z files, not sure… There’s a note in the changelog for p7zip that LZMA compression is multithreaded as of 4.52; if that gives a good speedup on 2x and 4x boxes, that could be very attractive.

    Figuring out more ways to split across machines could be beneficial as well.

  • Improve read speed: 7z is a *lot* faster to decompress than bzip2. Using the prior .7z dump instead of the prior .bz2 could help speed things up, but last time I tried that I had problems with the pipes not closing properly, leading to hangs at the end. (A quick way to compare the raw decompression cost is sketched just after this list.)
  • More robust dumpTextPass: the longer it takes, the more likely it is to die due to a database burp. If the actual database-reading part is pushed out to a subprocess, that can be easily restarted after an error while the parent process, which is based around reading stub and previous XML files, keeps on going.
  • Splitting files? There’s some thought that dividing the dumps into smaller pieces might be more reliable, as each piece can be re-run separately if it breaks — as well as potentially run in parallel on separate boxes.

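For the read-speed item above, a quick and dirty way to compare the raw decompression cost of the two formats (file names are placeholders):

# placeholder file names; 7za's -so streams the extracted data to stdout
time bzip2 -dc previous-full.xml.bz2 > /dev/null
time 7za e -so previous-full.xml.7z > /dev/null
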
It may also be worth dropping the .bz2 files in favor of .7z, especially if we can speed up the compression.

I also have to check whether J7Zip can decompress the files and whether it’d be possible to integrate it into mwdumper; I don’t like having to rely on an external tool to decompress the files. I didn’t have any luck with the older java_lzma.

Update: Just for fun, I tried compressing & decompressing a 100 meg SQL file chunk with various proggies on my iMac (x86_64 Linux, 2.16 GHz Core 2 Duo). Parallelism is measured here as (user time / (real time – system time)) as an approximation of multi-core CPU utilization; left out where only one CPU is getting used.

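For instance, with made-up numbers of user 20.0 s, real 10.5 s, and sys 0.5 s, that works out to about 2x:

echo 'scale=2; 20.0 / (10.5 - 0.5)' | bc   # user / (real - sys) = 2.00x
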
Program     Comp time   Comp parallelism   Decomp time
gzip        10.354 s    -                  1.616 s
bzip2       17.018 s    -                  5.542 s
dbzip2      10.136 s    1.95x              -
7zr 4.43    81.603 s    1.47x              3.771 s
7za 4.55    98.201 s    1.46x              3.523 s

Nothing seems to be doing parallelism on decompression.

p7zip seems to be using the second CPU for something during compression, but it’s not fully utilized, which leads me to suspect it won’t get any faster on a four-way or eight-way box. dbzip2 should scale up a couple more levels, and can use additional nodes over the network, but you’re still stuck with slow decompression.

Update: I noticed that 7za also can work with .bz2 archives — and it does parallelize decompression! Nice.

Program         Comp time   Comp parallelism   Decomp time   Decomp parallelism
7za 4.55 .bz2   17.668 s    1.99x              2.999 s       1.91x

The compression speed is worse than regular bzip2, but the decompression is faster. Innnnteresting…

Update: Some notes on incremental dump generation

Mac v Linux

I first switched to the Mac in ’03 after a few years of being a mostly Linux/BSD guy. Aside from the ability to test Wikipedia in Mac browsers, I was drawn by the oh-so-cute factor of the 12″ aluminum PowerBook and more importantly the way it actually was able to detect its included hardware and attached monitors. ;)

Four years later, desktop Linux is better than ever but still tends to fall down and wet itself when doing things like setting up a multi-monitor configuration or installing Flash and Java plugins in 64-bit mode. I’d be afraid to even try it on a laptop without knowing that sleep/wake and external monitor hookup work properly on that exact model.

But when I switched I promised myself I would retain my freedom to switch back. Today I’m using a Mac laptop and a Linux desktop together in the office; if I wanted to switch 100% to Linux, what would I need to change?…

Mac app → Linux app:

  • Firefox → Firefox. Ahh, open source. :)
  • Thunderbird → Thunderbird
  • Gimp → Gimp
  • NeoOffice → OpenOffice
  • TextMate / BBEdit → gedit? jEdit? Eclipse? I haven’t really been happy with *nix GUI editors. Emacs is not an acceptable option. ;)

    I need a good project-wide regex search/replace, good charset support, the ability to open & save files over SFTP, and syntax highlighting/sensitivity that doesn’t interfere with my indenting.

    Being easy to load files from a terminal and not sucking are pluses.

  • Yojimbo → Tomboy? I use Yojimbo constantly for notes, scratch space, web receipts, chat snippets, todo lists, reference cheat sheets, anything and everything.

    Simple as it is, I love this app! The closest thing I’ve used on *nix is Tomboy, but it doesn’t feel as smooth to me. I’ll just have to fiddle with it more… figuring out how to import all my existing data would be another issue.

  • QuickSilver → GNOME desktop launcher? I’ve found QuickSilver invaluable for launching various apps… I used to switch to Terminal and run 'open -a Firefox' and such. ;) I think the new launcher that will be included with Ubuntu Gutsy will serve okay here, though I haven’t tried it.
  • Keynote → OpenOffice Impress. Wonder if it’s got the nice preview-on-second-screen that Keynote does.
  • Parallels → VMware Workstation. Already use this on my office Linux box.
  • iChat → Pidgin. Been using Pidgin a bit on my Linux box in the office; it’s pretty decent these days.
  • Colloquy → XChat-GNOME. Kind of awkward, but I haven’t found an IRC client I’m happier with on *nix.
  • Google Earth → Google Maps. I haven’t had any luck getting the Linux version of Google Earth to run on my office box, but the web version is usually fine.
  • iTunes → Rhythmbox. I’d have to strip the DRM from my iTMS tracks, but that’s certainly doable. Don’t know whether it’ll be able to sync with an iPhone, though. ;)
  • iPhoto → F-Spot. I took a quick peek at the F-Spot web site and was surprised to find nothing about importing from iPhoto. Should be doable; the photos are all just JPEGs and the metadata’s in some kind of XML, last I looked.
  • NetNewsWire → ? I haven’t found a good RSS reader on *nix yet.
  • iCal → Evolution calendar? Sunbird? I guess I could use Google Calendar, but it’s kind of nice to have something that works locally.

The biggest gap if I switched at home would be in the video editing / multimedia end of things, which I dabble in sometimes and keep meaning to get back into more. I’m pretty happy with the Apple pro apps (Final Cut, Motion, etc.), and there’s not really much touching that in Linux-land.

Mailman sucks

GNU Mailman kinda sucks, but like democracy I’ve yet to come across something better. ;)

A few things I’d really like to see improved:

  • Web archive links in footer

    I find I fairly commonly read something on a list and then want to discuss it with other folks in chat. To point them at the same message I was reading, I have to pull up the web archives, then poke around to find it, then copy the link.

    In an ideal world, the message footer could include a link to the same message on the web archives, and I could just copy and paste.

  • Thread moderation

    It should be easy to place an out-of-control thread on moderation. I’ll be honest, I can’t figure out how to do it right now. There’s _spam_ filtering, but we discard that. There’s whole-list moderation. There’s per-user moderation. But how do you moderate a particular thread?

  • Throttling

    In some cases, a simple time-delay throttle can help calm things down without actually forcing a moderator to sit there and approve messages. It can feel “fairer” too, since you’re not singling out That One Guy Who Keeps Posting In That One Thread.

  • Easy archive excision

    On public mailing lists, sometimes people post private information accidentally (phone numbers in the signature, private follow-up accidentally sent to list, etc) which they then ask to be removed from the archives. People can understandably get a bit worked up over privacy issues, particularly when the Google-juice of a wikimedia.org domain bumps the message to the first Google hit for their name. ;)

    Unfortunately it’s a huge pain in the ass to excise a message from the archives in Mailman. You have to shut down the mailing list service, carefully edit a multi-megabyte text file containing all of that month’s messages without disturbing the message numbering, rebuild the HTML archive pages from the raw files, and then, finally, restart the Mailman service (roughly the sequence sketched at the end of this list). That’s a lot of manual work and some service outage, balanced against having people scream at you, arguably justifiably; for now we’ve ended up simply disabling crawler access to the archives to keep them out of high-ranked global search indexes. (I know you disagree with this, Timwi. It’s okay, we still love you!)

    If we could strike a message from the archives in a one-touch operation, the way we can unsubscribe someone who can’t figure out the unsubscribe directions, we could switch those crawlers back on and make it easier to search the list.

  • Integrated archive search

    We’ve experimented with the htdig integration patch, but the search results are not terribly good and the indexing performance is too slow on our large archives. Even if we get Google etc going again, it’d be nice to have an integrated search that’s a little more polished.

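For the archive-excision point above, the manual dance today looks roughly like this (commands and paths are from memory and vary by installation, and “mylist” is a placeholder, so treat it as a sketch rather than a recipe):

mailmanctl stop                                # take the list service down
vi archives/private/mylist.mbox/mylist.mbox    # carefully cut out the offending message
arch --wipe mylist                             # rebuild the HTML archives from the raw mbox
mailmanctl start
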
So you wanna be a MediaWiki coder?

Some easy bugs to cut your teeth on…

  • Bug 1600 – clean up accidental == header markup == in new sections. (Note — there’s an unrelated patch which got posted on this bug by mistake ages ago, just ignore it. :)
  • Bug 11389 – current diff views probably should clear watchlist update notifications generally, as they do for talk page notifications.
  • Bug 11380 – the ‘Go’ search shortcut needs some namespace option lovin’…

Or maybe you’d prefer to clean up an old patch and get it ready to go?

  • Bug 900 – Fix category column spacing. Since letter headers take up more space than individual lines, we get oddly balanced columns if some letters are better represented than others…
    Age: 2 years, 7 months. Ouch! :)
    Patch status: Applied with only minor cleanup; this function hasn’t changed much! There seems to be something wrong with the algorithm here; while it seems to balance a bit better, I see items dropped off the end of the list sometimes. Needs more work.
  • Bug 1433 – HTML meta info for interlanguage links.
    Age: Two years, seven months.
    Patch status: Applied after minor changes, but doesn’t seem compatible. Provided an alternate version which seems to work with SeaMonkey and Lynx. Is this an appropriate thing, and how do we i18nize the link text?

rsync 3.0 crashy :( [fixed!]

rsync 3.0 may rock, but it’s also kinda crashy. :(

It’s died a couple of times while syncing up our ~2TB file upload backup. I’ve attached gdb to the running processes to try to at least get some backtraces next time it goes.

Update: Got a backtrace…

#0  0x00002b145663886b in free () from /lib/libc.so.6
#1  0x000000000043277b in pool_free_old (p=0x578730, addr=) at lib/pool_alloc.c:266
#2  0x0000000000404374 in flist_free (flist=0x89e960) at flist.c:2239
#3  0x000000000040ddbc in check_for_finished_files (itemizing=1, code=FLOG, check_redo=0) at generator.c:1851
#4  0x000000000040e189 in generate_files (f_out=4, local_name=) at generator.c:1960
#5  0x000000000041753c in do_recv (f_in=5, f_out=4, local_name=0x0) at main.c:774
#6  0x00000000004177ac in client_run (f_in=5, f_out=, pid=25539, argc=, 
    argv=0x56cf98) at main.c:1021
#7  0x0000000000418706 in main (argc=2, argv=0x56cf90) at main.c:1185

Looks like others may have seen it in the wild but a fix doesn’t seem to be around yet. Some sort of bug in the extent allocation pool freeing changes done in May 2007, I think.

Found and patched a probably unrelated bug in pool_alloc.c.

Further update, the next day:

The crashy bug should be fixed now. Yay!

PHP 5.2.4 error reporting changes

Noticed a couple of neat bits while combing through the changelog for the PHP 5.2.4 release candidate…

  • Changed “display_errors” php.ini option to accept “stderr” as value which makes the error messages to be outputted to STDERR instead of STDOUT with CGI and CLI SAPIs (FR #22839). (Jani)

This warms the cockles of my heart! We do a lot of command-line maintenance scripts for MediaWiki, and it’s rather annoying to have error output spew to stdout by default. Being able to direct it to stderr, where it won’t interfere with the main output stream, should be very nice.

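For instance, once we’re on 5.2.4 a cron’d maintenance run could keep the two streams apart like this (the script name and log paths are just placeholders):

# placeholder script and log names
php -d display_errors=stderr maintenance/someScript.php > run.log 2> errors.log
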
  • Changed error handler to send HTTP 500 instead of blank page on PHP errors. (Dmitry, Andrei Nigmatulin)

This in theory should give nicer results for when the software appears to *just die* for no reason — with display_errors off, if you hit a PHP fatal error the code just stops and nothing else gets output. In an app that does its processing before any output, the result is a blank page with no cues to the user as to what happened.

Unfortunately it looks like it’s only going to be a help to machine processing, and even then only for the blank-page case. :(

In my quick testing, I only get the 500 error when no output had been sent yet… and it *still* returns blank output; it just comes with the 500 result code.

The plus side is this should keep blank errors out of Google and other search indexes; the minus side is it won’t help with fatal errors that come in the middle of output, or the rather common case of sites which leave display_errors on… because then the error message gets output, so you don’t get a 500 result code.

rsync 3.0 rocks!

Wikimedia’s public image and media file uploads archive has been growing and growing and growing over the years, nowadays easily eating 1.5 TB or so.

This has made it harder to provide publicly downloadable copies, as well as to maintain our own internal backup copies — and not having backups in a hurricane zone is considered bad form.

In the terabyte range, making a giant tar archive is kind of… difficult. Not only is it insanely hard to download the whole thing if you want it, but it multiplies our space requirements — you need space for every complete and variant archive as well as all the original files. Plus it just takes forever to make them.

rsync seems like a natural fit for updating and then synchronizing a large set of medium-size files, but as we’ve grown it became slower and slower and slower.

The problem was that rsync worked in two steps:

First, it scanned the directory trees to list all the files it might have to transfer.

Then, once that was done, it transferred the ones which needed to be updated.

Great for a few thousand files — rotten for a few million! On an internal test, I found that after a full day the rsync process was still transferring nothing but filenames — and its in-memory file list was using 2.6 *gigabytes* of RAM.

Not ideal. :)

Searching around, I stumbled upon the interesting fact that the upcoming rsync 3.0 does “incremental recursion” — that is, it does that little “list, then transfer” cycle for each individual directory instead of for the entire file set at once.

I grabbed a development tree from CVS, compiled it up, and gave it a try — within seconds I started to see files actually being transferred.

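The invocation itself is nothing special; something along these lines with placeholder paths (and with -v, the 3.0 builds print “sending incremental file list” instead of the old “building file list …” and its long wait):

# placeholder paths; flags: archive mode, verbose, prune files deleted on our end
rsync -av --delete /export/uploads/ backuphost:/backups/uploads/
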
Hallelujah! rsync works again…

We’re now in the process of getting our internal backup synced up again, and will see about getting an off-site backup and maybe a public rsync 3.0 server set up.