Greener Wikimedia Foundation?

Been reading Philip José Farmer’s Dayworld series and found myself thinking about that dang ol’ environment. In the Dayworld universe, a future society “solves” pollution and resource shortages by keeping six-sevenths of the world’s population in suspended animation at any given time, with each person living just one day a week. Perhaps an extreme solution… :)

Someone once suggested Wikimedia could get into the carbon credits market… which sounds like a lovely scam we should avoid at all costs. ;) But what can we really do to be greener? What is WMF’s actual footprint…?

  • The servers
    We run a few hundred computers 24/7; they and their air conditioning suck up electricity. Could we be more efficient about our power usage? Do our newer 2x quad-core boxes pump more page views per kilowatt than our older machines, and if so should we retire the older ones? Should we investigate blade servers or Sun’s “CoolThreads” systems again?

  • The office
    The main office houses a handful of employees; it’s no BigCo, but every bit helps, right? There are lights, computers, air conditioning, and of course the impact of a few people commuting every day. Moving to San Francisco will make most of those commutes practical by public transportation instead of by car, and the more moderate climate could save on electricity spent running the AC.

  • Jet-setting
    We run an international conference every year, as well as smaller meetings and individual speaking engagements. What’s the impact of several hundred people taking transcontinental or intercontinental flights? Can we or should we reduce the amount of travel?

Ubuntu Gutsy vs Parallels

There’s some awful problem with video mode detection in Ubuntu Gutsy on a Parallels 3 virtual machine… I finally got the installer up at 800×600, only to discover that the buttons in the wizard don’t fit on screen:

installer-boo.png
Further, you can’t resize the window vertically.

Luckily I can sort of get it to fit by moving the desktop panels to the sides… :D

installer-yay.png

Incremental dumps

A follow-up to my previous notes on dumps.

As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes (sketched below):

  1. dumpBackup.php --stub pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
  2. dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.
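
Roughly, the two passes look something like this (a sketch with placeholder file names; exact options vary a bit between MediaWiki versions):

# Pass 1: metadata-only "skeleton" (stub) dump, no revision text.
$ php maintenance/dumpBackup.php --full --stub --quiet > stub-meta-history.xml

# Pass 2: fill in revision text, prefetching unchanged text from the previous
# full dump so only new revisions are read from the database.
$ php maintenance/dumpTextPass.php --stub=file:stub-meta-history.xml \
      --prefetch=file:previous-meta-history.xml --quiet > pages-meta-history.xml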

It should be relatively easy to modify this technique to create an incremental dump file, which instead of listing out every page and revision in the entire system would list only those which have changed.

The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:

<mediawiki>
  <page action="create">
    <!-- Creating a new page -->
    <id>10</id>
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>100</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <text>...</text>
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <id>11</id>
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>110</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <comment>Renamed from "Old title" to "New title"</comment>
      <text>...</text>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <id>12</id>
    <revision action="delete">
      <id>120</id>
    </revision>
  </page>
</mediawiki>

Perhaps those could be pushed down to finer granularity, for instance to indicate whether a page title actually changed, so unnecessary updates can be skipped; but I’m not sure how much it’d really matter.

There are a few scenarios to take into account as far as interaction with unique keys goes:

  • Page titles (page_namespace,page_title): a page rename can cause a temporary title conflict between two pages from the application of one record to the next.
  • Revision IDs (rev_id): History merges could cause a revision to be ‘added’ to one page, and ‘removed’ from another which appears later in the data set. The insertion would trigger a key conflict.

We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or we could perhaps use REPLACE INTO instead of INSERT INTO in all cases. That could leave some entries deleted partway through the application, but they should come back later on, so the final result is consistent.
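
To make that concrete, here’s a toy illustration of how REPLACE behaves in the rename-conflict case (a hypothetical table in a scratch database, not the real page/revision schema):

$ mysql -u root scratch <<'SQL'
-- Toy schema just to show the behaviour; MediaWiki's real unique key is on
-- (page_namespace, page_title).
CREATE TABLE toy_page (
  page_id    INT          NOT NULL PRIMARY KEY,
  page_title VARCHAR(255) NOT NULL,
  UNIQUE KEY (page_title)
);
INSERT INTO toy_page VALUES (1, 'Old title'), (2, 'Other page');

-- The increment says page 1 is now 'New title' and page 2 is now 'Old title'.
-- If page 2's record is applied first, a plain UPDATE would hit the unique
-- title key; REPLACE instead also drops the conflicting row for page 1:
REPLACE INTO toy_page VALUES (2, 'Old title');

-- Page 1 temporarily vanishes, then comes back when its own record applies:
REPLACE INTO toy_page VALUES (1, 'New title');

-- Final state is consistent: (1, 'New title'), (2, 'Old title').
SQL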

In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and not _insanely_ bad even when there are (about 80% slower in my unscientific benchmark), so when conflicts are rare that’s probably just fine. At least for MySQL targets. :D

Test imports of ia.wikipedia.org full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time for each run:

$ time mysql -u root working < insert.sql 

real    0m20.819s
user    0m5.537s
sys     0m0.648s

Modified to use REPLACE instead of INSERT, on a fresh empty database:

$ time mysql -u root working < replace.sql 

real    0m20.557s
user    0m5.530s
sys     0m0.643s

Importing completely over a full database:

$ time mysql -u root working < replace.sql 

real    0m34.109s
user    0m5.533s
sys     0m0.641s

So that's probably feasible. :)

In theory an incremental dump could be made against a previous skeleton dump as well as against full dumps, which would make it possible to create additional incremental dumps even if full-text dumps fail or are skipped.

KDE WTF

Is this dialog showing success or failure?

screenshot-build-search-indices-kde-help-center.png

Look closer…

screenshot-build-search-indices-kde-help-center-1.png

Wha? I’m still not sure what’s going on, and I still don’t seem to have a search index in the KDE Help Center. Sigh.

Update: htdig wasn’t installed, which the Help Center didn’t report very well. After installing it I can apparently build the index, but search still fails.

Again, the error doesn’t get reported well in the UI — it just echoes the khc_htsearch.pl command line instead of explaining what went wrong:

$ khc_htsearch.pl --docbook --indexdir=/home/brion/.kde/share/apps/khelpcenter/index/ --config=kde_application_manuals --words=multi-file+search --method=and --maxnum=5 --lang=en
Can't execute htsearch at '/srv/www/cgi-bin/htsearch'.

Sigh…..

Wiki data dumps

There are a few things we can do to fix up the data dump process again…

  • New machines: Moving the dump runners from the old boxes they’re on to a couple of our newer quad-core boxes should improve things.
  • Improve parallelism: When generating bzip2 files, ensuring that the dbzip2 configuration is set up properly may help.

    For .7z files, I’m not sure… There’s a note in the p7zip changelog that LZMA compression is multithreaded as of 4.52; if that gives a good speedup on 2x and 4x boxes, it could be very attractive. (A quick way to check is sketched after this list.)

    Figuring out more ways to split across machines could be beneficial as well.

  • Improve read speed: 7z is a *lot* faster to decompress than bzip2. Using the prior .7z dump instead of the prior .bz2 could help speed things up, but last time I tried that I had problems with the pipes not closing properly, leading to hangs at the end.
  • More robust dumpTextPass: the longer it takes, the more likely it is to die due to a database burp. If the actual database-reading part is pushed out to a subprocess, that can be easily restarted after an error while the parent process, which is based around reading stub and previous XML files, keeps on going.
  • Splitting files? There’s some thought that dividing the dumps into smaller pieces might be more reliable, as each piece can be re-run separately if it breaks — as well as potentially run in parallel on separate boxes.
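
As a quick check of the multithreaded-LZMA item above, timing the same file with one and two threads should show whether the second core buys anything. (A sketch; chunk.xml is a placeholder test file, and I’m assuming p7zip’s -mmt switch is the relevant knob in 4.52+.)

$ time 7zr a -mx=9 -mmt=1 chunk-1thread.7z chunk.xml    # single-threaded baseline
$ time 7zr a -mx=9 -mmt=2 chunk-2threads.7z chunk.xml   # two LZMA threads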

It may also be worth dropping the .bz2 files in favor of .7z, especially if we can speed up the compression.

Also have to check whether J7Zip can decompress the files and whether it’d be possible to integrate it into mwdumper; I don’t like having to rely on an external tool for decompression. I haven’t had any luck with the older java_lzma.

Update: Just for fun, I tried compressing & decompressing a 100 meg SQL file chunk with various proggies on my iMac (x86_64 Linux, 2.16 GHz Core 2 Duo). Parallelism is measured here as (user time / (real time – system time)) as an approximation of multi-core CPU utilization; left out where only one CPU is getting used.

Program     Comp time   Comp parallelism   Decomp time
gzip        10.354 s    -                  1.616 s
bzip2       17.018 s    -                  5.542 s
dbzip2      10.136 s    1.95x              -
7zr 4.43    81.603 s    1.47x              3.771 s
7za 4.55    98.201 s    1.46x              3.523 s
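
To spell out the arithmetic behind the parallelism column, with made-up numbers rather than the ones above: a dual-core run reporting

real    0m10.0s
user    0m18.6s
sys     0m0.5s

would score user / (real - sys) = 18.6 / (10.0 - 0.5) ≈ 1.96x, i.e. both cores close to fully busy.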

Nothing seems to be doing parallelism on decompression.

p7zip seems to be using the second CPU for something during compression, but it’s not fully utilized, which leads me to suspect it won’t get any faster on a four-way or eight-way box. dbzip2 should scale up a couple more levels, and can use additional nodes over the network, but you’re still stuck with slow decompression.

Update: I noticed that 7za also can work with .bz2 archives — and it does parallelize decompression! Nice.

Program         Comp time   Comp parallel   Decomp time   Decomp parallel
7za 4.55 .bz2   17.668 s    1.99x           2.999 s       1.91x

The compression speed is worse than regular bzip2, but the decompression is faster. Innnnteresting…
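
For reference, this sort of comparison can be reproduced with something along these lines (chunk.sql is a placeholder for the 100 MB test file; I’m not claiming these exact switches produced the numbers above):

$ time 7za a -tbzip2 chunk.sql.bz2 chunk.sql    # write a plain .bz2 via 7za
$ time 7za e -so chunk.sql.bz2 > /dev/null      # decompress to stdout, discarded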

Update: Some notes on incremental dump generation.

Changing AAA memberships when moving?

For those not familiar with the custom, here in the US of A it’s not uncommon for people to get a membership with the AAA (American Automobile Association). For your modest annual dues you get access to basic roadside assistance if you have car trouble (jumpstarting, towing, etc) as well as various discounted insurance and travel services.

I’ve had a membership for years and only used a few services, but it gives me a warm fuzzy feeling to know I have it if I need it.

The one funny thing is that AAA isn’t a single organization, but a federation of many different regional clubs. My membership is with the Automobile Club of Southern California; when I moved to Florida I just renewed it rather than trying to switch to the Auto Club South, in part because I wasn’t sure how long I’d be here.

And sure enough, it looks like I’ll be moving back to California next year — but to San Francisco, which is in the California State Automobile Association’s territory (Utah, Nevada, and Northern California).

The question I’ve had surprisingly little luck answering by googling or searching FAQs: is there any benefit to switching my membership to the local club? Would I lose access to insurance I already have through AAA if I did? Would I have trouble getting new insurance if I’m living outside their main territory?

I guess I should call them and ask. Bah. I hate the telephone.

Mac v Linux

I first switched to the Mac in ’03 after a few years of being a mostly Linux/BSD guy. Aside from the ability to test Wikipedia in Mac browsers, I was drawn by the oh-so-cute factor of the 12″ aluminum PowerBook and, more importantly, by the way it could actually detect its own hardware and attached monitors. ;)

Four years later, desktop Linux is better than ever but still tends to fall down and wet itself when doing things like setting up a multi-monitor configuration or installing Flash and Java plugins in 64-bit mode. I’d be afraid to even try it on a laptop without knowing that sleep/wake and external monitor hookup work properly on that exact model.

But when I switched I promised myself I would retain my freedom to switch back. Today I’m using a Mac laptop and a Linux desktop together in the office; if I wanted to switch 100% to Linux, what would I need to change?…

Mac app → Linux app

  • Firefox, Thunderbird, Gimp → the same. Ahh, open source. :)
  • NeoOffice → OpenOffice
  • TextMate / BBEdit → gedit? jEdit? Eclipse? I haven’t really been happy with *nix GUI editors. Emacs is not an acceptable option. ;)

    I need a good project-wide regex search/replace, good charset support, the ability to open & save files over SFTP, and syntax highlighting/sensitivity that doesn’t interfere with my indenting. Being easy to load files from a terminal and not sucking are pluses.

  • Yojimbo → Tomboy? I use Yojimbo constantly for notes, scratch space, web receipts, chat snippets, todo lists, reference cheat sheets, anything and everything. Simple as it is, I love this app! The closest thing I’ve used on *nix is Tomboy, but it doesn’t feel as smooth to me. I’ll just have to fiddle with it more… figuring out how to import all my existing data would be another issue.
  • QuickSilver → GNOME desktop launcher? I’ve found QuickSilver invaluable for launching apps… I used to switch to Terminal and run ‘open -a Firefox’ and such. ;) I think the new launcher included with Ubuntu Gutsy will serve okay here, though I haven’t tried it.
  • Keynote → OpenOffice Impress. Wonder if it’s got the nice preview-on-second-screen that Keynote does.
  • Parallels → VMware Workstation. Already use this on my office Linux box.
  • iChat → Pidgin. Been using Pidgin a bit on my Linux box in the office; it’s pretty decent these days.
  • Colloquy → XChat-GNOME. Kind of awkward, but I haven’t found an IRC client I’m happier with on *nix.
  • Google Earth → Google Maps. I haven’t had any luck getting the Linux version of Google Earth to run on my office box, but the web version is usually fine.
  • iTunes → Rhythmbox. I’d have to strip the DRM from my iTMS tracks, but that’s certainly doable. Don’t know whether it’ll be able to sync with an iPhone, though. ;)
  • iPhoto → F-Spot. I took a quick peek at the F-Spot web site and was surprised to find nothing about importing from iPhoto. Should be doable; the photos are all just JPEGs and the metadata’s in some kind of XML, last I looked.
  • NetNewsWire → ? I haven’t found a good RSS reader on *nix yet.
  • iCal → Evolution calendar? Sunbird? I guess I could use Google Calendar, but it’s kind of nice to have something that works locally.

The biggest gap if I switch at home would be on the video editing / multimedia end of things, which I dabble in sometimes and keep meaning to get back into more. I’m pretty happy with the Apple pro apps (Final Cut, Motion, etc), and there’s not really much touching that in Linux-land.

Mailman sucks

GNU Mailman kinda sucks, but like democracy I’ve yet to come across something better. ;)

A few things I’d really like to see improved:

  • Web archive links in footer

    I find I fairly commonly read something on a list and then want to discuss it with other folks in chat. To point them at the same message I was reading, I have to pull up the web archives, then poke around to find it, then copy the link.

    In an ideal world, the message footer could include a link to the same message on the web archives, and I could just copy and paste.

  • Thread moderation

    It should be easy to place an out-of-control thread on moderation. I’ll be honest, I can’t figure out how to do it right now. There’s _spam_ filtering, but we discard that. There’s whole-list moderation. There’s per-user moderation. But how do you moderate a particular thread?

  • Throttling

    In some cases, a simple time-delay throttle can help calm things down without actually forcing a moderator to sit there and approve messages. It can feel “fairer” too, since you’re not singling out That One Guy Who Keeps Posting In That One Thread.

  • Easy archive excision

    On public mailing lists, sometimes people post private information accidentally (phone numbers in the signature, private follow-up accidentally sent to list, etc) which they then ask to be removed from the archives. People can understandably get a bit worked up over privacy issues, particularly when the Google-juice of a wikimedia.org domain bumps the message to the first Google hit for their name. ;)

    Unfortunately it’s a huge pain in the ass to excise a message from the archives in Mailman. You have to shut down the mailing list service, hand-edit a multi-megabyte text file containing all that month’s messages (carefully, so you don’t disturb the message numbering), rebuild the HTML archive pages from the raw files, and then, finally, restart the Mailman service (roughly the procedure sketched at the end of this post). That’s a lot of manual work and some service outage, balanced against having people scream at you, arguably justifiably; for now we’ve ended up simply disabling crawler access to the archives to keep them out of high-ranked global search indexes. (I know you disagree with this, Timwi. It’s okay, we still love you!)

    If we could strike a message from the archives in a one-touch operation, the way we can unsubscribe someone who can’t figure out the unsubscribe directions, we could switch those crawlers back on and make it easier to search the list.

  • Integrated archive search

    We’ve experimented with the htdig integration patch, but the search results are not terribly good and the indexing performance is too slow on our large archives. Even if we get Google etc going again, it’d be nice to have an integrated search that’s a little more polished.
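
For reference, the archive-excision dance mentioned above currently looks roughly like this. (A sketch only; the paths are typical Debian-style Mailman locations and LISTNAME is a placeholder, not our exact setup.)

# Stop list delivery and archiving while the raw archive is being edited.
$ /etc/init.d/mailman stop

# Hand-edit the raw archive mbox, e.g. blanking the offending text rather than
# deleting whole messages, so numbering and existing archive URLs stay stable.
$ vi /var/lib/mailman/archives/private/LISTNAME.mbox/LISTNAME.mbox

# Regenerate the HTML archive pages from the edited mbox.
$ /usr/lib/mailman/bin/arch --wipe LISTNAME

$ /etc/init.d/mailman start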