rsync 3.0 rocks!

Wikimedia’s public image and media file uploads archive has been growing and growing and growing over the years, nowadays easily eating 1.5 TB or so.

This has made it harder to provide publicly downloadable copies, as well as to maintain our own internal backup copies — and not having backups in a hurricane zone is considered bad form.

In the terabyte range, making a giant tar archive is kind of… difficult. Not only is it insanely hard to download the whole thing if you want it, but it multiplies our space requirements — you need space for every complete and variant archive as well as all the original files. Plus it just takes forever to make them.

rsync seems like a natural fit for updating and then synchronizing a large set of medium-size files, but as we’ve grown it has become slower and slower and slower.

The problem was that rsync worked in two steps:

First, it scanned the directory trees to list all the files it might have to transfer.

Then, once that was done, it transferred the ones which needed to be updated.

Great for a few thousand files — rotten for a few million! On an internal test, I found that after a full day the rsync process was still transferring nothing but filenames — and its in-memory file list was using 2.6 *gigabytes* of RAM.

Not ideal. :)

Searching around, I stumbled upon the interesting fact that the upcoming rsync 3.0 does “incremental recursion” — that is, it does that little “list, then transfer” cycle for each individual directory instead of for the entire file set at once.
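
To make the difference concrete, here is a rough Python sketch of the two strategies. It is only an illustration of the idea, not how rsync itself is implemented; the function names and the `transfer` callback are made up for the example. Each function returns the size of the largest file list it had to hold in memory at once.

```python
import os
from typing import Callable, List


def list_then_transfer(root: str, transfer: Callable[[str], None]) -> int:
    """Old style: scan the whole tree first, then start transferring."""
    file_list: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_list.append(os.path.join(dirpath, name))
    # Nothing has been transferred yet, and every path is held in memory.
    for path in file_list:
        transfer(path)
    return len(file_list)


def incremental_transfer(root: str, transfer: Callable[[str], None]) -> int:
    """Incremental style: list one directory, transfer its files, move on."""
    largest_batch = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        batch = [os.path.join(dirpath, name) for name in filenames]
        largest_batch = max(largest_batch, len(batch))
        # Transfers start as soon as the first directory has been listed.
        for path in batch:
            transfer(path)
    return largest_batch
```

With millions of files spread over many directories, the first function's list grows with the whole tree, while the second only ever holds one directory's worth of names.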

I grabbed a development tree from CVS, compiled it up, and gave it a try — within seconds I started to see files actually being transferred.

Hallelujah! rsync works again…

We’re now in the process of getting our internal backup synced up again, and will see about getting an off-site backup and maybe a public rsync 3.0 server set up.

Some installer tasks

Been poking about at MediaWiki, but not sure what to do? Here are a few tasks that would help with some common problems for third-party users:

  • 1379: Install can’t find config/index.php
    Some hosting services put a “control panel” of some sort at the “/config” URL, making it difficult to get at the MediaWiki installer. Renaming this to something more unique, and providing a compatibility link for convenience, would not be very hard but would help people stuck on this sort of host.
  • 9954: Detect “extra whitespace” / BOM conditions
    PHP is very picky about extra whitespace at the start and end of source files. Unfortunately it’s not uncommon for people to end up with extra blank lines or a hidden Unicode BOM sequence at the start or end of files they’ve customized. This leads to weird, hard to diagnose problems like cookies not getting set or RSS feeds that break with little explanation. Some software support to detect this and report which file is broken (and how to fix it!) would be very helpful.
  • 10387: Detect and handle ‘.php5’ extension environments
    More and more hosting services are providing PHP 5.x, but some are putting it alongside existing PHP 4.x services, requiring that files be named with a .php5 extension. With a little care, the installer could detect this out of the box and set things up to work on such systems with few problems.
    Update 2007-06-28: Edward Z. Yang whipped up a good patch for this, which I’ve committed to trunk.
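
For the whitespace/BOM task, the check itself could be quite small. Here is a rough sketch in Python, just to show the logic; a real version would live in MediaWiki's PHP code, and the function name and messages here are invented for the example. It only looks for a leading UTF-8 BOM, leading whitespace, and extra whitespace after the final closing tag, and it assumes PHP's behaviour of swallowing a single newline directly after `?>`.

```python
from typing import List

UTF8_BOM = b"\xef\xbb\xbf"


def check_php_source(data: bytes) -> List[str]:
    """Return human-readable problems found in the raw bytes of a PHP file."""
    problems: List[str] = []

    # A BOM at the very start is sent to the browser before any headers.
    if data.startswith(UTF8_BOM):
        problems.append("UTF-8 BOM before the opening <?php tag")
        data = data[len(UTF8_BOM):]

    # Whitespace before the opening tag is output the same way.
    if not data.startswith(b"<?php"):
        leading = len(data) - len(data.lstrip())
        if leading:
            problems.append("%d byte(s) of whitespace before <?php" % leading)

    # Look at whatever follows the last closing tag, if there is one.
    end = data.rfind(b"?>")
    if end != -1:
        tail = data[end + 2:]
        if tail.strip() == b"":
            # Assumption: PHP ignores one newline right after ?>.
            if tail.startswith(b"\r\n"):
                tail = tail[2:]
            elif tail.startswith(b"\n"):
                tail = tail[1:]
            if tail:
                problems.append("%d extra byte(s) of whitespace after ?>" % len(tail))

    return problems
```

Running this over LocalSettings.php and any extension files at install or startup time would let the software name the broken file instead of failing mysteriously later.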

Wikimedia page views

For the curious, here are some statistics I whipped up the other day from our sampled logs. These are counts of plain article page views on all of Wikimedia’s wikis, from about three days’ worth of a 1/1000-sampled log. Images, style sheets, edit actions, diffs, special pages, etc. are excluded.

I’ve broken it down by referrer, with some rough groupings. It’s not very scientific, but some might find it interesting. :)

We should be able to process similar statistics on an automated basis from the log server now that we’ve got it set up, including breakdowns by site and by language.

Referrer         Samples  Percentage  Daily extrapolation  Monthly extrapolation
total              71873    100.0000          233,610,232          7,008,306,975
in-wiki link       31512     43.8440          102,424,076          3,072,722,293
Google             17102     23.7947           55,586,969          1,667,609,059
no referrer        11873     16.5194           38,591,047          1,157,731,397
IE7 gadget          3862      5.3734           12,552,735            376,582,048
other Wikimedia     2174      3.0248            7,066,195            211,985,855
in-wiki search      2140      2.9775            6,955,684            208,670,529
Yahoo               1507      2.0968            4,898,232            146,946,957
other               1162      1.6167            3,776,872            113,306,147
MSN                  208      0.2894              676,067             20,281,995
Live                 189      0.2630              614,310             18,429,313
AOL                   84      0.1169              273,027              8,190,806
Ask                   39      0.0543              126,762              3,802,874
AltaVista             21      0.0292               68,257              2,047,701
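
For anyone who wants to redo the arithmetic, here is a small Python sketch of how the columns relate. The exact length of the log window isn't given here, so instead of guessing it, the sketch backs the per-sample scaling factor out of the table's own "total" row; the 30-day month is likewise inferred from the ratio between the last two columns, not something stated in the post.

```python
# Figures copied from the "total" row of the table.
TOTAL_SAMPLES = 71873
TOTAL_DAILY = 233_610_232

# Assumption: the monthly column is the daily estimate times 30.
DAYS_PER_MONTH = 30


def share(samples: int) -> float:
    """Percentage of all sampled page views."""
    return 100.0 * samples / TOTAL_SAMPLES


def daily_views(samples: int) -> float:
    """Scale a sample count by the same factor as the total row."""
    return samples * TOTAL_DAILY / TOTAL_SAMPLES


def monthly_views(samples: int) -> float:
    """Extrapolate the daily estimate to a 30-day month."""
    return DAYS_PER_MONTH * daily_views(samples)


if __name__ == "__main__":
    for name, samples in [("in-wiki link", 31512), ("Google", 17102)]:
        print(name, round(share(samples), 4), round(daily_views(samples)))
```

Because the factor is taken from the rounded total, results can differ from the table by a view or two per day.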

Synergy vs gnome-screensaver

I’ve been using Synergy to share my mouse & keyboard between my Linux desktop and Mac laptop in the office.

One of the features of Synergy that hasn’t been working so well for me is the screen-saver synchronization. I’m not too picky, but I do want to be able to quickly lock both screens at once so I can leave the room without leaving a bunch of server terminals open to anyone who walks in!

After a little research, I found that Synergy’s X11 server code looks explicitly for Xscreensaver, but Ubuntu ships with gnome-screensaver, which has a different interprocess control API based on DBUS. This is apparently an issue of much contention, as a lot of video players and other apps haven’t updated to speak the new protocol, and you end up with screen savers activating during long-playing files and such.

One possibility is to manually reconfigure Ubuntu to use Xscreensaver, but it would probably be cuter to add support for the DBUS interface to Synergy.
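
As a stopgap, locking the GNOME side can be scripted from outside Synergy. The sketch below is Python and simply shells out to `dbus-send`; the service name `org.gnome.ScreenSaver`, the object path, and the `Lock` method are my assumptions about gnome-screensaver's DBUS interface and should be checked against the installed version before relying on them.

```python
import subprocess
from typing import List


def lock_command() -> List[str]:
    """Build a dbus-send call asking gnome-screensaver to lock the screen.

    The destination, object path, and method name are assumptions,
    not something confirmed in this post.
    """
    return [
        "dbus-send",
        "--session",
        "--dest=org.gnome.ScreenSaver",
        "--type=method_call",
        "/org/gnome/ScreenSaver",
        "org.gnome.ScreenSaver.Lock",
    ]


def lock_screen() -> None:
    """Run the command; raises CalledProcessError if dbus-send fails."""
    subprocess.run(lock_command(), check=True)
```

Proper support inside Synergy would make the equivalent call from its X11 screen-saver code instead of looking only for Xscreensaver.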

iProduct vs Veronica Mars

So I finally gave in and picked up an Apple TV unit; that frees up my Mac Mini from TV duty to be my main home computer, while letting the Apple TV concentrate on being a media player.

The good: unit is very compact, setup is pretty straightforward, and picture looks good once I adjust the ungodly color saturation my TV defaults to on the component input.

The bad: at least for the shows I tested (Veronica Mars season 3), video playback is totally broken at HD resolutions!

At 720p playback stutters very badly, with very jerky motion and sound out of sync from picture by about a second.

At 1080i I don’t even *get* picture during playback, just sound. (Menus display fine.)

At 480p everything looks great, though, and the currently available content doesn’t need more than that, so I’m leaving it there for now.

A quick Google scan doesn’t show any other obvious complaints of this problem, so I’m not sure if I’ve got a bogus unit or if it’s something funky with the Veronica Mars encoding that might not be a problem with other shows…

Update: At some point it started working fine. *shrug*

Leverage your synergy

On Rob’s advice, I set up Synergy to share my keyboard and mouse between my Linux and Mac boxes at the office.

Pretty straightforward to set up (if you’re a *nix geek); I had just one nasty surprise. If you’re sharing a keyboard from a PC server to a Mac OS X client, it switches the alt and command keys for you.

That might be a cute option if you’re using a PC keyboard, where Alt and Windows keys appear in the opposite order from the Mac’s Alt/Option and Command keys. Not so cute if you’re using a Mac keyboard and want things to remain sensible.

Luckily, it’s pretty easy to switch them back. In the screen section of the config file for the Mac client, add these options:

		super = alt
		alt = super

It seems to consider ‘super’ and ‘meta’ to be almost the same, but if you say ‘meta’ here it gets confused — you get two option keys and no command key.

Wikimedia in Google Summer of Code

Wikimedia’s been accepted as a mentoring organization for the 2007 Google Summer of Code program.

Here’s our organization page, and I put up an initial project list on meta.

The list is semi-protected so it won’t be too vandalized ;) but additional suggestions are welcome. I’d like to ask that people who aren’t directly involved in development not add too much to the main page directly, though; last year we ended up with lots of project submissions for things that weren’t really considered high priority, so I’d like to keep the list a little more ordered this time.

We don’t know for sure how many projects we’ll get assigned, so we’ll see. :) At least Tim and I will serve as mentors for the student projects; if a couple more experienced developers would like to help out with that too that would be super.

Last year’s projects went really well up to the public demo stage but never quite got integrated into the mainline; I’m hoping that this year we can stick with projects that will be easier to slip in and take live much earlier in the process, which should help keep the students interested and the projects active.