The system works!

Back in the olden days when Wikipedia was young and new, I used to get sucked into all sorts of bug hunts and fix deployments on the servers. While live debugging on a huge system like that can be great fun, it’s also a major time sink!

Since coming back to Wikimedia a few weeks ago, I’ve tried to keep my focus narrower: my most direct work is preparation & experimental groundwork for the big parser & visual editor projects, so for other things I’m trying to be available for advice and review without actually diving fully into every cool thing that comes up.

Today we had a great example of how our “post-startup” Wikimedia Engineering department can work well in getting a quick live fix pushed:

  • Philippe was following up on a deletion issue reported from the community: a stray deleted category page was stuck in Google listings, causing some annoyance. He could direct Google to re-scan the page, but it kept claiming the page still existed. Unable to figure out why, Philippe asked me to take a look….
  • I confirmed that MediaWiki was reporting the page as existing (HTTP 200 OK rather than HTTP 404 Not Found) and, remembering vaguely how the 404 reporting worked, pulled up the source of CategoryPage.php to see exactly what it was checking for.
  • By hitting the MediaWiki API I could confirm that the current source code *should* return a 404 based on the category’s empty page counts in the database…
  • …but comparing the current development code against the MediaWiki 1.17 deployment branch, I could see that this was a new fix which hadn’t been deployed yet!
  • Since it was a clean simple fix, I marked it as ok on our CodeReview system and tagged it for merging.
  • I was then able to hit up some of the ops & dev folks in #wikimedia-dev to see if we could get the fix out quickly or if it would have to wait for a later deployment in a batch with other stuff.
  • It being a nice clean isolated fix, Sam concurred it was safe to merge, and Arthur coordinated pushing it out. Within a few minutes, we confirmed the fix on test.wikipedia.org and then got it deployed to all wikis.
  • A quick purge on the affected page to ensure caches were clear, and Philippe was able to run it through Google again and get it cleared out immediately!
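
For the curious, the check I did by hand in the second and third steps can be sketched roughly like this in Python. It is only an illustration, not the actual logic in CategoryPage.php: the helper names, the example.org URLs, and the simplified rule "missing page plus zero members means 404" are my assumptions here. The `action=query&prop=categoryinfo` call is the standard MediaWiki API query for category member counts.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def category_info_url(api_base, title):
    """Build a MediaWiki API URL asking for a category's member counts."""
    params = {
        "action": "query",
        "prop": "categoryinfo",
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def expected_status(api_response):
    """Guess the status the page *should* return.

    Simplified assumption: a category page that is missing and has
    no members should be a 404; anything else should be a 200.
    """
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        is_missing = "missing" in page
        member_count = page.get("categoryinfo", {}).get("size", 0)
        if is_missing and member_count == 0:
            return 404
    return 200


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    req = Request(url, headers={"User-Agent": "status-check-sketch/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)


def actual_status(page_url):
    """Return the HTTP status the wiki really sends for a page."""
    req = Request(page_url, method="HEAD",
                  headers={"User-Agent": "status-check-sketch/0.1"})
    try:
        with urlopen(req) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still what we want.
        return err.code
```

Comparing `expected_status(fetch_json(category_info_url(api, title)))` against `actual_status(page_url)` is the whole test: if the first says 404 and the second says 200, the deployed code is behind what the data says it should do.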

This is kind of an ideal case: the fix was already in development trunk, it’s a small clean patch that’s easy to review and merge, and testing the return code on a deleted category is very easy to do. But it does confirm that we’ve got a basically sane setup where issues can be pushed along from confirmation to merging to testing to deployment in a reasonably speedy fashion, without having to bottleneck it all through a specific “does-everything-themselves” person.

Makes me happy!