Wikipedia domain redirect problem

Looks like we had a problem this morning with our domain redirection configuration, which broke access to the site for at least some people for a while.

Wikimedia has a lot of non-default domains registered, which we set up as redirects to the various primary domains — for instance www.wikipedia.com redirects to www.wikipedia.org, the standard location for Wikipedia’s multilingual entry portal.

This is handled by a special Apache virtual host which accepts connections for all the domains we don’t actually host wikis on. It has a bunch of mod_rewrite rules which decide which domain to send each request on to, and it returns an HTTP redirect response to the browser, which then goes on to the correct site.

For efficiency, many of these responses are declared to be cacheable (“301 Moved Permanently”), since they always send on to the same spot. This means that multiple hits to the same redirected URL will make use of our Squid proxy caching layer, reducing traffic to our backend servers.
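To make the caching behavior concrete, here’s a rough Python sketch of the idea. It is not our actual Apache/mod_rewrite config; the `REDIRECTS` table, the `redirect_response` function and the bare `wikipedia.com` entry are made up for illustration. A 301 is cacheable by default, so a proxy can keep serving it without asking the backend again, while a 302 is not cacheable unless the headers say so.

```python
# Hypothetical table of alias domains and the primary domain each one maps to.
# Only the www.wikipedia.com entry comes from the post; the other is illustrative.
REDIRECTS = {
    "www.wikipedia.com": "www.wikipedia.org",
    "wikipedia.com": "www.wikipedia.org",
}


def redirect_response(host, path, permanent=True):
    """Return (status, headers) for an alias host, or None if we don't redirect it."""
    target = REDIRECTS.get(host.lower())
    if target is None:
        return None  # not an alias domain; serve the request normally

    headers = {"Location": "http://" + target + path}
    if permanent:
        # 301 Moved Permanently: caches may store this and reuse it.
        return 301, headers

    # 302 Found: temporary; explicitly ask caches to revalidate before reuse.
    headers["Cache-Control"] = "no-cache"
    return 302, headers
```

For example, `redirect_response("www.wikipedia.com", "/wiki/Foo")` gives a 301 pointing at `http://www.wikipedia.org/wiki/Foo`. The flip side is exactly the problem described next: whatever the 301 says, right or wrong, is what the cache will keep handing out.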

The unfortunate thing is that if the configuration gets messed up and people are sent to the *wrong* URL, that’s also cached. The redirect config file was accidentally broken during maintenance this morning, creating redirect loops for some URLs which weren’t supposed to redirect in the first place.

To fix it, we’ve been restarting the Squid proxies and clearing their caches to ensure that all bad redirects are flushed out of the system.

As part of our ongoing mission to create permanent fixes to known site maintenance problems, we’re moving up some improvements that were already on our list but which we hadn’t gotten to yet:

  • Proper version control for the relevant config files
  • Staging server for web server configuration changes — something we can test against in the live environment but which doesn’t pollute the primary web caches if it breaks while we’re testing it
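
One kind of check a staging setup could run before a change goes live is a scan for redirect loops. The sketch below is only an illustration: it assumes the redirects have already been flattened into a simple URL-to-URL dictionary (real mod_rewrite rules are patterns, not a lookup table), and `find_redirect_loop` is a hypothetical helper, not something we actually have.

```python
def find_redirect_loop(redirects, start):
    """Follow redirects from `start`.

    Returns the list of URLs forming the loop (ending where it began),
    or None if the chain ends at a URL that does not redirect.
    """
    seen = []
    url = start
    while url in redirects:
        if url in seen:
            # We have come back to a URL we already visited: that's a loop.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None  # reached a URL with no further redirect


# A healthy config: the alias points at the real site and stops there.
good = {"http://www.wikipedia.com/": "http://www.wikipedia.org/"}

# A broken config: two URLs bounce the browser back and forth forever.
bad = {
    "http://www.wikipedia.org/": "http://www.wikipedia.com/",
    "http://www.wikipedia.com/": "http://www.wikipedia.org/",
}
```

Here `find_redirect_loop(good, "http://www.wikipedia.com/")` returns `None`, while the same call against `bad` returns the chain of URLs that loops back on itself.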

Why is everything broken this week? :)

We’ve tracked down today’s problems to a combination of a couple of things:

  1. There’ve been ongoing database locking issues with the site statistics updates — these would all block on each other, making page saves very slow at times
  2. … which held open database connections, causing the text storage servers to start locking out new connections …
  3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.

The code changes have been rolled back, fixing the slow site load behavior. (Doing this correctly was unfortunately a bit painful, as we had to restore the broken code for a while to work out what was going on well enough to fully revert it.)

Domas believes the main culprit on the database locking is actually an issue with our mail server: some actions (such as creation of new accounts) involve both sending mail and updating the site statistics table. MediaWiki calls a very simple local mail client, and with the mail server overloaded, the outgoing mail would sometimes hang while the transaction was still open. That held the locks, which caused other updates to stall.

As a temporary measure I’ve disabled the site stats updates, fixing the failures on page save. (They’ll need to be re-updated after we’ve totally resolved it.)

We’re looking at the way the mail servers are set up to see if we can ensure that internal connections don’t stall the way they were; we should also be able to rearrange the transactions so that things are committed before the mail goes out!
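
The “commit first, mail afterwards” ordering looks roughly like this. It’s a minimal sketch using Python’s built-in `sqlite3` module rather than MediaWiki’s actual PHP code, and the `accounts` and `site_stats` tables, the `total_users` column and the `send_mail` callback are all invented for the example.

```python
import sqlite3


def create_account(conn, username, send_mail):
    """Create an account and bump the stats counter, then send the welcome mail.

    Returns True if the mail went out, False if mailing failed. Either way
    the database changes are already committed by the time mail is attempted.
    """
    # Using the connection as a context manager commits on success and
    # rolls back if an exception is raised inside the block.
    with conn:
        conn.execute("INSERT INTO accounts (name) VALUES (?)", (username,))
        conn.execute("UPDATE site_stats SET total_users = total_users + 1")

    # The transaction is closed and its locks released at this point, so a
    # slow or hung mail server can no longer hold up other writers.
    try:
        send_mail(username)
    except Exception:
        return False
    return True
```

The point is just the ordering: nothing that talks to an external service happens while the transaction is open, so the worst a stalled mailer can do is delay that one request.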

Wikipedia downtime 2x today

Well, today was exciting! Wikimedia’s sites experienced two downtime events today.

The first, which lasted about 30 minutes, was due to a power problem. While Rob was performing maintenance fixing up power in rack B2, power was inadvertently shut off to an access switch serving another rack of servers, which took a chunk of our core text storage offline.

The second, which also lasted about 30 minutes, was caused by a file server failure. The file server that holds our NFS home directories and misc files and logs experienced a kernel crash, then turned up some disk errors on reboot. (Possibly two failed drives, which may hose the array.)

Ideally this wouldn’t disturb production web serving, but various debugging logs were being saved onto this server, and this caused the web servers to hang waiting for NFS to come back up.

We’ve disabled the internal debug logging for now, and the site’s back up and running while we poke at recovering or replacing the file server.

Both of these problems can be ameliorated in the future with some more failure-proof design:

  • Spreading text storage clusters across multiple racks will protect against localized power or network failures
  • Moving debug logs to a UDP-based system will give centralized logging a more graceful failure mode than hanging on NFS shares
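
To show why UDP fails more gracefully, here’s a small Python sketch of a fire-and-forget logger. This isn’t our actual logging code, and the `UdpLogger` class is hypothetical. The idea is that sending a datagram never waits on the log server: if the server is down, messages are simply lost instead of the web server hanging.

```python
import socket


class UdpLogger:
    """Send log lines as UDP datagrams without ever blocking the caller."""

    def __init__(self, host, port):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Non-blocking: a full send buffer raises an error instead of stalling.
        self.sock.setblocking(False)

    def log(self, message):
        """Try to send one message; return False if it had to be dropped."""
        try:
            self.sock.sendto(message.encode("utf-8"), self.addr)
            return True
        except OSError:
            # Losing a debug line is acceptable; hanging the request is not.
            return False
```

The tradeoff is that delivery isn’t guaranteed, which is fine for debug logs but wouldn’t be for anything we can’t afford to lose.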

MediaWiki hosting for SourceForge projects

The other day SourceForge launched their new hosted apps system, allowing SF-hosted open source projects to much more easily set up some web tools for their projects. The apps available at launch include phpBB, MediaWiki, and LimeSurvey.

While it’s been possible to run MediaWiki in your SourceForge project web space for a long time, it’s been a little tricky to set up, particularly as they’ve tightened security configurations in the last couple years.

Centralized administration, authentication, and maintenance should make it a lot easier for SourceForge project admins to get a wiki up and running for their project, and more wiki equals more fun! ;)

Handheld and print style customization

After a previous reworking of MediaWiki’s stylesheet-handling code to allow adding handheld stylesheets, I’ve gone ahead and implemented bug 2889, adding per-site customizable MediaWiki:Print.css and MediaWiki:Handheld.css pages.

The ability to specify some handheld tweaks is needed to be able to work around issues with certain kinds of layout formatting, especially the big beautiful multi-column table layouts which are popular on portal and main pages.

While lovely on a large screen, on a small device they tend to either make the columns reaaaally tiny or push things out off screen. On English Wikipedia I’ve thrown in some quick style hacks to flatten out those tables on the main page (this was applied already by Opera Mini’s classic view, but not Opera’s other browsers in small-screen mode):

(Screenshots: before and after.)

There are still improvements that could be made, but it at least helps things fit on screen! MediaWiki:Handheld.css can be edited on each of our wikis to tweak things as desired/required.

Of course it’s always best to try to use clean, scalable styles that work on small screens to begin with. :)