Wikipedia downtime 2x today

Well, today was exciting! Wikimedia’s sites experienced two downtime events.

The first, which lasted about 30 minutes, was due to a power problem. While Rob was doing power maintenance in rack B2, power was inadvertently shut off to an access switch serving another rack of servers, which took a chunk of our core text storage offline.

The second, which also lasted about 30 minutes, was caused by a file server failure. The file server that holds our NFS home directories and misc files and logs experienced a kernel crash, then turned up some disk errors on reboot. (Possibly two failed drives, which may hose the array.)

Ideally this wouldn’t disturb production web serving, but various debugging logs were being saved onto this server, and this caused the web servers to hang waiting for NFS to come back up.

We’ve disabled the internal debug logging for now, and the site’s back up and running while we poke at recovering or replacing the file server.

Both of these problems can be ameliorated in the future with some more failure-proof design:

  • Spreading text storage clusters across multiple racks will protect against localized power or network failures
  • Moving debug logs to a UDP-based system will give centralized logging a more graceful failure mode than hanging NFS shares
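
To illustrate the second point, here is a minimal sketch of fire-and-forget logging over UDP. It is not our actual logging code: the function names and the collector address are hypothetical, and it assumes the collector is given as an IP address so no DNS lookup is needed. The point is only that a UDP send returns right away whether or not anything is listening, so a dead log server costs us some log lines instead of hanging the web servers.

```python
import socket


def make_log_socket():
    """Create a non-blocking UDP socket for sending log lines."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)  # never wait on the network stack
    return sock


def send_debug_log(sock, message, collector):
    """Send one log line as a single UDP datagram.

    `collector` is a hypothetical (ip, port) tuple for the log host.
    Returns True if the datagram was handed to the network stack and
    False if the local send failed. There is no delivery guarantee:
    if the collector is down, the line is simply lost.
    """
    try:
        sock.sendto(message.encode("utf-8"), collector)
        return True
    except OSError:
        # Covers a full send buffer, unreachable network, etc.
        # Dropping a debug line is better than stalling a request.
        return False
```

A web server would create the socket once and call something like `send_debug_log(sock, "slow query on page save", ("10.0.0.5", 8420))` wherever it writes to the NFS-backed log today. The trade-off is that lines can be dropped silently, which is fine for debug output but not for anything that has to be kept.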