Why is everything broken this week? :)

We’ve tracked down today’s problems to a combination of a couple of things:

  1. There’ve been ongoing database locking issues with the site statistics updates — these would all block on each other, making page saves very slow at times
  2. … which held open database connections, causing the text storage servers to start locking out new connections …
  3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.

The code changes have been rolled back, fixing the slow site load behavior. (Unfortunately, doing this correctly was a bit painful: we had to restore the broken code for a while to work out what was going on well enough to fully revert it.)

Domas believes the main culprit in the database locking is actually an issue with our mail server: some actions (such as creation of new accounts) involve both sending mail and updating the site statistics table. With the mail server overloaded, and a very simple local mail client called from MediaWiki, the outgoing mail would sometimes hang while the transaction was still open. The open transaction kept holding its locks, so other updates stalled behind it.
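To make that failure mode concrete, here is a minimal sketch in Python using SQLite as a stand-in. It is not MediaWiki's code: the `site_stats` table layout and the function name are hypothetical, and SQLite locks the whole database rather than individual rows, so it only approximates what our database servers do. The shape of the problem is the same, though: one writer opens a transaction and then sits waiting on something slow, and a second writer cannot get in.

```python
import os
import sqlite3
import tempfile


def demo_lock_held_during_mail():
    """Return True if a second writer is blocked while the first one 'sends mail'."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    first = sqlite3.connect(path, isolation_level=None)
    first.execute("CREATE TABLE site_stats (users INTEGER)")  # hypothetical schema
    first.execute("INSERT INTO site_stats VALUES (0)")

    # Account creation: open a transaction and touch the stats row.
    first.execute("BEGIN IMMEDIATE")
    first.execute("UPDATE site_stats SET users = users + 1")
    # Imagine the outgoing mail call hanging here, with the transaction still open.

    # A second writer gives up after 0.1 seconds of waiting for the lock.
    second = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    try:
        second.execute("BEGIN IMMEDIATE")
        second.execute("ROLLBACK")
        blocked = False
    except sqlite3.OperationalError:
        blocked = True  # SQLite reports that the database is locked
    finally:
        first.execute("COMMIT")
        first.close()
        second.close()
    return blocked
```

In this sketch the second writer fails quickly because of its short timeout; with a long or absent timeout it would simply wait, which is closer to the stalled page saves we saw.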

As a temporary measure I’ve disabled the site stats updates, fixing the failures on page save. (The stats will need to be brought back up to date once we’ve fully resolved the underlying problem.)

We’re looking at the way the mail servers are set up to see if we can ensure that internal connections don’t stall the way they have been; we should also be able to rearrange the transactions so that things are committed before the mail goes out!
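As a rough illustration of the "commit first, mail afterwards" ordering, here is a minimal Python sketch using SQLite. It is an assumption about the general shape of the fix, not the actual MediaWiki change: the `accounts` and `site_stats` tables, `create_account`, and the `send_mail` callback are all hypothetical names.

```python
import sqlite3


def create_account(conn, name, send_mail):
    """Commit the database work first, then send mail outside the transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO accounts (name) VALUES (?)", (name,))
        conn.execute("UPDATE site_stats SET users = users + 1")
    # The transaction has ended, so a slow mailer no longer holds any locks.
    send_mail(name)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT)")  # hypothetical schema
    conn.execute("CREATE TABLE site_stats (users INTEGER)")
    conn.execute("INSERT INTO site_stats VALUES (0)")
    conn.commit()

    seen = []

    def fake_mail(name):
        # Record whether a transaction is still open when the mail is "sent".
        seen.append((name, conn.in_transaction))

    create_account(conn, "Example", fake_mail)
    return seen
```

The trade-off with this ordering is that a failed mail no longer rolls back the account creation, so the mail step would need its own error handling or retry.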