rsync 3.0 may rock, but it’s also kinda crashy. :(
It’s died a couple of times while syncing up our ~2TB file upload backup. I’ve attached gdb to the running processes to at least get some backtraces the next time it goes.
Update: Got a backtrace…
#0  0x00002b145663886b in free () from /lib/libc.so.6
#1  0x000000000043277b in pool_free_old (p=0x578730, addr=) at lib/pool_alloc.c:266
#2  0x0000000000404374 in flist_free (flist=0x89e960) at flist.c:2239
#3  0x000000000040ddbc in check_for_finished_files (itemizing=1, code=FLOG, check_redo=0) at generator.c:1851
#4  0x000000000040e189 in generate_files (f_out=4, local_name=) at generator.c:1960
#5  0x000000000041753c in do_recv (f_in=5, f_out=4, local_name=0x0) at main.c:774
#6  0x00000000004177ac in client_run (f_in=5, f_out=, pid=25539, argc=, argv=0x56cf98) at main.c:1021
#7  0x0000000000418706 in main (argc=2, argv=0x56cf90) at main.c:1185
Found and patched a probably unrelated bug in pool_alloc.c.
Further update, the next day:
The crashy bug should be fixed now. Yay!
Noticed a couple of neat bits while combing through the changelog for the PHP 5.2.4 release candidate…
- Changed “display_errors” php.ini option to accept “stderr” as value which makes the error messages to be outputted to STDERR instead of STDOUT with CGI and CLI SAPIs (FR #22839). (Jani)
This warms the cockles of my heart! We do a lot of command-line maintenance scripts for MediaWiki, and it’s rather annoying to have error output spew to stdout by default. Being able to direct it to stderr, where it won’t interfere with the main output stream, should be very nice.
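The benefit isn’t PHP-specific, so here’s a rough sketch of the pattern in Python (a hypothetical maintenance script, not anything from MediaWiki): real data goes to stdout where it can be piped or redirected, while diagnostics go to stderr and stay out of the way.

```python
import sys

def export_records(records):
    # The script's real output: safe to pipe or redirect to a file.
    for name, count in records:
        print(f"{name}\t{count}")

def warn(message):
    # Diagnostics go to stderr, so they never pollute the output stream.
    print(f"warning: {message}", file=sys.stderr)

if __name__ == "__main__":
    records = [("Main_Page", 12), ("Sandbox", 3)]
    warn("skipped 1 malformed row")
    export_records(records)
```

Run it as `script.py > data.tsv` and the warning still shows up on the terminal while the file stays clean — exactly what the new “stderr” value for display_errors buys you with PHP CLI scripts.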
- Changed error handler to send HTTP 500 instead of blank page on PHP errors. (Dmitry, Andrei Nigmatulin)
In theory this should give nicer results when the software appears to *just die* for no reason — with display_errors off, if you hit a PHP fatal error the code just stops and nothing else gets output. In an app that does its processing before any output, the result is a blank page with no cues to the user as to what happened.
Unfortunately it looks like it’s only going to be a help to machine processing, and even then only for the blank-page case. :(
In my quick testing, I only get the 500 error when there was no output done… and it *still* returns blank output, it just comes with the 500 result code.
The plus side is this should keep blank errors out of Google and other search indexes; the minus side is it won’t help with fatal errors that come in the middle of output, or the rather common case of sites which leave display_errors on… because then the error message gets output, so you don’t get a 500 result code.
Wikimedia’s public image and media file uploads archive has been growing and growing and growing over the years, nowadays easily eating 1.5 TB or so.
This has made it harder to provide publicly downloadable copies, as well as to maintain our own internal backup copies — and not having backups in a hurricane zone is considered bad form.
In the terabyte range, making a giant tar archive is kind of… difficult. Not only is it insanely hard to download the whole thing if you want it, but it multiplies our space requirements — you need space for every complete and variant archive as well as all the original files. Plus it just takes forever to make them.
rsync seems like a natural fit for updating and then synchronizing a large set of medium-size files, but as we’ve grown it became slower and slower and slower.
The problem was that rsync worked in two steps:
First, it scanned the directory trees to list all the files it might have to transfer.
Then, once that was done, it transferred the ones which needed to be updated.
Great for a few thousand files — rotten for a few million! On an internal test, I found that after a full day the rsync process was still transferring nothing but filenames — and its in-memory file list was using 2.6 *gigabytes* of RAM.
Not ideal. :)
Searching around, I stumbled upon the interesting fact that the upcoming rsync 3.0 does “incremental recursion” — that is, it does that little “list, then transfer” cycle for each individual directory instead of for the entire file set at once.
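The difference between the two strategies can be sketched roughly like this (an illustration of the idea, not rsync’s actual code): the old behavior builds the complete file list before transferring anything, while incremental recursion yields one directory’s worth at a time so transfer can begin almost immediately.

```python
import os

def full_scan(root):
    """Old-style: build the entire file list in memory up front.
    Memory use grows with the total number of files, and transfer
    can't start until the whole scan finishes."""
    file_list = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_list.append(os.path.join(dirpath, name))
    return file_list  # transfer begins only after this completes

def incremental_scan(root):
    """Incremental-recursion style: yield one directory's batch at a
    time, so the caller can transfer each batch while the next
    directory is being scanned. Memory stays roughly one directory deep."""
    for dirpath, _dirnames, filenames in os.walk(root):
        batch = [os.path.join(dirpath, name) for name in filenames]
        if batch:
            yield batch
```

Both walks visit the same files; the win is entirely in when transfer can start and how much file-list state sits in memory at once.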
I grabbed a development tree from CVS, compiled it up, and gave it a try — within seconds I started to see files actually being transferred.
Hallelujah! rsync works again…
We’re now in the process of getting our internal backup synced up again, and will see about getting an off-site backup and maybe a public rsync 3.0 server set up.