Wikimedia’s public image and media file uploads archive has been growing and growing and growing over the years, nowadays easily eating 1.5 TB or so.
This has made it harder to provide publicly downloadable copies, as well as to maintain our own internal backup copies — and with the servers sitting in a hurricane zone, not having backups is considered bad form.
In the terabyte range, making a giant tar archive is kind of… difficult. Not only is it insanely hard to download the whole thing if you want it, but it multiplies our space requirements: you need room for every complete archive, and every variant of it, on top of all the original files. Plus it just takes forever to make them.
rsync seems like a natural fit for updating and then synchronizing a large set of medium-size files, but as we’ve grown it has become slower and slower and slower.
The problem was that rsync worked in two steps:
First, it scanned the directory trees to list all the files it might have to transfer.
Then, once that was done, it transferred the ones which needed to be updated.
Great for a few thousand files — rotten for a few million! On an internal test, I found that after a full day the rsync process was still transferring nothing but filenames — and its in-memory file list was using 2.6 *gigabytes* of RAM.
Not ideal. :)
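To make the failure mode concrete, here’s a toy sketch (in Python, and emphatically not rsync’s actual code) of that same “scan everything, then transfer” pattern. The copy_if_changed helper is a made-up stand-in for rsync’s real delta-transfer logic:

```python
import os
import shutil

def copy_if_changed(src_path, dst_path):
    # Hypothetical stand-in for rsync's delta transfer: skip files whose
    # size and mtime already match, copy everything else.
    try:
        s, d = os.stat(src_path), os.stat(dst_path)
        if s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime):
            return
    except FileNotFoundError:
        pass
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    shutil.copy2(src_path, dst_path)

def sync_whole_tree(src, dst):
    # Pre-3.0 style: enumerate the *entire* tree up front...
    file_list = []
    for dirpath, _dirnames, filenames in os.walk(src):
        for name in filenames:
            file_list.append(os.path.relpath(os.path.join(dirpath, name), src))
    # ...and only then start transferring. With millions of files, that
    # in-memory list is exactly the thing that balloons to gigabytes of RAM.
    for rel in file_list:
        copy_if_changed(os.path.join(src, rel), os.path.join(dst, rel))
```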
Searching around, I stumbled upon the interesting fact that the upcoming rsync 3.0 does “incremental recursion” — that is, it does that little “list, then transfer” cycle for each individual directory instead of for the entire file set at once.
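Sketched the same way, with the same made-up copy_if_changed helper as above, the incremental idea is to do a small list-and-transfer pass per directory, so memory stays bounded by one directory’s worth of entries rather than the whole multi-million-file tree. Again, this is just an illustration of the concept, not rsync’s implementation:

```python
import os

def sync_incremental(src, dst):
    # rsync 3.0-style idea: list and transfer one directory at a time,
    # queueing subdirectories for later instead of scanning the whole tree.
    pending = [""]  # directories still to visit, relative to src
    while pending:
        rel_dir = pending.pop()
        with os.scandir(os.path.join(src, rel_dir)) as entries:
            for entry in entries:
                rel = os.path.join(rel_dir, entry.name)
                if entry.is_dir(follow_symlinks=False):
                    pending.append(rel)  # handled in a later pass
                else:
                    # copy_if_changed is the stand-in helper from the sketch above
                    copy_if_changed(os.path.join(src, rel), os.path.join(dst, rel))
```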
I grabbed a development tree from CVS, compiled it up, and gave it a try — within seconds I started to see files actually being transferred.
Hallelujah! rsync works again…
We’re now in the process of getting our internal backup synced up again, and will see about getting an off-site backup and maybe a public rsync 3.0 server set up.