Wiki data dumps

There are a few things we can do to fix up the data dump process again…

  • New machines: Moving the dump runners from the old boxes they’re on to a couple of our newer quad-core boxes should improve things.
  • Improve parallelism: When generating bzip2 files, ensuring that the dbzip2 configuration is set up properly may help.

    For .7z files, not sure… There’s a note in the changelog for p7zip that LZMA compression is multithreaded as of 4.52; if that gives a good speedup on 2x and 4x boxes, that could be very attractive.

    Figuring out more ways to split across machines could be beneficial as well.

  • Improve read speed: 7z is a *lot* faster to decompress than bzip2. Using the prior .7z dump instead of the prior .bz2 could help speed things up, but last time I tried that I had problems with the pipes not closing properly, leading to hangs at the end.
  • More robust dumpTextPass: the longer it takes, the more likely it is to die due to a database burp. If the actual database-reading part is pushed out to a subprocess, that can be easily restarted after an error while the parent process, which is based around reading stub and previous XML files, keeps on going. (There's a rough sketch of this just after the list.)
  • Splitting files? There’s some thought that dividing the dumps into smaller pieces might be more reliable, as each piece can be re-run separately if it breaks — as well as potentially run in parallel on separate boxes.
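
To make the dumpTextPass idea concrete, here's a minimal sketch in Python (the real tool is PHP; the helper command and the little length-prefixed protocol here are made up for the example):

    import subprocess
    import time

    # Stand-in for the database-reading helper; whatever actually fetches
    # revision text from the DB would go here.
    FETCH_CMD = ["php", "fetchTextHelper.php"]

    class TextFetcher:
        """Keep one long-lived child process around; if it dies (say, from
        a database burp), restart it and retry instead of aborting the
        whole dump pass."""

        def __init__(self):
            self.child = None

        def _spawn(self):
            self.child = subprocess.Popen(
                FETCH_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

        def fetch(self, rev_id, retries=5):
            for attempt in range(retries):
                try:
                    if self.child is None or self.child.poll() is not None:
                        self._spawn()
                    # Made-up protocol: send a revision id, read back a
                    # length line followed by that many bytes of text.
                    self.child.stdin.write(("%d\n" % rev_id).encode())
                    self.child.stdin.flush()
                    length = int(self.child.stdout.readline())
                    return self.child.stdout.read(length)
                except (OSError, ValueError):
                    self.child = None         # child died mid-request
                    time.sleep(2 ** attempt)  # back off, then restart it
            raise RuntimeError("revision %d kept failing" % rev_id)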

It may also be worth dropping the .bz2 files in favor of .7z, especially if we can speed up the compression.

Also have to check whether J7Zip can decompress the files and whether it'd be possible to integrate it into mwdumper; I don't like having to rely on an external tool for decompression. I haven't had any luck with the older java_lzma.

Update: Just for fun, I tried compressing & decompressing a 100 meg SQL file chunk with various proggies on my iMac (x86_64 Linux, 2.16 GHz Core 2 Duo). Parallelism is measured here as (user time / (real time – system time)) as an approximation of multi-core CPU utilization; left out where only one CPU is getting used.

Program     Comp time   Comp parallelism   Decomp time
gzip        10.354 s                       1.616 s
bzip2       17.018 s                       5.542 s
dbzip2      10.136 s    1.95x
7zr 4.43    81.603 s    1.47x              3.771 s
7za 4.55    98.201 s    1.46x              3.523 s

Nothing seems to be doing parallelism on decompression.

p7zip seems to be using the second CPU for something during compression, but it’s not fully utilized, which leads me to suspect it won’t get any faster on a four-way or eight-way box. dbzip2 should scale up a couple more levels, and can use additional nodes over the network, but you’re still stuck with slow decompression.
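
For reference, the parallelism column above is just the user / (real - sys) ratio described earlier. The numbers in this little Python snippet are made up purely to show the arithmetic:

    def parallelism(user, real, sys):
        """Approximate multi-core utilization: user CPU time divided by
        the wall-clock time not spent in the kernel."""
        return user / (real - sys)

    # Made-up numbers (seconds), roughly the shape of the 7zr run above:
    # plenty of user CPU squeezed into a shorter wall-clock time.
    print("%.2fx" % parallelism(user=120.0, real=81.6, sys=0.5))  # ~1.48x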

Update: I noticed that 7za also can work with .bz2 archives — and it does parallelize decompression! Nice.

Program         Comp time   Comp parallel   Decomp time   Decomp parallel
7za 4.55 .bz2   17.668 s    1.99x           2.999 s       1.91x

The compression speed is worse than regular bzip2, but the decompression is faster. Innnnteresting…
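
If that holds up, one way to take advantage of it from a script (and to be careful about the pipe-closing hangs mentioned above) might look like this sketch. It assumes 7za is on the PATH, that "e -so" streams the extracted data to stdout, and the filename is just an example:

    import subprocess

    def open_dump(path):
        """Stream a .bz2 (or .7z) dump through 7za rather than Python's
        own bz2 module, so decompression can use more than one core."""
        return subprocess.Popen(["7za", "e", "-so", path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL)

    proc = open_dump("pages-meta-history.xml.bz2")  # example filename
    for line in proc.stdout:
        pass  # hand each line to the parser / importer here

    # Explicitly close the pipe and reap the child so we don't get the
    # end-of-run hangs mentioned earlier.
    proc.stdout.close()
    proc.wait()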

Update: Some notes on incremental dump generation

9 thoughts on “Wiki data dumps”

  1. Re: multithreaded 7-zip:
    I had assumed it was multithreaded for compression already, from the “2 CPUs” message:
    > 7-Zip 4.43 beta Copyright (c) 1999-2006 Igor Pavlov 2006-09-15
    > p7zip Version 4.43 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
    (Actually, a Google hit said something about it being multithreaded by default since v4.42 rather than v4.52.) However, as you say, it may only be partial. To test it on Linux, I guess you'd "aptitude install sysstat", do a 7-zip compression of something large, and then, to see the load on each CPU every 2 seconds 10 times, run "mpstat -P ALL 2 10". If it shows every CPU at around 100% usage / 0% idle for the average, then presumably 7-zip is fully utilizing all available CPUs/cores? When I tried this with 7-zip 4.43 beta on an older 2-CPU x86 machine (compressing with the "/usr/bin/7z a -mx=9 -mfb=64 -md=32m -bd test.sql.7z test.sql" command line), I got this result:
    ——————————
    Average:  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
    Average:  all  88.55   0.00  0.65     0.00  0.00   0.00    0.00  10.81  253.54
    Average:    0  95.51   0.00  0.50     0.00  0.00   0.00    0.00   3.99  250.25
    Average:    1  81.66   0.00  0.90     0.00  0.00   0.00    0.00  17.55    3.29
    ——————————
    …. so yeah, around 1.89x comp parallelism, whereas ideally it’d be 1.99x. Still, it’s better than 1.00x ;-)

  2. Brion, is it possible to make the dumps a little bit more useful? I mean, is it possible to split very large backups? It's not so useful to get a 10 GB 7z file that uncompresses into a single 100 GB file when my biggest single partition is around 80 GB. OK, I may try some tricks, but…

  3. What would you actually do with a small piece at a time, out of curiosity? What sort of technique would you actually follow that would make smaller pieces helpful?

    That’s not a rhetorical question, I really want to know so we can make this process better for the people using the dumps.

  4. Brion, most of your post addresses optimizations on the current process. This is of course very welcome, and hopefully buys us time. But sooner or later the English Wikipedia will have grown again to a size that takes 6 weeks to dump and then longer.

    You already built a mechanism to append new data to an existing dump. I assume that saves a lot of time, as it saves tremendously on SQL calls. All those joins should mean quite a few I/Os per revision dumped, right? Of course that's mitigated by cached indexes. Yet I wonder:

    Q1: Is this incremental dump system used often or does it have practical obstacles?

    Q2: As reading a full dump takes only about one day, a normal full dump (now 6 weeks for the English Wikipedia) is mostly SQL calls, right?

    ————–

    I could see at least two ways to split up the archive dump and make the process more manageable, at the cost of being somewhat more complicated:

    1. Produce chunks that contain exactly one calendar month of data, each preserved as a separate entity ever after. Every new dump collects only data not collected before: it completes the latest unfinished month and starts a new one.

    A separate queue of article and revision deletions (keys only) is updated on an ongoing basis (appending keys for new deletions). So for every month there is a never-changing dump and a growing delete queue, to be merged on a researcher's PC.

    This way every month needs to be downloaded only once, and for each month a deletion queue a thousand times smaller needs to be redownloaded and reapplied occasionally (of course in an automated fashion).

    One downside might be that users do not apply the latest deletions. To this I would say it is unavoidable anyway that users keep old dumps that contain officially banned records, if they wish to do so. The well-behaved path is mainly there to provide proper data for research, stats and fork purposes.

    I know the total size of the dumps would grow due to less optimal compression, as a bad side effect. But this system greatly shortens both dump and download times.

    2. Another way to make the process more easily restartable would be to produce a new dump every time, but write to a new file for every 100,000 articles. The current English dump would result in +/- 25 files. This would not save on dump time, but would greatly simplify disaster recovery and restartability. The process might even be distributed over several servers, each doing a part of the English Wikipedia. (There's a rough sketch of this kind of splitting after the comments.)

    In both scenarios it means either that
    – existing scripts (like wikistats) need to be adapted to handle multiple input files, or
    – a new script needs to be built to merge chunks into an archive file as we know it now.

    —-

    Unrelated: I sent you mail several weeks ago at two addresses. Did you get any? If not, please mail me and I will respond to that. Thanks.

  5. I find smaller, piecewise dumps to be a necessity for my work — so that if I run some code on a dump, and the code breaks, I don’t have to re-start from scratch.
    In fact, I wrote some code to take a single dump and break it into smaller dumps — go to http://trust.cse.ucsc.edu/ and follow the link to Code.

    From what Brion told me in another discussion, much of the time required for a dump is actually taken up by the compression phase. The actual dump takes only a couple of days.
    Since the Feb 6 07 dump has already been compressed, I think it would not be necessary to produce a single compressed dump after that; monthly compressed incremental dumps are all that is required.

  6. Brion, I would be able to analyze data from smaller parts:

    – wget part1
    – analyze part1
    – wget part2
    – analyze part2
    – …
    – merge data

    And sorry for the late response.

  7. I subscribe to all your improvement ideas, Brion, as well as the ideas proposed by Erik.

    In case this is useful for you, I am currently finishing a new version of my Python parser. I tested 7zip decompression several times, piped to the parser and then into MySQL. I can confirm that 7zip parallelizes decompression much better (I use two 2 GHz AMD Santa Rosa CPUs, 4 cores in total, and a 3Ware RAID-6 with 8 fast Seagate SATA-II disks).

    Big Wikipedia dumps crash (no matter the language edition) around the 4,500,000 revision mark when piped, and exception handling is required (so I now skip the pipes and use a library to connect directly to the DB). Therefore, chunks will be more than welcome. Incremental backups will be far more flexible than the current solution.

    I already take advantage of extended inserts in the parser, to speed up the recovery process even more.
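
Since several of the comments above ask for chunked dumps, here is a minimal sketch of the splitting idea from comments 4 and 5. It is not how the dump scripts work; it just cuts a pages dump into a new file after every N closing </page> tags, and the resulting pieces are raw fragments of the stream rather than standalone well-formed XML:

    import bz2

    PAGES_PER_CHUNK = 100000  # the "new file every 100,000 articles" idea

    def split_dump(path, prefix):
        """Write numbered fragment files, starting a new one after every
        PAGES_PER_CHUNK closing </page> tags in the decompressed stream."""
        chunk, pages = 0, 0
        out = open("%s-%04d.xml" % (prefix, chunk), "wb")
        with bz2.open(path, "rb") as dump:
            for line in dump:
                out.write(line)
                if b"</page>" in line:
                    pages += 1
                    if pages % PAGES_PER_CHUNK == 0:
                        out.close()
                        chunk += 1
                        out = open("%s-%04d.xml" % (prefix, chunk), "wb")
        out.close()

    split_dump("pages-meta-history.xml.bz2", "chunk")  # example filename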
