Incremental dumps

A follow-up to my previous notes on dumps

As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes:

  1. dumpBackup.php --skeleton pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
  2. dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.

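To make that concrete, here is a minimal sketch of the idea behind the second pass (not the real dumpTextPass.php): build a revision-id-to-text map from the previous complete dump, then fill in the skeleton, going to the database only for revisions the previous dump did not contain. It assumes un-namespaced export XML like the example further down, and fetch_text_from_db() is a hypothetical stand-in for the real text-storage lookup.

import xml.etree.ElementTree as ET

def fetch_text_from_db(rev_id):
    # Placeholder: in reality this hits the text storage databases.
    raise NotImplementedError

def load_old_texts(previous_dump_path):
    """Map revision id -> text from the previous complete dump."""
    old_texts = {}
    for _, elem in ET.iterparse(previous_dump_path, events=('end',)):
        if elem.tag == 'revision':
            old_texts[elem.findtext('id')] = elem.findtext('text') or ''
            elem.clear()  # free the parsed text as we stream along
    return old_texts

def text_for_revision(rev_id, old_texts):
    text = old_texts.get(rev_id)
    if text is None:
        # Only revisions created since the previous dump reach here.
        text = fetch_text_from_db(rev_id)
    return text
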
It should be relatively easy to modify this technique to create an incremental dump file, which, instead of listing out every page and revision in the entire system, would list only those that have changed.

The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:

<mediawiki>
  <page action="create">
    <!-- Creating a new page -->
    <id>10</id>
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>100</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <text>...</text>
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <id>11</id>
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>110</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <comment>Renamed from "Old title" to "New title"</comment>
      <text>...</text>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <id>12</id>
    <revision action="delete">
      <id>120</id>
    </revision>
  </page>
</mediawiki>
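
As a rough sketch of how an importer might consume this, here is a hypothetical bit of Python that streams through such an incremental dump and dispatches on the action attributes. It only prints the operations it would perform, the default actions chosen here are arbitrary rather than part of the schema, and a real importer would also have to handle the key conflicts discussed below.

import xml.etree.ElementTree as ET

def apply_incremental(path):
    for _, page in ET.iterparse(path, events=('end',)):
        if page.tag != 'page':
            continue
        page_id = page.findtext('id')
        if page.get('action', 'update') == 'delete':
            print('DELETE page %s and its revisions' % page_id)
        else:
            # 'create' and 'update' both boil down to writing the
            # current page record.
            print('WRITE page %s title=%r' % (page_id, page.findtext('title')))
        for rev in page.findall('revision'):
            rev_id = rev.findtext('id')
            if rev.get('action', 'create') == 'delete':
                print('  DELETE revision %s' % rev_id)
            else:
                print('  WRITE revision %s for page %s' % (rev_id, page_id))
        page.clear()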

Perhaps those attributes could be pushed down to finer granularity, for instance to indicate whether a page’s title actually changed, to avoid unnecessary updates, but I’m not sure how much it would really matter.

There are a few scenarios to take into account as far as interaction with unique keys is concerned:

  • Page titles (page_namespace, page_title): a page rename can cause a temporary conflict between two pages partway through applying the changes, between one record and the next.
  • Revision IDs (rev_id): History merges could cause a revision to be ‘added’ to one page and ‘removed’ from another page that appears later in the data set. The insertion would trigger a key conflict.

We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or we could perhaps use REPLACE INTO instead of INSERT INTO in all cases… that could leave some entries temporarily deleted while the changes are being applied, but they should come back later on, so the final result is consistent.
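
A minimal sketch of the REPLACE route, assuming a DB-API style MySQL cursor; the table and column names are MediaWiki's, but the column list is trimmed for brevity and a real importer would supply the full page row:

def write_page(cursor, page_id, namespace, title, latest_rev):
    # REPLACE first deletes any existing row that collides on a unique key
    # (the page_id primary key or the (page_namespace, page_title) name
    # index) and then inserts, so a title temporarily held by a page whose
    # rename hasn't been applied yet won't abort the import.
    cursor.execute(
        "REPLACE INTO page (page_id, page_namespace, page_title, page_latest)"
        " VALUES (%s, %s, %s, %s)",
        (page_id, namespace, title, latest_rev),
    )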

In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and not _insanely_ bad even when there are (about 80% slower in my unscientific benchmark), so when conflicts are rare that’s probably just fine. At least for MySQL targets. :D
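
For reference, a REPLACE variant of an MWDumper SQL file can be produced with a trivial rewrite along these lines; it assumes the INSERT statements start at the beginning of a line and that the file names match the ones in the timings below.

# One-off rewrite: turn the INSERT statements into REPLACE statements so
# the same file can be imported over a populated database without
# duplicate-key errors.
with open('insert.sql') as src, open('replace.sql', 'w') as dst:
    for line in src:
        if line.startswith('INSERT INTO'):
            line = 'REPLACE INTO' + line[len('INSERT INTO'):]
        dst.write(line)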

Test imports of ia.wikipedia.org full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time for each run:

$ time mysql -u root working < insert.sql 

real    0m20.819s
user    0m5.537s
sys     0m0.648s

Modified to use REPLACE instead of INSERT, on a fresh empty database:

$ time mysql -u root working < replace.sql 

real    0m20.557s
user    0m5.530s
sys     0m0.643s

Importing completely over a full database:

$ time mysql -u root working < replace.sql 

real    0m34.109s
user    0m5.533s
sys     0m0.641s

So that's probably feasible. :)

In theory an incremental dump could be made against a previous skeleton dump as well as against full dumps, which would make it possible to create additional incremental dumps even if full-text dumps fail or are skipped.

4 thoughts on “Incremental dumps”

  1. Brion, to me the proposed incremental dump system seems to have several drawbacks. It gives me the impression of an attempt to patch rather than redesign a system with inherent deficiencies.

    I see a few drawbacks in the proposed scheme:

    1: It is less KISS, probably technically and certainly conceptually, than a monthly (or quarterly) archive.

    2: It presumes there is a full dump to start with, which is currently not the case for e.g. en:. It does not suggest any solution for producing that initial full dump in the first place, or for rebuilding that initial full dump two years from now, after yet another small accident occurs or the XML layout changes.

    3: If one dump is incomplete or otherwise flawed while the log claims success (as has happened before), this will be even harder to spot than before, as incremental dumps have a less predictable size.

    4: It makes consecutive incremental dumps highly interdependent: on the one hand, in terms of patching a broken dump (4A); on the other, in terms of processing requirements for batch jobs that process the dump (4B).

    4A: It makes repairing a faulty incremental dump very difficult. Say the first incremental dump in a sequence of 5 is belatedly found to be deficient. One would need to rebuild all 5 incremental dumps, either one by one, or replace them with a huge incremental dump, with all the perils of being too large and unwieldy and breaking prematurely, as current full dumps tend to do. With fixed-period dumps this problem could be avoided.

    4B1: Each (say quarterly) fixed-period dump could be self-contained by reiterating, for each article, the last revision from the previous period. This way a researcher who focuses exclusively on patterns in, say, one calendar year could completely ignore dumps older than that year, thus saving lots of hard disk space and processor time, and, possibly even more important, dump download times.

    4B2: It means that processes that batch-process the dumps (like wikistats and others) will have to parse the dumps twice: first to extract deletion info, then to filter deleted articles/revisions from the dump before further processing (e.g. counting). Separating article content (in static fixed-period dumps) and deletion info (in a growing delete/rename queue) from the start would make this less cumbersome.

  2. 1: I don’t understand why you think these are mutually exclusive.

    2: False; as noted above, incremental dumps can be made against skeleton dumps (which are relatively quick to create compared to the full-text dumps, and nearly never fail).

    3: See 2.

    4: Incremental dumps would patch from one dump to the next. I would expect full dumps to be available for all of them as well. If you get lost or corrupt your data, you can just start again.

    4A: Each incremental dump contains the differences from one full dataset to the next; its creation is not dependent on other prior dumps.

    4B1: Full dumps are already expected at regular intervals.

    4B2: Eh?

  3. 1: What feedback on my earlier suggestions could give me the impression you saw some merit in the idea of partial dumps containing fixed intervals?

    2/3/4/etc: So we will have dependable regular full backups soon, and for years to come? I’m sure you wanted that to be a surprise, and now you’ve given the secret away. Oh my :)
