A follow-up to my previous notes on dumps…
As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes:
- dumpBackup.php --skeleton pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
- dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.
It should be relatively easy to modify this technique to create an incremental dump file which, instead of listing every page and revision in the entire system, would list only those that have changed.
The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:
<mediawiki>
  <page action="create">
    <!-- Creating a new page -->
    <id>10</id>
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>100</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <text>...</text>
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <id>11</id>
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>110</id>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor>...</contributor>
      <comment>Renamed from "Old title" to "New title"</comment>
      <text>...</text>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <id>12</id>
    <revision action="delete">
      <id>120</id>
    </revision>
  </page>
</mediawiki>
Perhaps those could be pushed down to finer granularity (for instance, indicating whether a page's title actually changed) to avoid unnecessary updates, but I'm not sure how much it'd really matter.
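To make the application step concrete, here's roughly the SQL an importer might emit for the example above, against the 1.5-schema page and revision tables. This is just a sketch; the column lists are trimmed way down (the real tables have more required fields):

-- page action="create": a brand-new page row, plus its first revision
INSERT INTO page (page_id, page_namespace, page_title)
  VALUES (10, 0, 'A_new_page');
INSERT INTO revision (rev_id, rev_page, rev_timestamp)
  VALUES (100, 10, '20010115140300');

-- page action="update": overwrite the changed fields in place
UPDATE page SET page_title = 'New_title' WHERE page_id = 11;
INSERT INTO revision (rev_id, rev_page, rev_timestamp, rev_comment)
  VALUES (110, 11, '20010115140300',
          'Renamed from "Old title" to "New title"');

-- page action="delete": drop the page's revisions, then the page itself
DELETE FROM revision WHERE rev_id = 120;
DELETE FROM page WHERE page_id = 12;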
There are a few scenarios to take into account where the incremental records interact with unique keys:
- Page titles (page_namespace, page_title): a rename can put two pages into temporary conflict partway through applying the dump, with one page's new title colliding with another page's not-yet-updated title (see the sketch after this list).
- Revision IDs (rev_id): history merges could cause a revision to be ‘added’ to one page and ‘removed’ from another that appears later in the data set; the insertion would trigger a key conflict.
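Here's the title case spelled out. Suppose page 11 was renamed from 'Old title' to 'New title' (as in the example above), and a hypothetical new page 13 was then created at the vacated title. If the batch happens to order the new page's record before the rename, a plain INSERT trips over the unique key page 11 is still holding:

-- Page 11 still holds (0, 'Old_title') at this point in the batch:
INSERT INTO page (page_id, page_namespace, page_title)
  VALUES (13, 0, 'Old_title');   -- fails: ERROR 1062, duplicate key
-- ...even though the rename that would clear the way is just ahead:
UPDATE page SET page_title = 'New_title' WHERE page_id = 11;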
We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or we could perhaps use REPLACE INTO instead of INSERT INTO in all cases… that could leave entries deleted partway through the application, but they should come back later on, so the final result is consistent.
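With REPLACE, the same pair of records from the sketch above goes through cleanly. Note this assumes page updates are also written out as full-row REPLACEs rather than UPDATEs, so that a row knocked out mid-batch really does come back:

-- REPLACE silently deletes page 11's row, which holds the conflicting
-- (page_namespace, page_title) key, then inserts page 13:
REPLACE INTO page (page_id, page_namespace, page_title)
  VALUES (13, 0, 'Old_title');
-- Page 11's own record later in the batch re-creates it:
REPLACE INTO page (page_id, page_namespace, page_title)
  VALUES (11, 0, 'New_title');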
In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and not _insanely_ bad even when every row conflicts (about 65% slower in my unscientific benchmark below), so when conflicts are rare that's probably just fine. At least for MySQL targets. :D
Test imports of the ia.wikipedia.org full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time over several runs of each:
$ time mysql -u root working < insert.sql
real 0m20.819s
user 0m5.537s
sys 0m0.648s
Modified to use REPLACE instead of INSERT, on a fresh empty database:
$ time mysql -u root working < replace.sql
real 0m20.557s
user 0m5.530s
sys 0m0.643s
Importing the REPLACE version again, over the already-populated database (so every row conflicts):
$ time mysql -u root working < replace.sql
real 0m34.109s
user 0m5.533s
sys 0m0.641s
So that's probably feasible. :)
In theory an incremental dump could be made against a previous skeleton dump as well as against a full dump, which would make it possible to keep producing incremental dumps even if full-text dumps fail or are skipped.