Wikipedia data dumps future thoughts

There’s some talk & the beginnings of planning work on a major overhaul of the Wikipedia/Wikimedia data dumps process.

The basic data model for the main content dumps hasn’t changed much in the 10 years or so since I switched us from raw blobs of SQL ‘INSERT’ statements to an XML data stream in order to abstract away upcoming storage schema changes… Fields have been added over the years, and the dump generation process has been partially parallelized, but the core giant-XML-stream model is scaling worse and worse as our data sets continue to grow.

One possibility is to switch away from the idea of producing a single data snapshot in a single or small set of downloadable files… perhaps to a model more like a software version control system, such as git.
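To make the git comparison a bit more concrete, here’s a minimal sketch (an illustration, not a design) of storing page revisions as content-addressed objects the way git stores blobs. The hashing is git’s actual blob scheme; the store layout, directory names, and the store_revision helper are all hypothetical.

```python
import hashlib
import os
import zlib

def git_blob_hash(data: bytes) -> str:
    """Hash content the way git hashes a blob object: sha1 over a
    'blob <length>\\0' header plus the raw bytes."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def store_revision(store_dir: str, wikitext: str) -> str:
    """Write one revision's text into a hypothetical content-addressed
    object store (git-like layout: objects/ab/cdef...), returning its id.
    Identical revision text deduplicates to a single stored object."""
    data = wikitext.encode("utf-8")
    oid = git_blob_hash(data)
    path = os.path.join(store_dir, "objects", oid[:2], oid[2:])
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return oid

# Example: two identical revisions share one stored object.
a = store_revision("/tmp/dump-store", "== Heading ==\nSome wikitext.")
b = store_revision("/tmp/dump-store", "== Heading ==\nSome wikitext.")
assert a == b
```

Content addressing gives deduplication of identical revision text for free, and a commit graph layered over trees of these objects is what would make incremental fetches cheap.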

Specific properties that I think will help:

  • the master data set can change often, in small increments (so consumers can fetch updates frequently)
  • updates along a branch are ordered, making updates easier to reason about
  • local data set can be incrementally updated from the master data set, no matter how long it’s been between updates (so there’s no need to re-download the entire .xml.bz2 every month)
  • network protocol for updates, and access to versioned storage within the data set, can be abstracted behind a common tool or library (so you don’t have to write Yet Another Hack to seek within a compressed stream or Yet Another Bash Script to wget the latest files); a rough sketch of such a wrapper follows below
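As a strawman for that common tool or library, here’s a sketch of a thin wrapper around plain git commands: clone once, then fast-forward the local copy no matter how far behind it is, and read any page at any recorded revision. The repository URL, branch name, and file-per-page layout are all assumptions.

```python
import subprocess
from pathlib import Path

class DumpMirror:
    """Thin wrapper over a git checkout of a (hypothetical) dumps repository,
    sketching the 'common tool or library' idea: consumers call update() and
    read_page() instead of scripting wget + bzcat themselves."""

    def __init__(self, url: str, local_dir: str):
        self.url = url
        self.dir = Path(local_dir)

    def _git(self, *args: str) -> str:
        return subprocess.run(
            ["git", "-C", str(self.dir), *args],
            check=True, capture_output=True, text=True,
        ).stdout

    def clone(self) -> None:
        """One-time full (or shallow) copy of the master data set."""
        subprocess.run(["git", "clone", self.url, str(self.dir)], check=True)

    def update(self) -> None:
        """Incremental catch-up: fetch only the new objects since the last
        sync, then fast-forward the local branch, however long it's been.
        (Branch name 'master' is an assumption here.)"""
        self._git("fetch", "origin")
        self._git("merge", "--ff-only", "origin/master")

    def read_page(self, path: str, rev: str = "HEAD") -> str:
        """Read one page file at any recorded revision, e.g. to compare
        the current text against an older snapshot."""
        return self._git("show", f"{rev}:{path}")

# Hypothetical usage; the URL and per-page file layout are made up.
# mirror = DumpMirror("https://dumps.example.org/enwiki.git", "./enwiki")
# mirror.clone()
# mirror.update()
# old = mirror.read_page("pages/Main_Page.wikitext", rev="HEAD~100")
```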

Some major open questions:

  • Does this model sound useful for people actually using Wikipedia data dumps in various circumstances today?
  • What help might people need in preparing their existing tools for a switch to this kind of model?
  • Does it make sense to actually use an existing VCS such as git itself? Or are there good reasons to make something bespoke that’s better-optimized for the use case or easier to embed in more complex cross-platform tools?
  • When data objects are removed from the wiki databases for copyright/privacy/legal reasons, what are the implications for the data model and network protocol?
    • git’s crypto-hash-based versioning tree may be tricky here
    • Do we need a way both to handle “fast-forward” updates of local to master and to revert back locally (e.g. to compare current and old revisions)?
  • Technical issues in updating the master VCS from the live wikis?
    • Push updates immediately from MediaWiki hook points, or use an indirect notify+pull? (There’s a rough sketch of the notify+pull side below.)
    • Does RCStream expose enough events and data already for the latter, or is something else needed to ‘push’?
    • Can update jobs for individual revisions be efficient enough or do we need more batching?
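To make the notify+pull option concrete, here’s a rough sketch of the pull half: given a (title, rev_id) notification from RCStream or a similar feed, fetch the revision text over the standard MediaWiki query API and record it as a commit in a local mirror. The file-per-page layout, repo path, and one-commit-per-revision granularity are assumptions, and the JSON parsing matches the older API response shape.

```python
import json
import subprocess
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # example wiki; any MediaWiki API works
REPO = "./enwiki-mirror"                    # hypothetical local git mirror

def fetch_revision_text(rev_id: int) -> str:
    """Pull one revision's wikitext from the MediaWiki API.
    (Parsing below matches the classic response format with content under
    the '*' key; newer API versions use rvslots/slots instead.)"""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

def apply_change(title: str, rev_id: int) -> None:
    """The 'pull' half of notify+pull: on a change notification, fetch the
    new revision and record it as one commit in the local mirror."""
    text = fetch_revision_text(rev_id)
    path = title.replace(" ", "_") + ".wikitext"  # assumed file-per-page layout
    with open(f"{REPO}/{path}", "w", encoding="utf-8") as f:
        f.write(text)
    subprocess.run(["git", "-C", REPO, "add", path], check=True)
    subprocess.run(
        ["git", "-C", REPO, "commit", "-m", f"{title} rev {rev_id}"],
        check=True,
    )

# The 'notify' half would feed (title, rev_id) pairs from RCStream or a
# similar event feed; batching several changes into one commit is the
# obvious tweak if per-revision jobs turn out to be too chatty.
```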

Things to think about!