TitleBlacklist, title protection

A few days ago support for protecting deleted or not-yet-created pages from creation went live. Today I’ve also enabled Vasiliev’s TitleBlacklist extension, which allows admins to preemptively block potential titles via regular expressions.

Currently just the local blacklist is enabled, at MediaWiki:Titleblacklist on each wiki.

The regex behavior is a little different from the existing SpamBlacklist, so admins be sure to check the docs and test your entries. :) But it should come in rather handy for certain kinds of spam and silliness attacks.

Wikipedia on Leopard


The Dictionary.app included in Mac OS X 10.5 has support for making lookups to Wikipedia, optionally in various languages.

The actual display of articles seems to be done by loading the page out of the live Wikipedia and doing some custom filtering of it. This isn’t documented to us, so I hope we don’t break it by mistake!

The searching is done via a relatively simple REST protocol to do title-prefix searches as type-ahead suggestions.

Some Apple engineers whipped up a little index search using the DARTS C++ library, with a PHP wrapper extension around it for web output. The results are wrapped in some simple HTML, pretty straightforward to handle.

Once production finally rolled out, though, we encountered some problems:

  1. The number of page titles in the system has increased to the point where a complete index for all languages barely fits in memory on a 32-bit box. I had to break the index in two (English and non-English) just to get it to generate.
  2. Performance was spotty, sometimes mysteriously hanging up for several long seconds. I suspect this is due to the huge indexes loaded in memory; every once in a while something decides to swap.

I finally got my hands on a copy of Leopard to confirm I wasn’t breaking the client, so it’s time to see what I can do…

Rather than investing more resources into the DARTS indexer, I figured I’d see if we can roll this back in with our existing tools to make it easier to maintain.

We already have a type-ahead suggestion backend, which is used for our [[OpenSearch]] interface. If you’re running Firefox 2.0 or later you can pull up the ‘Wikipedia’ search and try it out.

I did some quick testing and confirmed that it was pretty easy to make a translator that would query the OpenSearch suggestion API and format results for the Apple widget; I just had to add a limit option, then a simple re-query and wrap the results.

On my quick benchmarking, performance at least isn’t any worse, and seems to be more consistent so far and gives up to date results — no waiting for the next index generation.

The one big problem right now is that our suggestion search is case-sensitive, since it pulls directly from the binary-collated page title columns in our core database. That’s a minor annoyance except that the Dictionary app sends us queries which have been forced to lowercase — so you can’t easily reach titles with caps past the first letter.

Guess it’s time to bring back the title key field and get that working properly so I can switch in the new version…

Cal-ih-for-nye-ay!

I’m pretty much up and running in San Francisco, and officially back to work as of today. I’ll be working from home for a couple more weeks until the office space is up and running…

Our stuff hasn’t arrived from Florida yet, so the apartment’s a little empty, but we’re intact and online with two cats, an air mattress, and a laptop. Desks? Bah! Who needs desks? :)

Random tests

We used to have a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaking it got improved enough that we don’t hear much about it, but it’s still not very even. To get an idea of how big the problem remains, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)

I did these tests by grabbing random page_id values, with enough runs to select each page 100 times given a perfectly even distribution.

The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.

But it’s not perfect, as there are going to be gaps of differing sizes between entries, and entries with large gaps before them will be predisposed to get more hits. In my test db, the most-frequently picked pages are five times as likely to be selected as ideal, and other pages are very unlikely to come up.

(The first graph is sorted by page_id; the second by hits.)

 513 |  *                                                                                                                                                                             
 487 |  *                                                                                                                                                                             
 461 |  *                                                                                                                                                                             
 436 |  *            *                                                                                                                                                                
 410 | **            *                   *                                                                                                                                            
 384 | **            *                   *                                                                                                                                            
 359 | **            *                   *                                                                                                                                            
 333 | ****          *   *               *                           *                  *                                                                                             
 307 | ****   **   * *   *            *  *                           *                  *    *                                                                                        
 282 | ****  ***   * *   *     *      *  *       *                  **                  *    *                                                                                        
 256 | ****  ***   * *   ** *  *      *  *     * *                  **                  *    *                            * *       *                                                 
 230 | ****  ***   * *   ** *  *      *  *     * *  *               **                  *    *                            * *       *                                                 
 205 | ****  ***  ** *   ** *  *      *  *   * * ** *               **     *     *      *    *                            * *       *                                                 
 179 | ****  ***  ** * * ** *  *    * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 153 | **** ****  ** * * ** * ** *  * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 128 | **** ****  ** * **** * ** ** * *  **  * * ** **     * **     **     * * * *      *  * **             *       ** *  * *       *    *                                            
 102 |***** ****  ** * **** **** **** *  **  * **** **   * * ****   **   *** * * *  *   * ** **            **    *  ** *  * *       *    **                                           
  76 |**********  ** * **** ***********  **  ****** ** * *** ****   **** *** * * *  **  * ***** * *  *   * **    *  ** *  ***       **** **   **      *     *                         
  51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** *  * *** **  ***  ** *  ***  ********* ***  **   ** *     *  *                      
  25 |************************************** *************** *************** ********* ********** **** *********** ****** ******************  **  *** * **  * **   * ***** *          
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 513 |                                                                                                                                                                               *
 487 |                                                                                                                                                                               *
 461 |                                                                                                                                                                               *
 436 |                                                                                                                                                                              **
 410 |                                                                                                                                                                            ****
 384 |                                                                                                                                                                            ****
 359 |                                                                                                                                                                            ****
 333 |                                                                                                                                                                       *********
 307 |                                                                                                                                                                  **************
 282 |                                                                                                                                                              ******************
 256 |                                                                                                                                                        ************************
 230 |                                                                                                                                                       *************************
 205 |                                                                                                                                                  ******************************
 179 |                                                                                                                                          **************************************
 153 |                                                                                                                                       *****************************************
 128 |                                                                                                                             ***************************************************
 102 |                                                                                                               *****************************************************************
  76 |                                                                                         ***************************************************************************************
  51 |                                                             *******************************************************************************************************************
  25 |                                ************************************************************************************************************************************************
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As an example using page_id instead of page_random, the distribution is *much* worse. ;)

 13962 |*                                                                                                                                                                                
 13263 |*                                                                                                                                                                                
 12565 |*                                                                                                                                                                                
 11867 |*                                                                                                                                                                                
 11169 |*                                                                                                                                                                                
 10471 |*                                                                                                                                                                                
  9773 |*                                                                                                                                                                                
  9075 |*                                                                                                                                                                                
  8377 |*                                                                                                                                                                                
  7679 |*                                                                                                                                                                                
  6981 |*                                                                                                                                                                                
  6282 |*                                                                                                                                                                                
  5584 |*                                                                                                                                                                                
  4886 |*                                                                                                                                                                                
  4188 |*                                                                                                                                                                                
  3490 |*                                                                                                                                                                                
  2792 |*                                                                                                                                                                                
  2094 |*                                                                                                                                                                                
  1396 |*                                                                                                                                                                                
   698 |**                                                                                                                                                                               
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 13962 |                                                                                                                                                                                *
 13263 |                                                                                                                                                                                *
 12565 |                                                                                                                                                                                *
 11867 |                                                                                                                                                                                *
 11169 |                                                                                                                                                                                *
 10471 |                                                                                                                                                                                *
  9773 |                                                                                                                                                                                *
  9075 |                                                                                                                                                                                *
  8377 |                                                                                                                                                                                *
  7679 |                                                                                                                                                                                *
  6981 |                                                                                                                                                                                *
  6282 |                                                                                                                                                                                *
  5584 |                                                                                                                                                                                *
  4886 |                                                                                                                                                                                *
  4188 |                                                                                                                                                                                *
  3490 |                                                                                                                                                                                *
  2792 |                                                                                                                                                                                *
  2094 |                                                                                                                                                                                *
  1396 |                                                                                                                                                                                *
   698 |                                                                                                                                                                               **
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:

 84 |                                                                                                                                                          *                      
 79 |                                                                        *                                                          *                      *                      
 75 |      *                                      *    *        *     *      *                    *     *                        *      *                  *   *                      
 71 |      *                *         *           *    *     *  *   * **     *            *       *   * *                        ** *   *       *     *    *   *                   *  
 67 | * *  **               **       ***        * **   *** * * ** * * ***    *     *    ***   *   **  * *  * *      * *    * *   ****   *       *    **    * * *                * **  
 63 | * ** ** * * * *    ***** *  * *****    * ** ***  *** * * ** ********   *  *  * * ****  **   ***** ** * *   *  * *    * * * **** * *       *    **  * * * *  *** **        * *** 
 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** *******  *****  *** * **** **** ** *** ** ** **** * *  *** **  **    ***** 
 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * *******  ***** ***** 
 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 
 46 |************************************************************************************************************************************************************************** ******
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 84 |                                                                                                                                                                                *
 79 |                                                                                                                                                                              ***
 75 |                                                                                                                                                                     ************
 71 |                                                                                                                                                         ************************
 67 |                                                                                                                           ******************************************************
 63 |                                                                                        *****************************************************************************************
 58 |                                        *****************************************************************************************************************************************
 54 |                *****************************************************************************************************************************************************************
 50 |     ****************************************************************************************************************************************************************************
 46 | ********************************************************************************************************************************************************************************
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.

On the move

This month’s been a busy one… at Wikimedia we’ve done a lot of hopping around getting things together for the fundraiser, and my lady and I spent a few days out west scouting apartments in San Francisco for the relocation. I’m finally caught up with MediaWiki code review, and some dreaded coughing plague from the trip is catching up with me, so it’s time to hit the posting backlog… :D

Found a nice little flat that’s merely twice the cost of the places we were eying in Tampa… believe it or not that is a good deal for SF. :) We’re going to try compensating for part of the cost by going car-free.

Public transit in the city is pretty decent — especially compared to suburban Florida. Most places in the city are within a mile of a light rail line, and buses are both plentiful and frequent.

Further, San Francisco is a fairly car-hostile city. The hills make driving less than fun in many neighborhoods, and parking costs are atrocious. We were in town during municipal elections and watched the city’s voters reject a ballot measure to increase the amount of parking, in favor of allocating more funds for public transit. Even just parking at home would cost us $100 a month to rent a space in the carport out back!

The savings on insurance, parking, maintenance, and payments on a car whose transmission won’t die in the hills should more than pay for the occasional rental for trips out of town. Depending on how much we end up driving, we may actually be saving money versus living in the suburbs… we’d have cheaper rents out in Walnut Creek or Pleasant Hill, but more need to keep a car for everyday tasks.

Wiki dumps… in-dump revision diffs?

In breaks between fundraiser stuff I’m investigating patching up the dumps to behave nicer. The biggest problem to date has been how to get full-history dumps generated in a reasonable amount of time and with greater reliability.

As previously explored, the compression of the files is itself a pretty big part of the burden; cleaning up the bottleneck here could allow improvements in the other processing to shine. Effective compression takes a lot of CPU, though, especially the 7-zip LZMA that does so well on the history dumps.

An idea that gets tossed around from time to time is storing diffs of text revision-to-revision; most edits only change a paragraph or two, so only storing the change can save a lot of space. Any differential system introduces complexity and potentially could be fragile, but it ain’t an awful idea.

Our own internal storage has a frightening amalgam of external database shards, batch compression, and character encoding conversion, which is something we try to hide by doing the dumps as version-independent XML. :)

I’m experimenting a bit with hacking something that looks more or less like a standard unified diff into the exporter, which would be fairly easy to implement a re-patcher for on import.

Testing with a tiny chunk of the English Wikipedia which contains a few thousand revisions of [[Wikipedia:Anarchism|Anarchism]]… the diff-laden version is about 18M for the 3687-revision file, versus 194M for the fully expanded version.

Not bad. :)

7-Zip compresses them both down to about 408K… but the smaller file takes a tenth the time to do so. Even gzip and bzip2 do an order of magnitude better compressing the smaller files.

My first pass adapted the PHP diff class we use for in-wiki diffs… It’s a bit sluggish, but combined with bzip2 compression it beats the diffless version by some margin. Using a faster C++ diff and fixing up the output to be actually usable, this might save a lot of time…

Of course all software using the dumps would have to be updated to understand the diff bits, and I’ll have to decide between in-text diff formatting or light XML markup… :)

Greener Wikimedia Foundation?

Been reading Philip José Farmer’s Dayworld series and found myself thinking about that dang ol’ environment. In the Dayworld universe, a future society “solves” pollution and resource shortages by keeping 1/6 of the world’s population in suspended animation at any given time. Perhaps an extreme solution… :)

Someone once suggested Wikimedia could get into the carbon credits market… which sounds like a lovely scam we should avoid at all costs. ;) But what can we really do to be greener? What is WMF’s actual footprint…?

  • The servers
    We run a few hundred computers 24/7; they and their air conditioning suck up electricity. Could we be more efficient about our power usage? Do our newer 2x quad-core boxes pump more page views per kilowatt than our older machines, and if so should we retire the older ones? Should we investigate blade servers or Sun’s “CoolThreads” systems again?

  • The office
    The main office houses a handful of employees; it’s no BigCo but every bit helps, right? There’s lights, computers, air conditioning, and of course the impact of a few people commuting every day. Moving to San Francisco will make most of those commutes practical by public transportation instead of automobiles, and the more moderate climate could save on electricity spent running the AC.

  • Jet-setting
    We run an international conference every year, as well as smaller meetings and individual speaking engagements. What’s the impact of several hundred people taking transcontinental or intercontinental flights? Can we or should we reduce the amount of travel?

Incremental dumps

A follow-up to my previous notes on dumps

As an optimization to avoid hitting the text storage databases too hard, the wiki XML dumps are done in two passes:

  1. dumpBackup.php --skeleton pulls a consistent snapshot of page and revision metadata to create a “skeleton dump”, without any of the revision text.
  2. dumpTextPass.php reads that XML skeleton, alongside the previous complete XML dump. Revision text that was already present in the previous dump is copied straight over, so only newly created revisions have to be loaded out of the database.

It should be relatively easy to modify this technique to create an incremental dump file, which instead of listing out every page and revision in the entire system would list only those which have changed.

The simplest way to change the dump schema for this might be to add an action attribute to the <page> and <revision> elements, with create, update, and delete values:

<mediawiki>
  <page action="create">
    <!-- Creating a new page -->
    <id>10</id>
    <title>A new page</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>100</id>
      <timestamp>2001-01-15T14:03:00Z</title>
      <contributor>...</contributor>
      <text>...</text>
    </revision>
  </page>
  <page action="update">
    <!-- This page has been renamed. Update its record with new values. -->
    <id>11</id>
    <title>New title</title>
    <revision action="create">
      <!-- And a new revision. Easy! -->
      <id>110</id>
      <timestamp>2001-01-15T14:03:00Z</title>
      <contributor>...</contributor>
      <comment>Renamed from "Old title" to "New title"</comment>
      <text>...</text>
    </revision>
  </page>
  <page action="delete">
    <!-- This page has been deleted -->
    <id>12</id>
    <revision action="delete">
      <id>120</id>
    </revision>
  </page>
</mediawiki>

Perhaps those could be moved down to finer granularity for instance to indicate whether a page title was changed or not etc to avoid unnecessary updates, but I’m not sure how much it’d really matter.

There are a few scenarios to take into account as far as interaction with unique keys:

  • Page titles (page_namespace,page_title): a page rename can cause a temporary conflict between two pages between one application and the next.
  • Revision IDs (rev_id): History merges could cause a revision to be ‘added’ to one page, and ‘removed’ from another which appears later in the data set. The insertion would trigger a key conflict.

We could try a preemptive UPDATE to give conflicting pages a non-conflicting temporary title, or we could perhaps use REPLACE INTO instead of INSERT INTO in all cases… that could leave entries deleted during the application, but they should come back later on so the final result is consistent.

In my quick testing, REPLACE performs just as well as INSERT when there are no conflicts, and not _insanely_ bad even when there are (about 80% slower in my unscientific benchmark), so when conflicts are rare that’s probably just fine. At least for MySQL targets. :D

Test imports of ia.wikipedia.org full-history dump; SQL generated by MWDumper, importing into MySQL 5.0, best time for each run:

$ time mysql -u root working < insert.sql 

real    0m20.819s
user    0m5.537s
sys     0m0.648s

Modified to use REPLACE instead of INSERT, on a fresh empty database:

$ time mysql -u root working < replace.sql 

real    0m20.557s
user    0m5.530s
sys     0m0.643s

Importing completely over a full database:

$ time mysql -u root working < replace.sql 

real    0m34.109s
user    0m5.533s
sys     0m0.641s

So that's probably feasible. :)

In theory an incremental dump could be made against a previous skeleton dump as well as against full dumps, which would make it possible to create additional incremental dumps even if full-text dumps fail or are skipped.

Wiki data dumps

There’s a few things we can do to fix up the data dump process again…

  • New machines: Moving the dump runners from the old boxes they’re on to a couple of our newer quad-core boxes should improve things.
  • Improve parallelism: When generating bzip2 files, ensuring that the dbzip2 configuration is set up properly may help.

    For .7z files, not sure… There’s a note in the changelog for p7zip that LZMA compression is multithreaded as of 4.52; if that gives a good speedup on 2x and 4x boxes, that could be very attractive.

    Figuring out more ways to split across machines could be beneficial as well.

  • Improve read speed: 7z is a *lot* faster to decompress than bzip2. Using the prior .7z dump instead of the prior .bz2 could help speed things up, but last time I tried that I had problems with the pipes not closing properly, leading to hangs at the end.
  • More robust dumpTextPass: the longer it takes, the more likely it is to die due to a database burp. If the actual database-reading part is pushed out to a subprocess, that can be easily restarted after an error while the parent process, which is based around reading stub and previous XML files, keeps on going.
  • Splitting files? There’s some thought that dividing the dumps into smaller pieces might be more reliable, as each piece can be re-run separately if it breaks — as well as potentially run in parallel on separate boxes.

It may also be worth dropping the .bz2 files in favor of .7z, especially if we can speed up the compression.

Also have to check if J7Zip can decompress the files and if it’d be possible to integrate it into mwdumper; I don’t like having to rely on an external tool to decompress the files. I hadn’t had any luck with the older java_lzma.

Update: Just for fun, I tried compressing & decompressing a 100 meg SQL file chunk with various proggies on my iMac (x86_64 Linux, 2.16 GHz Core 2 Duo). Parallelism is measured here as (user time / (real time – system time)) as an approximation of multi-core CPU utilization; left out where only one CPU is getting used.

Program Comp time Comp parallelism Decomp time
gzip 10.354 s 1.616 s
bzip2 17.018 s 5.542 s
dbzip2 10.136 s 1.95x
7zr 4.43 81.603 s 1.47x 3.771 s
7za 4.55 98.201 s 1.46x 3.523 s

Nothing seems to be doing parallelism on decompression.

p7zip seems to be using the second CPU for something during compression, but it’s not fully utilized, which leads me to suspect it won’t get any faster on a four-way or eight-way box. dbzip2 should scale up a couple more levels, and can use additional nodes over the network, but you’re still stuck with slow decompression.

Update: I noticed that 7za also can work with .bz2 archives — and it does parallelize decompression! Nice.

Program Comp time Comp parallel Decomp time Decomp parallel
7za 4.55 .bz2 17.668 s 1.99x 2.999 s 1.91x

The compression speed is worse than regular bzip2, but the decompression is faster. Innnnteresting…

Update: Some notes on incremental dump generation