Sweeeet

Hadn’t noticed this before… on Leopard, when you do a window screenshot (command-shift-4, space) it now captures the window’s drop shadow over a transparent background.

Shadow! Shit yeah

That’s pretty cool for demo screenshots; I used to use temporary white backgrounds and capture an area around the window manually, but this is way prettier. :D

Leopard Spotlight

Spotlight keeps deciding it has to index my external hard drive all over again. Is this going to happen every time I reboot? Or is it just because I almost never have to boot unless I’m recovering from a crash or power outage?

Sigh. At least it lets me search the internal drive while it’s doing it.

Wiki data dumps restarted

Maintenance is still pending on the old dump server… I’ve moved the files over to storage2, one of our backup servers, and restarted a couple of dump worker threads. Currently one of those is running on the old server, but it won’t be too fatal if it dies for now. :)

TitleBlacklist, title protection

A few days ago support for protecting deleted or not-yet-created pages from creation went live. Today I’ve also enabled Vasiliev’s TitleBlacklist extension, which allows admins to preemptively block potential titles via regular expressions.

Currently just the local blacklist is enabled, at MediaWiki:Titleblacklist on each wiki.

The regex behavior is a little different from the existing SpamBlacklist, so admins, be sure to check the docs and test your entries. :) But it should come in rather handy for certain kinds of spam and silliness attacks.
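For testing entries offline before saving them, something like the following rough sketch might help. It is not the extension's code: the `#` comment handling, the case-insensitive flag, and the whole-title matching are all assumptions made for illustration, so check the extension docs for the real rules.

```python
import re

def compile_blacklist(text):
    """Compile one regex per non-empty line; '#' starts a comment (assumed)."""
    patterns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            # Case-insensitive matching is an assumption of this sketch.
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns

def is_blacklisted(title, patterns):
    """True if any pattern matches the whole title (assumed anchoring)."""
    return any(p.fullmatch(title) for p in patterns)

if __name__ == "__main__":
    sample = "# spam\n.*viagra.*\nTest \\d+\n"
    rules = compile_blacklist(sample)
    for candidate in ["Buy Viagra now", "Test 42", "Test page"]:
        print(candidate, is_blacklisted(candidate, rules))
```

Note that this naive comment stripping would break a pattern that itself contains a `#`.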

Leopard thoughts

So since I’ve got an iMac with Leopard sitting around in my home office, I figured I’d try actually using it for a while. My impressions so far:

Spotlight: YES! Searching actually seems to work at a reasonable pace so far. Seems to be nicely replacing Quicksilver as an app launcher.

Terminal: YES! Tabs and an integrated SSH agent? I’m sold! Won’t need SSHKeychain anymore.

Spaces: meh Turned it on, haven’t really used it yet.

Finder: meh The new icons are uglier. “Cover flow” for documents seems pretty useless, though vaguely amusing for the ol’ porn folder. ;)

The Dock: meh It’s uglier, but who cares? I only see it when I’m actively clicking something on it.

iCal: meh The detail drawer is gone, replaced with annoying popup thingies. Suckage. Update: The new iCal is a fricking disaster. Trying to see or edit details of events is basically impossible. Extra clicks, popups obscuring your view, lack of feedback while changing dates. Someone needs to be shot.

Boot Camp: meh I gave up on dual-booting years ago in my Linux days. Virtualization for the win!

Mail, iChat, Safari: No terribly interesting changes.

Time Machine: haven’t touched it yet.

Wikipedia on Leopard


The Dictionary.app included in Mac OS X 10.5 has support for making lookups to Wikipedia, optionally in various languages.

The actual display of articles seems to be done by loading the page out of the live Wikipedia and doing some custom filtering of it. This isn’t documented to us, so I hope we don’t break it by mistake!

The searching is done via a relatively simple REST protocol to do title-prefix searches as type-ahead suggestions.

Some Apple engineers whipped up a little index search using the DARTS C++ library, with a PHP wrapper extension around it for web output. The results are wrapped in some simple HTML, pretty straightforward to handle.
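To give a feel for what a title-prefix lookup involves, here is a minimal sketch using a sorted list and binary search. This is deliberately not the DARTS double-array trie the Apple code uses; `prefix_search` and its `limit` parameter are names made up for the example.

```python
from bisect import bisect_left

def prefix_search(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles starting with `prefix`.

    `sorted_titles` must already be sorted; matches for a prefix are
    then contiguous, starting at the first title >= prefix.
    """
    start = bisect_left(sorted_titles, prefix)
    results = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        results.append(title)
    return results

if __name__ == "__main__":
    titles = sorted(["Apple", "Apple Inc.", "Applet", "Apricot", "Banana"])
    print(prefix_search(titles, "Appl"))
```

The comparison here is plain code-point order, so it is case-sensitive.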

Once production finally rolled out, though, we encountered some problems:

  1. The number of page titles in the system has increased to the point where a complete index for all languages barely fits in memory on a 32-bit box. I had to break the index in two (English and non-English) just to get it to generate.
  2. Performance was spotty, sometimes mysteriously hanging up for several long seconds. I suspect this is due to the huge indexes loaded in memory; every once in a while something decides to swap.

I finally got my hands on a copy of Leopard to confirm I wasn’t breaking the client, so it’s time to see what I can do…

Rather than investing more resources into the DARTS indexer, I figured I’d see if we can roll this back in with our existing tools to make it easier to maintain.

We already have a type-ahead suggestion backend, which is used for our [[OpenSearch]] interface. If you’re running Firefox 2.0 or later you can pull up the ‘Wikipedia’ search and try it out.

I did some quick testing and confirmed that it was pretty easy to make a translator that would query the OpenSearch suggestion API and format results for the Apple widget; I just had to add a limit option, then re-query and wrap the results.
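The wrapping step is roughly this shape. The input follows the standard OpenSearch suggestions JSON layout (the query string first, then a list of completions); the `<ul>`/`<li>` output is purely a placeholder, since the HTML the Dictionary client actually expects isn't described here, and `suggestions_to_html` is a hypothetical name.

```python
import json
from html import escape

def suggestions_to_html(opensearch_json, limit=10):
    """Wrap OpenSearch suggestion completions in simple HTML.

    The list markup is a stand-in; the real client format is not
    specified in this post.
    """
    data = json.loads(opensearch_json)
    completions = data[1]  # data[0] is the original query string
    items = "".join(f"<li>{escape(title)}</li>" for title in completions[:limit])
    return f"<ul>{items}</ul>"

if __name__ == "__main__":
    print(suggestions_to_html('["app", ["Apple", "Apple Inc.", "Applet"]]', limit=2))
```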

In my quick benchmarking, performance is at least no worse and seems more consistent so far, and the results are always up to date: no waiting for the next index generation.

The one big problem right now is that our suggestion search is case-sensitive, since it pulls directly from the binary-collated page title columns in our core database. That’s a minor annoyance except that the Dictionary app sends us queries which have been forced to lowercase — so you can’t easily reach titles with caps past the first letter.

Guess it’s time to bring back the title key field and get that working properly so I can switch in the new version…
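The idea behind a title key, sketched very loosely: store a case-folded copy of each title next to the real one, and run the prefix search against the folded copy. Using `casefold()` as the normalization is an assumption for the example; the real field might normalize differently.

```python
from bisect import bisect_left

def title_key(title):
    """Normalized search key; casefold() is this sketch's assumption."""
    return title.casefold()

def build_index(titles):
    """Sorted list of (key, original title) pairs."""
    return sorted((title_key(t), t) for t in titles)

def suggest(index, query, limit=10):
    """Case-insensitive prefix search returning the original titles."""
    key = title_key(query)
    # (key,) sorts before every (key, title) pair with the same key.
    i = bisect_left(index, (key,))
    results = []
    while i < len(index) and len(results) < limit and index[i][0].startswith(key):
        results.append(index[i][1])
        i += 1
    return results

if __name__ == "__main__":
    idx = build_index(["IPod", "IPhone", "Ipswich", "Iowa"])
    print(suggest(idx, "ipho"))  # a lowercased query still finds "IPhone"
```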

Cal-ih-for-nye-ay!

I’m pretty much up and running in San Francisco, and officially back to work as of today. I’ll be working from home for a couple more weeks until the office space is up and running…

Our stuff hasn’t arrived from Florida yet, so the apartment’s a little empty, but we’re intact and online with two cats, an air mattress, and a laptop. Desks? Bah! Who needs desks? :)

Random tests

We used to have a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaking it got improved enough that we don’t hear much about it, but it’s still not very even. To get an idea of how big the problem remains, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)

I did these tests by grabbing random page_id values, with enough runs to select each page 100 times given a perfectly even distribution.

The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.

But it’s not perfect, as there are going to be gaps of differing sizes between entries, and entries with large gaps before them will be predisposed to get more hits. In my test db, the most frequently picked pages are about five times as likely to be selected as they would be under an ideal distribution, and other pages are very unlikely to come up.

(The first graph is sorted by page_id; the second by hits.)

 513 |  *                                                                                                                                                                             
 487 |  *                                                                                                                                                                             
 461 |  *                                                                                                                                                                             
 436 |  *            *                                                                                                                                                                
 410 | **            *                   *                                                                                                                                            
 384 | **            *                   *                                                                                                                                            
 359 | **            *                   *                                                                                                                                            
 333 | ****          *   *               *                           *                  *                                                                                             
 307 | ****   **   * *   *            *  *                           *                  *    *                                                                                        
 282 | ****  ***   * *   *     *      *  *       *                  **                  *    *                                                                                        
 256 | ****  ***   * *   ** *  *      *  *     * *                  **                  *    *                            * *       *                                                 
 230 | ****  ***   * *   ** *  *      *  *     * *  *               **                  *    *                            * *       *                                                 
 205 | ****  ***  ** *   ** *  *      *  *   * * ** *               **     *     *      *    *                            * *       *                                                 
 179 | ****  ***  ** * * ** *  *    * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 153 | **** ****  ** * * ** * ** *  * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 128 | **** ****  ** * **** * ** ** * *  **  * * ** **     * **     **     * * * *      *  * **             *       ** *  * *       *    *                                            
 102 |***** ****  ** * **** **** **** *  **  * **** **   * * ****   **   *** * * *  *   * ** **            **    *  ** *  * *       *    **                                           
  76 |**********  ** * **** ***********  **  ****** ** * *** ****   **** *** * * *  **  * ***** * *  *   * **    *  ** *  ***       **** **   **      *     *                         
  51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** *  * *** **  ***  ** *  ***  ********* ***  **   ** *     *  *                      
  25 |************************************** *************** *************** ********* ********** **** *********** ****** ******************  **  *** * **  * **   * ***** *          
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 513 |                                                                                                                                                                               *
 487 |                                                                                                                                                                               *
 461 |                                                                                                                                                                               *
 436 |                                                                                                                                                                              **
 410 |                                                                                                                                                                            ****
 384 |                                                                                                                                                                            ****
 359 |                                                                                                                                                                            ****
 333 |                                                                                                                                                                       *********
 307 |                                                                                                                                                                  **************
 282 |                                                                                                                                                              ******************
 256 |                                                                                                                                                        ************************
 230 |                                                                                                                                                       *************************
 205 |                                                                                                                                                  ******************************
 179 |                                                                                                                                          **************************************
 153 |                                                                                                                                       *****************************************
 128 |                                                                                                                             ***************************************************
 102 |                                                                                                               *****************************************************************
  76 |                                                                                         ***************************************************************************************
  51 |                                                             *******************************************************************************************************************
  25 |                                ************************************************************************************************************************************************
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
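The gap effect is easy to reproduce with a toy simulation. This is a sketch, not the test script behind the graphs above: the page count, the seed, and the wrap-around when the random pick lands past the last key are all choices made for the example.

```python
import random
from bisect import bisect_left
from collections import Counter

def simulate(num_pages=200, runs_per_page=100, seed=1):
    """Pick pages via 'first page_random >= random float' and count hits."""
    rng = random.Random(seed)
    keys = sorted(rng.random() for _ in range(num_pages))  # page_random values
    hits = Counter()
    for _ in range(num_pages * runs_per_page):
        i = bisect_left(keys, rng.random())
        # Past the last key: wrap to the first page (assumed behavior).
        hits[i % num_pages] += 1
    return keys, hits

if __name__ == "__main__":
    keys, hits = simulate()
    counts = [hits[i] for i in range(len(keys))]
    print("ideal hits per page: 100")
    print("min:", min(counts), "max:", max(counts))
```

Each page's expected share of hits is proportional to the gap between its key and the key before it, which is why a handful of pages with big gaps soak up so many selections.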

As an example using page_id instead of page_random, the distribution is *much* worse. ;)

 13962 |*                                                                                                                                                                                
 13263 |*                                                                                                                                                                                
 12565 |*                                                                                                                                                                                
 11867 |*                                                                                                                                                                                
 11169 |*                                                                                                                                                                                
 10471 |*                                                                                                                                                                                
  9773 |*                                                                                                                                                                                
  9075 |*                                                                                                                                                                                
  8377 |*                                                                                                                                                                                
  7679 |*                                                                                                                                                                                
  6981 |*                                                                                                                                                                                
  6282 |*                                                                                                                                                                                
  5584 |*                                                                                                                                                                                
  4886 |*                                                                                                                                                                                
  4188 |*                                                                                                                                                                                
  3490 |*                                                                                                                                                                                
  2792 |*                                                                                                                                                                                
  2094 |*                                                                                                                                                                                
  1396 |*                                                                                                                                                                                
   698 |**                                                                                                                                                                               
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 13962 |                                                                                                                                                                                *
 13263 |                                                                                                                                                                                *
 12565 |                                                                                                                                                                                *
 11867 |                                                                                                                                                                                *
 11169 |                                                                                                                                                                                *
 10471 |                                                                                                                                                                                *
  9773 |                                                                                                                                                                                *
  9075 |                                                                                                                                                                                *
  8377 |                                                                                                                                                                                *
  7679 |                                                                                                                                                                                *
  6981 |                                                                                                                                                                                *
  6282 |                                                                                                                                                                                *
  5584 |                                                                                                                                                                                *
  4886 |                                                                                                                                                                                *
  4188 |                                                                                                                                                                                *
  3490 |                                                                                                                                                                                *
  2792 |                                                                                                                                                                                *
  2094 |                                                                                                                                                                                *
  1396 |                                                                                                                                                                                *
   698 |                                                                                                                                                                               **
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:

 84 |                                                                                                                                                          *                      
 79 |                                                                        *                                                          *                      *                      
 75 |      *                                      *    *        *     *      *                    *     *                        *      *                  *   *                      
 71 |      *                *         *           *    *     *  *   * **     *            *       *   * *                        ** *   *       *     *    *   *                   *  
 67 | * *  **               **       ***        * **   *** * * ** * * ***    *     *    ***   *   **  * *  * *      * *    * *   ****   *       *    **    * * *                * **  
 63 | * ** ** * * * *    ***** *  * *****    * ** ***  *** * * ** ********   *  *  * * ****  **   ***** ** * *   *  * *    * * * **** * *       *    **  * * * *  *** **        * *** 
 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** *******  *****  *** * **** **** ** *** ** ** **** * *  *** **  **    ***** 
 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * *******  ***** ***** 
 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 
 46 |************************************************************************************************************************************************************************** ******
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 84 |                                                                                                                                                                                *
 79 |                                                                                                                                                                              ***
 75 |                                                                                                                                                                     ************
 71 |                                                                                                                                                         ************************
 67 |                                                                                                                           ******************************************************
 63 |                                                                                        *****************************************************************************************
 58 |                                        *****************************************************************************************************************************************
 54 |                *****************************************************************************************************************************************************************
 50 |     ****************************************************************************************************************************************************************************
 46 | ********************************************************************************************************************************************************************************
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.
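A small sketch of the keep-picking approach, counting how many dips each selection takes. The in-memory `pages` dict stands in for the page table, and the function name and `max_tries` cap are inventions for the example.

```python
import random

def random_valid_page(pages, max_id, rng, namespace=0, max_tries=1000):
    """Pick random ids until one is a non-redirect page in `namespace`.

    `pages` maps page_id -> (namespace, is_redirect); ids may have gaps.
    Returns (page_id, number_of_tries).
    """
    for tries in range(1, max_tries + 1):
        page_id = rng.randint(1, max_id)
        info = pages.get(page_id)
        if info is not None and info == (namespace, False):
            return page_id, tries
    raise LookupError("no valid page found within max_tries")

if __name__ == "__main__":
    pages = {1: (0, False), 2: (0, True), 5: (1, False), 9: (0, False)}
    rng = random.Random(0)
    print(random_valid_page(pages, max_id=10, rng=rng))
```

If a fraction p of the id range is valid for the namespace you want, the expected number of dips is 1/p, so a namespace holding 1% of ids needs about 100 tries on average.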

On the move

This month’s been a busy one… at Wikimedia we’ve done a lot of hopping around getting things together for the fundraiser, and my lady and I spent a few days out west scouting apartments in San Francisco for the relocation. I’m finally caught up with MediaWiki code review, and some dreaded coughing plague from the trip is catching up with me, so it’s time to hit the posting backlog… :D

Found a nice little flat that’s merely twice the cost of the places we were eying in Tampa… believe it or not that is a good deal for SF. :) We’re going to try compensating for part of the cost by going car-free.

Public transit in the city is pretty decent — especially compared to suburban Florida. Most places in the city are within a mile of a light rail line, and buses are both plentiful and frequent.

Further, San Francisco is a fairly car-hostile city. The hills make driving less than fun in many neighborhoods, and parking costs are atrocious. We were in town during municipal elections and watched the city’s voters reject a ballot measure to increase the amount of parking, in favor of allocating more funds for public transit. Even just parking at home would cost us $100 a month to rent a space in the carport out back!

The savings on insurance, parking, maintenance, and payments on a car whose transmission won’t die in the hills should more than pay for the occasional rental for trips out of town. Depending on how much we end up driving, we may actually be saving money versus living in the suburbs… we’d have cheaper rents out in Walnut Creek or Pleasant Hill, but more need to keep a car for everyday tasks.

Wiki dumps… in-dump revision diffs?

In breaks between fundraiser stuff I’m investigating patching up the dumps to behave nicer. The biggest problem to date has been how to get full-history dumps generated in a reasonable amount of time and with greater reliability.

As previously explored, the compression of the files is itself a pretty big part of the burden; cleaning up the bottleneck here could allow improvements in the other processing to shine. Effective compression takes a lot of CPU, though, especially the 7-zip LZMA that does so well on the history dumps.

An idea that gets tossed around from time to time is storing diffs of text revision-to-revision; most edits only change a paragraph or two, so only storing the change can save a lot of space. Any differential system introduces complexity and potentially could be fragile, but it ain’t an awful idea.

Our own internal storage has a frightening amalgam of external database shards, batch compression, and character encoding conversion, which is something we try to hide by doing the dumps as version-independent XML. :)

I’m experimenting a bit with hacking something that looks more or less like a standard unified diff into the exporter; a re-patcher for it on import would be fairly easy to implement.
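As a rough illustration of the store-the-change, re-patch-on-import round trip, here is a sketch using Python's difflib. It swaps in a simple copy/insert delta instead of the unified diff format mentioned above, and `make_delta` / `apply_delta` are names invented for the example.

```python
from difflib import SequenceMatcher

def make_delta(old_lines, new_lines):
    """Describe new_lines as copies from old_lines plus inserted lines."""
    delta = []
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))  # reuse old_lines[i1:i2]
        elif j2 > j1:
            delta.append(("insert", new_lines[j1:j2]))  # new or replaced text
        # pure deletions need no entry: the old lines just aren't copied
    return delta

def apply_delta(old_lines, delta):
    """Rebuild the new revision from the previous one and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

if __name__ == "__main__":
    rev1 = ["Intro paragraph.", "Second paragraph.", "Footer."]
    rev2 = ["Intro paragraph.", "Second paragraph, edited.", "Footer."]
    delta = make_delta(rev1, rev2)
    print(delta)
    print(apply_delta(rev1, delta) == rev2)
```

Only the edited paragraph is stored in full; everything else is a pair of line offsets into the previous revision.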

Testing with a tiny chunk of the English Wikipedia which contains a few thousand revisions of [[Wikipedia:Anarchism|Anarchism]]… the diff-laden version is about 18M for the 3687-revision file, versus 194M for the fully expanded version.

Not bad. :)

7-Zip compresses them both down to about 408K… but the smaller file takes a tenth the time to do so. Even gzip and bzip2 do an order of magnitude better compressing the smaller files.

My first pass adapted the PHP diff class we use for in-wiki diffs… It’s a bit sluggish, but combined with bzip2 compression it beats the diffless version by some margin. Using a faster C++ diff and fixing up the output to be actually usable, this might save a lot of time…

Of course all software using the dumps would have to be updated to understand the diff bits, and I’ll have to decide between in-text diff formatting or light XML markup… :)