Wikipedia on Leopard


The Dictionary.app included in Mac OS X 10.5 supports Wikipedia lookups, optionally in various languages.

The actual display of articles seems to be done by loading the page from the live Wikipedia and applying some custom filtering to it. This isn’t documented to us, so I hope we don’t break it by mistake!

Searching uses a relatively simple REST protocol that does title-prefix searches to provide type-ahead suggestions.

Some Apple engineers whipped up a little index search using the DARTS C++ library, with a PHP wrapper extension around it for web output. The results are wrapped in some simple HTML, pretty straightforward to handle.

Once production finally rolled out, though, we encountered some problems:

  1. The number of page titles in the system has increased to the point where a complete index for all languages barely fits in memory on a 32-bit box. I had to break the index in two (English and non-English) just to get it to generate.
  2. Performance was spotty, sometimes mysteriously hanging up for several long seconds. I suspect this is due to the huge indexes loaded in memory; every once in a while something decides to swap.

I finally got my hands on a copy of Leopard to confirm I wasn’t breaking the client, so it’s time to see what I can do…

Rather than investing more resources into the DARTS indexer, I figured I’d see if we can roll this back in with our existing tools to make it easier to maintain.

We already have a type-ahead suggestion backend, which is used for our [[OpenSearch]] interface. If you’re running Firefox 2.0 or later you can pull up the ‘Wikipedia’ search and try it out.

I did some quick testing and confirmed that it was pretty easy to make a translator that queries the OpenSearch suggestion API and formats the results for the Apple widget; I just had to add a limit option, then re-query and wrap the results.
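A rough sketch of what such a translator could look like, in Python rather than whatever the real one is written in. The endpoint URL, the function names, the User-Agent string and especially the HTML wrapping are assumptions for illustration; the markup the Dictionary client actually expects isn't documented here. The only thing taken as given is the general shape of an OpenSearch suggestion reply: a JSON array whose second element is the list of matching titles.

```python
import json
from html import escape
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; the real deployment's URL layout may differ.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_suggest_url(prefix, limit=10, api_url=API_URL):
    """Build an OpenSearch suggestion request for a title prefix."""
    params = {
        "action": "opensearch",
        "search": prefix,
        "limit": limit,  # the "limit option" mentioned above
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def wrap_results(opensearch_payload):
    """Turn an OpenSearch reply [query, [title, ...], ...] into HTML.

    The <ul>/<li> markup is a placeholder, not the format the
    Dictionary client really expects.
    """
    titles = opensearch_payload[1]
    items = "".join("<li>%s</li>" % escape(t) for t in titles)
    return "<ul>%s</ul>" % items


def suggest(prefix, limit=10):
    """Re-query the suggestion backend and wrap the reply (needs network)."""
    request = Request(
        build_suggest_url(prefix, limit),
        headers={"User-Agent": "suggest-translator-sketch/0.1"},  # hypothetical
    )
    with urlopen(request) as response:
        return wrap_results(json.load(response))
```

Keeping the URL building and the wrapping separate from the network call means the formatting can be checked against the client without touching the live site.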

In my quick benchmarking, performance is at least no worse and seems more consistent so far, and the results are up to date: no waiting for the next index generation.

The one big problem right now is that our suggestion search is case-sensitive, since it pulls directly from the binary-collated page title columns in our core database. That would be a minor annoyance, except that the Dictionary app sends us queries which have been forced to lowercase, so you can’t easily reach titles with caps past the first letter.

Guess it’s time to bring back the title key field and get that working properly so I can switch in the new version…
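
To illustrate the idea behind a title key, here is a minimal in-memory sketch in Python. It assumes the key is simply a case-folded copy of the title and that prefix matching runs against the sorted keys; the real field would live in the database with its own index, and every name below is hypothetical.

```python
from bisect import bisect_left


def title_key(title):
    """Hypothetical key: a case-folded copy of the title."""
    return title.casefold()


def build_index(titles):
    """Sort (key, title) pairs so that prefix matches are contiguous."""
    return sorted((title_key(t), t) for t in titles)


def prefix_search(index, query, limit=10):
    """Return up to `limit` titles whose key starts with the folded query."""
    key = title_key(query)
    # (key,) sorts before every (key, title) pair, so this finds the
    # first entry whose key is >= the query.
    start = bisect_left(index, (key,))
    matches = []
    for entry_key, title in index[start:start + limit]:
        if not entry_key.startswith(key):
            break
        matches.append(title)
    return matches
```

With an index built from "iPod", "IPv6", "Ipswich" and "Java", the lowercase query "ipv" finds "IPv6", which a case-sensitive match on the raw titles would miss.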