Leopard thoughts

So since I’ve got an iMac with Leopard sitting around in my home office, I figured I’d try actually using it for a while. My impressions so far:

Spotlight: YES! Searching actually seems to work at a reasonable pace so far. Seems to be nicely replacing Quicksilver as an app launcher.

Terminal: YES! Tabs and an integrated SSH agent? I’m sold! Won’t need SSHKeychain anymore.

Spaces: meh Turned it on, haven’t really used it yet.

Finder: meh The new icons are uglier. “Cover flow” for documents seems pretty useless, though vaguely amusing for the ol’ porn folder. ;)

The Dock: meh It’s uglier, but who cares? I only see it when I’m actively clicking something on it.

iCal: meh The detail drawer is gone, replaced with annoying popup thingies. Suckage. Update: The new iCal is a fricking disaster. Trying to see or edit details of events is basically impossible. Extra clicks, popups obscuring your view, lack of feedback while changing dates. Someone needs to be shot.

Boot Camp: meh I gave up on dual-booting years ago in my Linux days. Virtualization for the win!

Mail, iChat, Safari: No terribly interesting changes.

Time Machine: haven’t touched it yet.

Wikipedia on Leopard


The Dictionary.app included in Mac OS X 10.5 has support for making lookups to Wikipedia, optionally in various languages.

The actual display of articles seems to be done by loading the page out of the live Wikipedia and doing some custom filtering of it. This isn’t documented to us, so I hope we don’t break it by mistake!

The searching is done via a relatively simple REST protocol to do title-prefix searches as type-ahead suggestions.

Some Apple engineers whipped up a little index search using the DARTS C++ library, with a PHP wrapper extension around it for web output. The results are wrapped in some simple HTML, pretty straightforward to handle.

Once production finally rolled out, though, we encountered some problems:

  1. The number of page titles in the system has increased to the point where a complete index for all languages barely fits in memory on a 32-bit box. I had to break the index in two (English and non-English) just to get it to generate.
  2. Performance was spotty, sometimes mysteriously hanging up for several long seconds. I suspect this is due to the huge indexes loaded in memory; every once in a while something decides to swap.

I finally got my hands on a copy of Leopard to confirm I wasn’t breaking the client, so it’s time to see what I can do…

Rather than investing more resources into the DARTS indexer, I figured I’d see if we can roll this back in with our existing tools to make it easier to maintain.

We already have a type-ahead suggestion backend, which is used for our OpenSearch interface. If you’re running Firefox 2.0 or later you can pull up the ‘Wikipedia’ search and try it out.

I did some quick testing and confirmed that it was pretty easy to make a translator that queries the OpenSearch suggestion API and formats the results for the Apple widget; I just had to add a limit option, then re-query and wrap the results.
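Roughly, the translator idea might look like the sketch below. This is only an illustration, not the actual code: the endpoint URL, the function names, and the `<ul>`/`<li>` wrapper are all assumptions (the HTML format the Dictionary client really expects isn't described here). It relies only on the MediaWiki `action=opensearch` response shape, a JSON array whose second element is the list of matching titles.

```python
import json
from html import escape
from urllib.parse import urlencode

# Assumed endpoint; any MediaWiki api.php URL would do for this sketch.
API_URL = "https://en.wikipedia.org/w/api.php"


def suggestion_url(prefix, limit=10):
    """Build an OpenSearch suggestion query URL with a result limit."""
    params = {
        "action": "opensearch",
        "search": prefix,
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def wrap_results(opensearch_json):
    """Wrap the suggested titles in simple HTML (hypothetical format)."""
    data = json.loads(opensearch_json)
    titles = data[1]  # OpenSearch responses look like [query, [titles...], ...]
    items = "".join(f"<li>{escape(title)}</li>" for title in titles)
    return f"<ul>{items}</ul>"
```

Fetching `suggestion_url("leo", 5)` and passing the response body to `wrap_results` would give the wrapped list; the network call is left out so the sketch stays self-contained.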

In my quick benchmarking, performance at least isn’t any worse, and it seems to be more consistent so far. It also gives up-to-date results, with no waiting for the next index generation.

The one big problem right now is that our suggestion search is case-sensitive, since it pulls directly from the binary-collated page title columns in our core database. That’s a minor annoyance except that the Dictionary app sends us queries which have been forced to lowercase — so you can’t easily reach titles with caps past the first letter.

Guess it’s time to bring back the title key field and get that working properly so I can switch in the new version…
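As a rough sketch of why a separate key helps, assume the title key is just a case-folded copy of the title (the real field may normalize differently). A prefix search can then run against the key and still hand back the properly capitalized title. The in-memory sorted list here stands in for a database index, and the function names are made up for illustration.

```python
import bisect


def build_title_index(titles):
    """Return (lowercased key, original title) pairs sorted by key."""
    return sorted((title.lower(), title) for title in titles)


def prefix_search(index, query, limit=10):
    """Case-insensitive prefix search over the sorted key index."""
    key = query.lower()
    # (key,) sorts before any (key, title) pair, so this finds the first
    # entry whose key is >= the query.
    start = bisect.bisect_left(index, (key,))
    results = []
    for entry_key, title in index[start:]:
        if not entry_key.startswith(key) or len(results) >= limit:
            break
        results.append(title)
    return results
```

With this, a lowercased query like "ipo" can still reach a title such as "iPod", which a case-sensitive match on the raw title column would miss.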

HD aspect apathy

In the old days, TV was easy. It always looked like this:

4×3-in-4×3.jpg

If you were snobby about your movies, you might get the widescreen letterbox version (on laserdisc if you were rich, or on VHS if you were me), which slapped some black bars on the top and bottom of your screen as filler:

16×9-in-4×3.jpg

In the brave new world of HDTV, everything’s supposed to be all wide and beautiful, so you don’t have to worry about whether to get the widescreen edition:

16×9-in-16×9.jpg

Cool, right?

Well it would be if everything switched at once. We’ve still got 50 years of standard-def programs, another 50 years of non-widescreen movies before that, and plenty of channels that haven’t even upgraded to broadcast in high-def.

So some of your programs look like this:

16×9-in-16×9.jpg

and some of them look like this:

4×3-in-16×9.jpg

Hey, we can live with that. There’s just one problem… a lot of shows are shooting widescreen like this:

16×9-in-16×9.jpg

…but broadcasting in letterbox on standard-def channels like this…

16×9-in-4×3.jpg

…which show up on your fancy high-def set like this:

16×9-in-4×3-in-16×9.jpg

Hey! Now the black bars are all the way around — don’t you feel cheated? The picture’s so tiny!
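For the curious, here's a quick back-of-the-envelope sketch of just how tiny. It assumes each stage scales the picture to fit without cropping or stretching; the function names are made up for illustration.

```python
from fractions import Fraction

WIDE = Fraction(16, 9)
STANDARD = Fraction(4, 3)


def fit_scale(content, frame):
    """Fractions of the frame's (width, height) used when content is fit inside it."""
    if content >= frame:
        # Content is wider than the frame: full width, bars top and bottom.
        return Fraction(1), frame / content
    # Content is narrower than the frame: full height, bars left and right.
    return content / frame, Fraction(1)


def nested_area(inner, middle, outer):
    """Fraction of the outer screen's area covered by inner-in-middle-in-outer."""
    w1, h1 = fit_scale(middle, outer)
    w2, h2 = fit_scale(inner, middle)
    return (w1 * w2) * (h1 * h2)
```

`nested_area(WIDE, STANDARD, WIDE)` works out to 9/16: the picture is three quarters of the width and three quarters of the height, so only a bit over half the screen is actually in use.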

For the first couple years after I got my little high-def set, I always dutifully fiddled with the zoom settings when I came across a letterboxed program. For some mysterious reason, you usually get wayyyy too many options, like this:

16×9-in-4×3-in-16×9.jpg

Ok, we start with our picture too tiny…

letterbox-toowide.jpg

Crap, now it’s all stretchy!

16×9-in-16×9.jpg

Ahh, now it’s just right.

letterbox-toobig.jpg

Click again and sometimes you get it so zoomed in that you can only see the middle! WTF?

16×9-in-4×3-in-16×9.jpg

…and back to full view.

Every time I switched channels I’d have to go through this dance, trying to cycle back to whichever mode was right for that particular program. Needless to say this makes TV watching a bit of a bother.

After my fiancée and I moved in together, I was horrified to discover that she never touched the zoom button. If there were bars around the whole screen, she didn’t care. If I’d left it zoomed in and the top and bottom of the program were cut off, it didn’t bother her none. I’d just sit there until the urge overpowered me and I’d grab the remote control to adjust the zoom while she rolled her eyes.

When we moved from Florida to California we changed cable companies… new cable box, new remote… To my horror, the zoom button didn’t work! I frantically pushed it for a good minute, until I realized that the bars bother me less than the zoom dance.

I have attained HD zoom apathy, and it is good.

All I want for Christmas…

Our stuff finally came in from Florida the other day, so every day’s going to be like X-mas for a while. ;)

In the bigger picture, though, I’d love for California to finally get its act together on high-speed rail connecting the state’s metropolitan areas.

California’s not a small state; in land area and population it’s comparable to large European countries. Getting from San Francisco down to the Los Angeles suburbs where my family lives is about a 400-mile drive (600ish km), a weary 6-8 hours on the highway. It’s faster by air — but for the hour you spend on the plane you’ll spend two in the airport, and you’ll have to add local transport on the other end to your ticket price.

There needs to be something in between, and that spot’s just begging for a modern high-speed rail system.

Less check-in and security hassle than the airport means it’ll be as fast as flying at a lower price point, and for the greenies it won’t be burning jet fuel the whole way. ;)

Metro subways and commuter rail have good traction in many US cities, but long-distance rail is something most Americans today don’t take seriously. And is it any wonder? Today’s Amtrak combines the low speed of a cross-country road trip with the high price of air travel…

When I was planning my move here, a German comrade jokingly suggested I go by rail instead of flying, as it was “only in-country”. ;) For fun I looked it up; taking Amtrak passenger rail from Orlando, Florida to San Francisco, California would take 93 hours of actual travel time, plus stopovers between the three separate rail lines it would take. Our direct flight took just 6 hours… and cost about the same, around $200 a seat.

Cal-ih-for-nye-ay!

I’m pretty much up and running in San Francisco, and officially back to work as of today. I’ll be working from home for a couple more weeks until the office space is up and running…

Our stuff hasn’t arrived from Florida yet, so the apartment’s a little empty, but we’re intact and online with two cats, an air mattress, and a laptop. Desks? Bah! Who needs desks? :)

R.I.P. Casio Exilim

So I’m smack in the middle of moving cross-country, packing up all our stuff to load on a truck, never to be seen again… er, which will arrive in a few days at our new home. Time to take pictures of everything for reference and to compare potential damage.

Naturally this is the perfect time for my camera to die… Especially since *two days previously* I’d sold our barely-used spare camera on eBay.

Well, I guess I’ve got a new camera fund from the proceeds. :D

The old girl had served me well for two and a half years, through Los Angeles, Frankfurt, Berlin, Boston, Copenhagen, Stockholm, Tampa, Taipei and San Francisco.

Perhaps the temperature and humidity extremes finally took their toll; the LCD screen just gave up the ghost and displays shiny colors instead of something useful like… a screen… which makes it a bit hard to aim, select options in the menu, etc.

P.S. The Golden Compass is awesome!

Update: Replaced the old cam with a Canon PowerShot SD870 IS… about the same size as the old one, but fixing most of my issues with the old Exilim:

  • Auto-rotation — the camera detects its orientation and marks the photos as rotated accordingly. I used to spend lots of time with my old one going through and rotating half of my pics after a shoot.
  • Better in low light shooting (eg indoor or evening) — higher sensitivity (ISO 1600 vs 400), image stabilization, and a mode to auto-adjust the ISO a bit higher
  • Continuous shooting mode — if I invest in new memory cards, it’ll take high-speed ones which should allow a better frame rate for those “I hope that cool thing happens while the shutter’s open” moments.
  • Better video quality — video mode shoots up to 640×480 30fps (vs 320×240 15fps). Like my old cam it’s saved as Motion-JPEG, which is very space inefficient, but I can recode for archival if I start using video mode more.
  • Time-lapse mode on the video! Neat. :D
  • Doesn’t need a dock to connect to the computer or charge.
  • More megapixels (8 vs 5) — yawn. I’m rarely going to *need* that many pixels. ;)

The menu and controls are a bit different but I’ll get the hang of it.

One minor nit is that it doesn’t look like it’ll charge the battery through USB; that would be *very* nice when traveling. The default kit includes a separate battery charger rather than an AC adaptor for the camera itself, but it’s blissfully compact and according to the specs it should work on 220 volts, so traveling in Europe should be ok.

Random tests

We used to have a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaking it got improved enough that we don’t hear much about it, but it’s still not very even. To get an idea of how big the problem remains, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)

I did these tests by grabbing random page_id values, with enough runs to select each page 100 times given a perfectly even distribution.

The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.
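In rough Python terms, the selection described above might look like the sketch below. A sorted in-memory list stands in for the database index, the names are hypothetical, and wrapping around to the first entry when nothing is found is my assumption here, not something confirmed for the real code.

```python
import bisect
import random


def pick_page(pages, rng=random):
    """Pick a page_id from (page_random, page_id) pairs sorted by page_random."""
    r = rng.random()
    # First entry whose page_random is greater than or equal to r.
    i = bisect.bisect_left(pages, (r,))
    if i == len(pages):
        i = 0  # assumed fallback: nothing at or above r, so wrap around
    return pages[i][1]
```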

But it’s not perfect, as there are going to be gaps of differing sizes between entries, and entries with large gaps before them will be predisposed to get more hits. In my test db, the most-frequently picked pages are five times as likely to be selected as ideal, and other pages are very unlikely to come up.
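To make the gap effect concrete, here is a small sketch that computes each entry's chance of being picked directly from the gap in front of it. It assumes distinct page_random values and that draws above the largest value wrap around to the first entry; both are assumptions for illustration, and the function name is made up.

```python
def selection_probabilities(page_randoms):
    """Map each page_random value to its chance of being selected."""
    values = sorted(page_randoms)
    probs = {}
    prev = 0.0
    for value in values:
        # A draw lands on this entry when it falls in the gap just below it.
        probs[value] = value - prev
        prev = value
    # Assumed wraparound: draws above the largest value go to the first entry.
    probs[values[0]] += 1.0 - values[-1]
    return probs
```

For values 0.1, 0.2 and 0.7, the entry at 0.7 sits behind a gap of 0.5 and gets picked about half the time, while the entry at 0.2 only comes up about one time in ten.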

(The first graph is sorted by page_id; the second by hits.)

 513 |  *                                                                                                                                                                             
 487 |  *                                                                                                                                                                             
 461 |  *                                                                                                                                                                             
 436 |  *            *                                                                                                                                                                
 410 | **            *                   *                                                                                                                                            
 384 | **            *                   *                                                                                                                                            
 359 | **            *                   *                                                                                                                                            
 333 | ****          *   *               *                           *                  *                                                                                             
 307 | ****   **   * *   *            *  *                           *                  *    *                                                                                        
 282 | ****  ***   * *   *     *      *  *       *                  **                  *    *                                                                                        
 256 | ****  ***   * *   ** *  *      *  *     * *                  **                  *    *                            * *       *                                                 
 230 | ****  ***   * *   ** *  *      *  *     * *  *               **                  *    *                            * *       *                                                 
 205 | ****  ***  ** *   ** *  *      *  *   * * ** *               **     *     *      *    *                            * *       *                                                 
 179 | ****  ***  ** * * ** *  *    * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 153 | **** ****  ** * * ** * ** *  * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 128 | **** ****  ** * **** * ** ** * *  **  * * ** **     * **     **     * * * *      *  * **             *       ** *  * *       *    *                                            
 102 |***** ****  ** * **** **** **** *  **  * **** **   * * ****   **   *** * * *  *   * ** **            **    *  ** *  * *       *    **                                           
  76 |**********  ** * **** ***********  **  ****** ** * *** ****   **** *** * * *  **  * ***** * *  *   * **    *  ** *  ***       **** **   **      *     *                         
  51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** *  * *** **  ***  ** *  ***  ********* ***  **   ** *     *  *                      
  25 |************************************** *************** *************** ********* ********** **** *********** ****** ******************  **  *** * **  * **   * ***** *          
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 513 |                                                                                                                                                                               *
 487 |                                                                                                                                                                               *
 461 |                                                                                                                                                                               *
 436 |                                                                                                                                                                              **
 410 |                                                                                                                                                                            ****
 384 |                                                                                                                                                                            ****
 359 |                                                                                                                                                                            ****
 333 |                                                                                                                                                                       *********
 307 |                                                                                                                                                                  **************
 282 |                                                                                                                                                              ******************
 256 |                                                                                                                                                        ************************
 230 |                                                                                                                                                       *************************
 205 |                                                                                                                                                  ******************************
 179 |                                                                                                                                          **************************************
 153 |                                                                                                                                       *****************************************
 128 |                                                                                                                             ***************************************************
 102 |                                                                                                               *****************************************************************
  76 |                                                                                         ***************************************************************************************
  51 |                                                             *******************************************************************************************************************
  25 |                                ************************************************************************************************************************************************
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For comparison, if we used page_id in place of page_random as the selection index, the distribution would be *much* worse. ;)

 13962 |*                                                                                                                                                                                
 13263 |*                                                                                                                                                                                
 12565 |*                                                                                                                                                                                
 11867 |*                                                                                                                                                                                
 11169 |*                                                                                                                                                                                
 10471 |*                                                                                                                                                                                
  9773 |*                                                                                                                                                                                
  9075 |*                                                                                                                                                                                
  8377 |*                                                                                                                                                                                
  7679 |*                                                                                                                                                                                
  6981 |*                                                                                                                                                                                
  6282 |*                                                                                                                                                                                
  5584 |*                                                                                                                                                                                
  4886 |*                                                                                                                                                                                
  4188 |*                                                                                                                                                                                
  3490 |*                                                                                                                                                                                
  2792 |*                                                                                                                                                                                
  2094 |*                                                                                                                                                                                
  1396 |*                                                                                                                                                                                
   698 |**                                                                                                                                                                               
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 13962 |                                                                                                                                                                                *
 13263 |                                                                                                                                                                                *
 12565 |                                                                                                                                                                                *
 11867 |                                                                                                                                                                                *
 11169 |                                                                                                                                                                                *
 10471 |                                                                                                                                                                                *
  9773 |                                                                                                                                                                                *
  9075 |                                                                                                                                                                                *
  8377 |                                                                                                                                                                                *
  7679 |                                                                                                                                                                                *
  6981 |                                                                                                                                                                                *
  6282 |                                                                                                                                                                                *
  5584 |                                                                                                                                                                                *
  4886 |                                                                                                                                                                                *
  4188 |                                                                                                                                                                                *
  3490 |                                                                                                                                                                                *
  2792 |                                                                                                                                                                                *
  2094 |                                                                                                                                                                                *
  1396 |                                                                                                                                                                                *
   698 |                                                                                                                                                                               **
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:
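A minimal sketch of that retry approach, assuming a set of valid IDs stands in for the database check on namespace and redirect status. The names and the max_tries cutoff are hypothetical, not part of the actual code.

```python
import random


def pick_by_retry(valid_ids, max_id, rng=random, max_tries=1000):
    """Keep drawing random page_id values until one is a valid page."""
    for _ in range(max_tries):
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        if candidate in valid_ids:
            return candidate
    raise RuntimeError("no valid page found; the ID space may be too sparse")
```

Since every ID is equally likely on each draw, every valid page ends up equally likely too, no matter how the gaps are laid out. The cost is extra lookups when most IDs are invalid.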

 84 |                                                                                                                                                          *                      
 79 |                                                                        *                                                          *                      *                      
 75 |      *                                      *    *        *     *      *                    *     *                        *      *                  *   *                      
 71 |      *                *         *           *    *     *  *   * **     *            *       *   * *                        ** *   *       *     *    *   *                   *  
 67 | * *  **               **       ***        * **   *** * * ** * * ***    *     *    ***   *   **  * *  * *      * *    * *   ****   *       *    **    * * *                * **  
 63 | * ** ** * * * *    ***** *  * *****    * ** ***  *** * * ** ********   *  *  * * ****  **   ***** ** * *   *  * *    * * * **** * *       *    **  * * * *  *** **        * *** 
 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** *******  *****  *** * **** **** ** *** ** ** **** * *  *** **  **    ***** 
 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * *******  ***** ***** 
 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 
 46 |************************************************************************************************************************************************************************** ******
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 84 |                                                                                                                                                                                *
 79 |                                                                                                                                                                              ***
 75 |                                                                                                                                                                     ************
 71 |                                                                                                                                                         ************************
 67 |                                                                                                                           ******************************************************
 63 |                                                                                        *****************************************************************************************
 58 |                                        *****************************************************************************************************************************************
 54 |                *****************************************************************************************************************************************************************
 50 |     ****************************************************************************************************************************************************************************
 46 | ********************************************************************************************************************************************************************************
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.

On the move

This month’s been a busy one… at Wikimedia we’ve done a lot of hopping around getting things together for the fundraiser, and my lady and I spent a few days out west scouting apartments in San Francisco for the relocation. I’m finally caught up with MediaWiki code review, and some dreaded coughing plague from the trip is catching up with me, so it’s time to hit the posting backlog… :D

Found a nice little flat that’s merely twice the cost of the places we were eyeing in Tampa… believe it or not, that is a good deal for SF. :) We’re going to try compensating for part of the cost by going car-free.

Public transit in the city is pretty decent — especially compared to suburban Florida. Most places in the city are within a mile of a light rail line, and buses are both plentiful and frequent.

Further, San Francisco is a fairly car-hostile city. The hills make driving less than fun in many neighborhoods, and parking costs are atrocious. We were in town during municipal elections and watched the city’s voters reject a ballot measure to increase the amount of parking, in favor of allocating more funds for public transit. Even just parking at home would cost us $100 a month to rent a space in the carport out back!

The savings on insurance, parking, maintenance, and payments on a car whose transmission won’t die in the hills should more than pay for the occasional rental for trips out of town. Depending on how much we end up driving, we may actually be saving money versus living in the suburbs… we’d have cheaper rents out in Walnut Creek or Pleasant Hill, but more need to keep a car for everyday tasks.

Wiki dumps… in-dump revision diffs?

In breaks between fundraiser stuff I’m investigating patching up the dumps to behave nicer. The biggest problem to date has been how to get full-history dumps generated in a reasonable amount of time and with greater reliability.

As previously explored, the compression of the files is itself a pretty big part of the burden; cleaning up the bottleneck here could allow improvements in the other processing to shine. Effective compression takes a lot of CPU, though, especially the 7-zip LZMA that does so well on the history dumps.

An idea that gets tossed around from time to time is storing diffs of text revision-to-revision; most edits only change a paragraph or two, so storing only the change can save a lot of space. Any differential system introduces complexity and could be fragile, but it ain’t an awful idea.

Our own internal storage has a frightening amalgam of external database shards, batch compression, and character encoding conversion, which is something we try to hide by doing the dumps as version-independent XML. :)

I’m experimenting a bit with hacking something that looks more or less like a standard unified diff into the exporter; a re-patcher for it should be fairly easy to implement on the import side.
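To make the export/import round trip concrete, here is a minimal sketch in Python rather than the PHP the exporter actually uses. It leans on the standard library’s difflib for the diff side; the re-patcher, and the names make_diff and apply_diff, are purely illustrative and not anything from MediaWiki. It assumes revisions are handled as lists of lines without trailing newlines.

```python
import difflib
import re

# Matches hunk headers such as "@@ -3,2 +3,4 @@" or "@@ -1 +1 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def make_diff(old_lines, new_lines):
    """Export side: unified diff between two revisions, as a list of lines."""
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))


def apply_diff(old_lines, diff_lines):
    """Import side: rebuild the new revision from the old one plus a diff."""
    out = []
    pos = 0           # index of the next old line not yet consumed
    in_hunk = False   # the "---"/"+++" file header precedes the first hunk
    for line in diff_lines:
        header = HUNK_RE.match(line)
        if header:
            start = int(header.group(1))
            length = 1 if header.group(2) is None else int(header.group(2))
            # An empty old range ("-N,0") names the line before the insertion.
            hunk_start = start if length == 0 else start - 1
            out.extend(old_lines[pos:hunk_start])  # copy untouched lines
            pos = hunk_start
            in_hunk = True
        elif not in_hunk:
            continue  # skip the file header lines
        elif line.startswith("+"):
            out.append(line[1:])
        elif line.startswith("-") or line.startswith(" "):
            # Removed and context lines must both match the base revision.
            if pos >= len(old_lines) or old_lines[pos] != line[1:]:
                raise ValueError("diff does not match the base revision")
            if line.startswith(" "):
                out.append(old_lines[pos])
            pos += 1
    out.extend(old_lines[pos:])  # copy whatever follows the last hunk
    return out
```

The check on removed and context lines is where the fragility shows up: a diff applied against the wrong base revision fails loudly instead of quietly producing garbage text.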

Testing with a tiny chunk of the English Wikipedia which contains a few thousand revisions of the Anarchism article… the diff-laden version is about 18M for the 3687-revision file, versus 194M for the fully expanded version, so less than a tenth the size.

Not bad. :)

7-Zip compresses them both down to about 408K… but the smaller file takes a tenth of the time to get there. Even gzip and bzip2 do an order of magnitude better compressing the smaller files.
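For anyone wanting to poke at this kind of comparison themselves, here is a rough Python sketch. Everything in it is an assumption for illustration: the page history is synthetic (one paragraph changed per revision), the function names are made up, and Python’s lzma module writes .xz rather than 7-Zip’s own container, so it only stands in for the LZMA family. It is not the test I actually ran and will not reproduce the numbers above.

```python
import bz2
import difflib
import gzip
import lzma
import time


def synthetic_history(revisions=200, paragraphs=100):
    """Fake page history: each revision rewrites exactly one paragraph."""
    page = [f"Paragraph {i}: some stable article text." for i in range(paragraphs)]
    history = []
    for rev in range(revisions):
        page = list(page)
        target = rev % paragraphs
        page[target] = f"Paragraph {target}: edited in revision {rev}."
        history.append(page)
    return history


def full_dump(history):
    """Every revision written out in full, as the dumps do today."""
    return "\n".join("\n".join(rev) for rev in history).encode("utf-8")


def diff_dump(history):
    """First revision in full, then a unified diff for each later one."""
    parts = ["\n".join(history[0])]
    for old, new in zip(history, history[1:]):
        parts.append("\n".join(difflib.unified_diff(old, new, lineterm="")))
    return "\n".join(parts).encode("utf-8")


def measure(data):
    """Return {compressor name: (compressed size, seconds taken)}."""
    compressors = (
        ("gzip", gzip.compress),
        ("bzip2", bz2.compress),
        ("lzma", lzma.compress),
    )
    results = {}
    for name, compress in compressors:
        started = time.perf_counter()
        packed = compress(data)
        results[name] = (len(packed), time.perf_counter() - started)
    return results


if __name__ == "__main__":
    history = synthetic_history()
    for label, data in (("full", full_dump(history)), ("diff", diff_dump(history))):
        print(label, len(data), "bytes raw")
        for name, (size, seconds) in measure(data).items():
            print(f"  {name}: {size} bytes in {seconds:.4f}s")
```

Timings on a toy input like this are noisy, so the raw and compressed sizes are the more trustworthy part of the output.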

My first pass adapted the PHP diff class we use for in-wiki diffs… It’s a bit sluggish, but combined with bzip2 compression it beats the diffless version by some margin. Using a faster C++ diff and fixing up the output to be actually usable, this might save a lot of time…

Of course all software using the dumps would have to be updated to understand the diff bits, and I’ll have to decide between in-text diff formatting or light XML markup… :)
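As a strawman for the “light XML markup” option, here is a small Python sketch of how a reader might tell diffed revisions from full ones. The type="diff" and base attributes on the text element are hypothetical, invented here for illustration and not part of any existing export format; the helper names are made up too.

```python
import xml.etree.ElementTree as ET


def revision_element(rev_id, body, base_id=None):
    """Build a <revision>; if base_id is given, body is a diff against it."""
    rev = ET.Element("revision")
    ET.SubElement(rev, "id").text = str(rev_id)
    text = ET.SubElement(rev, "text")
    if base_id is not None:
        text.set("type", "diff")       # hypothetical attribute
        text.set("base", str(base_id)) # hypothetical attribute
    text.text = body
    return rev


def read_revision(rev):
    """Parse a <revision> back into a plain dict for the importer."""
    text = rev.find("text")
    base = text.get("base")
    return {
        "id": int(rev.findtext("id")),
        "is_diff": text.get("type") == "diff",
        "base": int(base) if base is not None else None,
        "body": text.text or "",
    }


if __name__ == "__main__":
    diffed = revision_element(2, "@@ -1 +1 @@\n-old line\n+new line", base_id=1)
    print(ET.tostring(diffed, encoding="unicode"))
    print(read_revision(diffed))
```

A reader that does not know about the attributes would still parse the file but would treat the diff body as article text, which is exactly why all the software using the dumps would need updating.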