Random tests

We used to get a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaks it improved enough that we don’t hear much about it anymore, but the selection is still not very even. To get an idea of how big the problem still is, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)

I did these tests by repeatedly grabbing a random page and tallying the page_id values that came back, with enough runs that every page would be selected 100 times under a perfectly even distribution.

The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.
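As a rough sketch of that lookup, here it is in Python rather than MediaWiki’s actual PHP and SQL. The function names are hypothetical, and wrapping around to the lowest entry when nothing sits at or above the random pick is an assumption made for illustration, not a description of the real code.

```python
import bisect
import random


def build_index(page_ids, rng):
    """Give each page a random key in [0, 1) and sort by it,
    standing in for a database index on page_random."""
    entries = sorted((rng.random(), pid) for pid in page_ids)
    keys = [key for key, _ in entries]
    ids = [pid for _, pid in entries]
    return keys, ids


def pick_random_page(keys, ids, rng):
    """Pick a random float and return the first page whose key is >= it."""
    target = rng.random()
    i = bisect.bisect_left(keys, target)  # first position with keys[i] >= target
    if i == len(keys):
        # Assumption: nothing at or above the target, so wrap to the lowest key.
        i = 0
    return ids[i]


rng = random.Random()
keys, ids = build_index(range(1, 11), rng)
print(pick_random_page(keys, ids, rng))
```

The sorted list plus binary search plays the role of the indexed query here; a real database does the equivalent walk through its index on page_random.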

But it’s not perfect: the gaps between neighboring page_random values differ in size, and a page with a large gap before it catches every random pick that lands in that gap, so it gets more hits. In my test db, the most frequently picked pages come up about five times as often as they would under an even distribution, while other pages are very unlikely to come up at all.
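To make the gap effect concrete: a page’s chance of being picked is simply the width of the gap just below its page_random value. The sketch below works that out for a made-up set of 1000 random keys (not my test database) and scales it to the 100-hits-per-page setup used above. The function name is hypothetical, and the wrap-around for picks above the highest key is again an assumption.

```python
import random


def selection_probabilities(keys):
    """keys: sorted page_random values in [0, 1).
    Returns each entry's chance of being the first key >= a uniform pick."""
    probs = []
    prev = 0.0
    for key in keys:
        probs.append(key - prev)  # picks landing in (prev, key] select this entry
        prev = key
    # Assumption: picks above the highest key wrap around to the first entry.
    probs[0] += 1.0 - keys[-1]
    return probs


rng = random.Random(42)
n = 1000
keys = sorted(rng.random() for _ in range(n))
probs = selection_probabilities(keys)

# With 100 * n runs, a perfectly even distribution gives every page 100 hits.
expected_hits = [100 * n * p for p in probs]
print("most favored page:  %.1f expected hits" % max(expected_hits))
print("least favored page: %.3f expected hits" % min(expected_hits))
```

For purely random keys, the largest gap tends to be around ln(n) times the average gap, so a several-fold spread between the luckiest pages and the ideal is what you would expect from this scheme even before any database weirdness comes into it.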

(The first graph is sorted by page_id; the second by hits.)

 513 |  *                                                                                                                                                                             
 487 |  *                                                                                                                                                                             
 461 |  *                                                                                                                                                                             
 436 |  *            *                                                                                                                                                                
 410 | **            *                   *                                                                                                                                            
 384 | **            *                   *                                                                                                                                            
 359 | **            *                   *                                                                                                                                            
 333 | ****          *   *               *                           *                  *                                                                                             
 307 | ****   **   * *   *            *  *                           *                  *    *                                                                                        
 282 | ****  ***   * *   *     *      *  *       *                  **                  *    *                                                                                        
 256 | ****  ***   * *   ** *  *      *  *     * *                  **                  *    *                            * *       *                                                 
 230 | ****  ***   * *   ** *  *      *  *     * *  *               **                  *    *                            * *       *                                                 
 205 | ****  ***  ** *   ** *  *      *  *   * * ** *               **     *     *      *    *                            * *       *                                                 
 179 | ****  ***  ** * * ** *  *    * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 153 | **** ****  ** * * ** * ** *  * *  *   * * ** **       *      **     *   * *      *  * **                     *     * *       *                                                 
 128 | **** ****  ** * **** * ** ** * *  **  * * ** **     * **     **     * * * *      *  * **             *       ** *  * *       *    *                                            
 102 |***** ****  ** * **** **** **** *  **  * **** **   * * ****   **   *** * * *  *   * ** **            **    *  ** *  * *       *    **                                           
  76 |**********  ** * **** ***********  **  ****** ** * *** ****   **** *** * * *  **  * ***** * *  *   * **    *  ** *  ***       **** **   **      *     *                         
  51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** *  * *** **  ***  ** *  ***  ********* ***  **   ** *     *  *                      
  25 |************************************** *************** *************** ********* ********** **** *********** ****** ******************  **  *** * **  * **   * ***** *          
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 513 |                                                                                                                                                                               *
 487 |                                                                                                                                                                               *
 461 |                                                                                                                                                                               *
 436 |                                                                                                                                                                              **
 410 |                                                                                                                                                                            ****
 384 |                                                                                                                                                                            ****
 359 |                                                                                                                                                                            ****
 333 |                                                                                                                                                                       *********
 307 |                                                                                                                                                                  **************
 282 |                                                                                                                                                              ******************
 256 |                                                                                                                                                        ************************
 230 |                                                                                                                                                       *************************
 205 |                                                                                                                                                  ******************************
 179 |                                                                                                                                          **************************************
 153 |                                                                                                                                       *****************************************
 128 |                                                                                                                             ***************************************************
 102 |                                                                                                               *****************************************************************
  76 |                                                                                         ***************************************************************************************
  51 |                                                             *******************************************************************************************************************
  25 |                                ************************************************************************************************************************************************
   0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For comparison, doing the same lookup on page_id instead of page_random gives a *much* worse distribution. ;)

 13962 |*                                                                                                                                                                                
 13263 |*                                                                                                                                                                                
 12565 |*                                                                                                                                                                                
 11867 |*                                                                                                                                                                                
 11169 |*                                                                                                                                                                                
 10471 |*                                                                                                                                                                                
  9773 |*                                                                                                                                                                                
  9075 |*                                                                                                                                                                                
  8377 |*                                                                                                                                                                                
  7679 |*                                                                                                                                                                                
  6981 |*                                                                                                                                                                                
  6282 |*                                                                                                                                                                                
  5584 |*                                                                                                                                                                                
  4886 |*                                                                                                                                                                                
  4188 |*                                                                                                                                                                                
  3490 |*                                                                                                                                                                                
  2792 |*                                                                                                                                                                                
  2094 |*                                                                                                                                                                                
  1396 |*                                                                                                                                                                                
   698 |**                                                                                                                                                                               
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 13962 |                                                                                                                                                                                *
 13263 |                                                                                                                                                                                *
 12565 |                                                                                                                                                                                *
 11867 |                                                                                                                                                                                *
 11169 |                                                                                                                                                                                *
 10471 |                                                                                                                                                                                *
  9773 |                                                                                                                                                                                *
  9075 |                                                                                                                                                                                *
  8377 |                                                                                                                                                                                *
  7679 |                                                                                                                                                                                *
  6981 |                                                                                                                                                                                *
  6282 |                                                                                                                                                                                *
  5584 |                                                                                                                                                                                *
  4886 |                                                                                                                                                                                *
  4188 |                                                                                                                                                                                *
  3490 |                                                                                                                                                                                *
  2792 |                                                                                                                                                                                *
  2094 |                                                                                                                                                                                *
  1396 |                                                                                                                                                                                *
   698 |                                                                                                                                                                               **
     0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:

 84 |                                                                                                                                                          *                      
 79 |                                                                        *                                                          *                      *                      
 75 |      *                                      *    *        *     *      *                    *     *                        *      *                  *   *                      
 71 |      *                *         *           *    *     *  *   * **     *            *       *   * *                        ** *   *       *     *    *   *                   *  
 67 | * *  **               **       ***        * **   *** * * ** * * ***    *     *    ***   *   **  * *  * *      * *    * *   ****   *       *    **    * * *                * **  
 63 | * ** ** * * * *    ***** *  * *****    * ** ***  *** * * ** ********   *  *  * * ****  **   ***** ** * *   *  * *    * * * **** * *       *    **  * * * *  *** **        * *** 
 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** *******  *****  *** * **** **** ** *** ** ** **** * *  *** **  **    ***** 
 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * *******  ***** ***** 
 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 
 46 |************************************************************************************************************************************************************************** ******
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 84 |                                                                                                                                                                                *
 79 |                                                                                                                                                                              ***
 75 |                                                                                                                                                                     ************
 71 |                                                                                                                                                         ************************
 67 |                                                                                                                           ******************************************************
 63 |                                                                                        *****************************************************************************************
 58 |                                        *****************************************************************************************************************************************
 54 |                *****************************************************************************************************************************************************************
 50 |     ****************************************************************************************************************************************************************************
 46 | ********************************************************************************************************************************************************************************
 42 |*********************************************************************************************************************************************************************************
 37 |*********************************************************************************************************************************************************************************
 33 |*********************************************************************************************************************************************************************************
 29 |*********************************************************************************************************************************************************************************
 25 |*********************************************************************************************************************************************************************************
 21 |*********************************************************************************************************************************************************************************
 16 |*********************************************************************************************************************************************************************************
 12 |*********************************************************************************************************************************************************************************
  8 |*********************************************************************************************************************************************************************************
  4 |*********************************************************************************************************************************************************************************
  0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.
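A minimal sketch of the retry approach and its cost, assuming the set of valid page_ids is known up front (in practice each attempt would be a database lookup). The names `pick_by_retry` and `expected_attempts` and the `max_attempts` cap are hypothetical, added for illustration.

```python
import random


def pick_by_retry(valid_ids, max_id, rng, max_attempts=10_000):
    """Keep drawing page_ids from 1..max_id until one is in valid_ids.
    Returns (page_id, number_of_attempts)."""
    for attempt in range(1, max_attempts + 1):
        candidate = rng.randint(1, max_id)  # inclusive on both ends
        if candidate in valid_ids:
            return candidate, attempt
    raise RuntimeError("no valid page found within max_attempts")


def expected_attempts(max_id, valid_count):
    """Average number of draws: each draw succeeds with chance
    valid_count / max_id, so the mean is the inverse of that."""
    return max_id / valid_count


rng = random.Random()
valid = set(range(5, 100_000, 100))  # made-up namespace: 1 in 100 ids is valid
page_id, attempts = pick_by_retry(valid, 100_000, rng)
print(page_id, attempts, expected_attempts(100_000, len(valid)))
```

Every valid page is equally likely here, because each draw is uniform over the id range and invalid draws are simply thrown away. The price is the number of draws: if only 1 in 100 ids matches the criteria, it takes about 100 lookups per pick on average.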

On the move

This month’s been a busy one… at Wikimedia we’ve done a lot of hopping around getting things together for the fundraiser, and my lady and I spent a few days out west scouting apartments in San Francisco for the relocation. I’m finally caught up with MediaWiki code review, and some dreaded coughing plague from the trip is catching up with me, so it’s time to hit the posting backlog… :D

Found a nice little flat that’s merely twice the cost of the places we were eying in Tampa… believe it or not that is a good deal for SF. :) We’re going to try compensating for part of the cost by going car-free.

Public transit in the city is pretty decent — especially compared to suburban Florida. Most places in the city are within a mile of a light rail line, and buses are both plentiful and frequent.

Further, San Francisco is a fairly car-hostile city. The hills make driving less than fun in many neighborhoods, and parking costs are atrocious. We were in town during municipal elections and watched the city’s voters reject a ballot measure to increase the amount of parking, in favor of allocating more funds for public transit. Even just parking at home would cost us $100 a month to rent a space in the carport out back!

The savings on insurance, parking, maintenance, and payments on a car whose transmission won’t die in the hills should more than pay for the occasional rental for trips out of town. Depending on how much we end up driving, we may actually be saving money versus living in the suburbs… we’d have cheaper rents out in Walnut Creek or Pleasant Hill, but more need to keep a car for everyday tasks.