We used to have a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaking it got improved enough that we don’t hear much about it, but it’s still not very even. To get an idea of how big the problem remains, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)
I did these tests by grabbing random page_id values, with enough runs to select each page 100 times given a perfectly even distribution.
The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.
But it’s not perfect, as there are going to be gaps of differing sizes between entries, and entries with large gaps before them will be predisposed to get more hits. In my test db, the most-frequently picked pages are five times as likely to be selected as ideal, and other pages are very unlikely to come up.
(The first graph is sorted by page_id; the second by hits.)
513 | * 487 | * 461 | * 436 | * * 410 | ** * * 384 | ** * * 359 | ** * * 333 | **** * * * * * 307 | **** ** * * * * * * * * 282 | **** *** * * * * * * * ** * * 256 | **** *** * * ** * * * * * * ** * * * * * 230 | **** *** * * ** * * * * * * * ** * * * * * 205 | **** *** ** * ** * * * * * * ** * ** * * * * * * * 179 | **** *** ** * * ** * * * * * * * ** ** * ** * * * * * ** * * * * 153 | **** **** ** * * ** * ** * * * * * * ** ** * ** * * * * * ** * * * * 128 | **** **** ** * **** * ** ** * * ** * * ** ** * ** ** * * * * * * ** * ** * * * * * 102 |***** **** ** * **** **** **** * ** * **** ** * * **** ** *** * * * * * ** ** ** * ** * * * * ** 76 |********** ** * **** *********** ** ****** ** * *** **** **** *** * * * ** * ***** * * * * ** * ** * *** **** ** ** * * 51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** * * *** ** *** ** * *** ********* *** ** ** * * * 25 |************************************** *************** *************** ********* ********** **** *********** ****** ****************** ** *** * ** * ** * ***** * 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
513 | * 487 | * 461 | * 436 | ** 410 | **** 384 | **** 359 | **** 333 | ********* 307 | ************** 282 | ****************** 256 | ************************ 230 | ************************* 205 | ****************************** 179 | ************************************** 153 | ***************************************** 128 | *************************************************** 102 | ***************************************************************** 76 | *************************************************************************************** 51 | ******************************************************************************************************************* 25 | ************************************************************************************************************************************************ 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
As an example using page_id instead of page_random, the distribution is *much* worse. ;)
13962 |* 13263 |* 12565 |* 11867 |* 11169 |* 10471 |* 9773 |* 9075 |* 8377 |* 7679 |* 6981 |* 6282 |* 5584 |* 4886 |* 4188 |* 3490 |* 2792 |* 2094 |* 1396 |* 698 |** 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
13962 | * 13263 | * 12565 | * 11867 | * 11169 | * 10471 | * 9773 | * 9075 | * 8377 | * 7679 | * 6981 | * 6282 | * 5584 | * 4886 | * 4188 | * 3490 | * 2792 | * 2094 | * 1396 | * 698 | ** 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:
84 | * 79 | * * * 75 | * * * * * * * * * * * * 71 | * * * * * * * * ** * * * * * ** * * * * * * * 67 | * * ** ** *** * ** *** * * ** * * *** * * *** * ** * * * * * * * * **** * * ** * * * * ** 63 | * ** ** * * * * ***** * * ***** * ** *** *** * * ** ******** * * * * **** ** ***** ** * * * * * * * * **** * * * ** * * * * *** ** * *** 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** ******* ***** *** * **** **** ** *** ** ** **** * * *** ** ** ***** 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * ******* ***** ***** 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 46 |************************************************************************************************************************************************************************** ****** 42 |********************************************************************************************************************************************************************************* 37 |********************************************************************************************************************************************************************************* 33 |********************************************************************************************************************************************************************************* 29 |********************************************************************************************************************************************************************************* 25 |********************************************************************************************************************************************************************************* 21 |********************************************************************************************************************************************************************************* 16 |********************************************************************************************************************************************************************************* 12 |********************************************************************************************************************************************************************************* 8 |********************************************************************************************************************************************************************************* 4 |********************************************************************************************************************************************************************************* 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
84 | * 79 | *** 75 | ************ 71 | ************************ 67 | ****************************************************** 63 | ***************************************************************************************** 58 | ***************************************************************************************************************************************** 54 | ***************************************************************************************************************************************************************** 50 | **************************************************************************************************************************************************************************** 46 | ******************************************************************************************************************************************************************************** 42 |********************************************************************************************************************************************************************************* 37 |********************************************************************************************************************************************************************************* 33 |********************************************************************************************************************************************************************************* 29 |********************************************************************************************************************************************************************************* 25 |********************************************************************************************************************************************************************************* 21 |********************************************************************************************************************************************************************************* 16 |********************************************************************************************************************************************************************************* 12 |********************************************************************************************************************************************************************************* 8 |********************************************************************************************************************************************************************************* 4 |********************************************************************************************************************************************************************************* 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.