We used to have a lot of complaints about the random page feature on Wikipedia/MediaWiki picking some pages *very* frequently. After various tweaking it got improved enough that we don’t hear much about it, but it’s still not very even. To get an idea of how big the problem remains, I did a little testing of the distribution of selections on my local test database, which is pretty fragmented and weird with various imports and test pages and mass deletions in its history. :)
I did these tests by grabbing random page_id values, with enough runs to select each page 100 times given a perfectly even distribution.
The current system uses a special index field, page_random. The field contains a random number between 0 and 1.0 for each page; when selecting pages, we pick a random float and have the database grab the first valid page with an index greater than or equal to that number. Ideally, the distribution of page_random values would be perfect — certainly it’s better than the page_id distribution! — and we should get fairly even selections.
But it’s not perfect, as there are going to be gaps of differing sizes between entries, and entries with large gaps before them will be predisposed to get more hits. In my test db, the most-frequently picked pages are five times as likely to be selected as ideal, and other pages are very unlikely to come up.
(The first graph is sorted by page_id; the second by hits.)
513 | * 487 | * 461 | * 436 | * * 410 | ** * * 384 | ** * * 359 | ** * * 333 | **** * * * * * 307 | **** ** * * * * * * * * 282 | **** *** * * * * * * * ** * * 256 | **** *** * * ** * * * * * * ** * * * * * 230 | **** *** * * ** * * * * * * * ** * * * * * 205 | **** *** ** * ** * * * * * * ** * ** * * * * * * * 179 | **** *** ** * * ** * * * * * * * ** ** * ** * * * * * ** * * * * 153 | **** **** ** * * ** * ** * * * * * * ** ** * ** * * * * * ** * * * * 128 | **** **** ** * **** * ** ** * * ** * * ** ** * ** ** * * * * * * ** * ** * * * * * 102 |***** **** ** * **** **** **** * ** * **** ** * * **** ** *** * * * * * ** ** ** * ** * * * * ** 76 |********** ** * **** *********** ** ****** ** * *** **** **** *** * * * ** * ***** * * * * ** * ** * *** **** ** ** * * 51 |*********** ********************* **** *************** **** ****** *** * * ** ** ********** * * *** ** *** ** * *** ********* *** ** ** * * * 25 |************************************** *************** *************** ********* ********** **** *********** ****** ****************** ** *** * ** * ** * ***** * 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
513 | * 487 | * 461 | * 436 | ** 410 | **** 384 | **** 359 | **** 333 | ********* 307 | ************** 282 | ****************** 256 | ************************ 230 | ************************* 205 | ****************************** 179 | ************************************** 153 | ***************************************** 128 | *************************************************** 102 | ***************************************************************** 76 | *************************************************************************************** 51 | ******************************************************************************************************************* 25 | ************************************************************************************************************************************************ 0 +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
As an example using page_id instead of page_random, the distribution is *much* worse. ;)
13962 |* 13263 |* 12565 |* 11867 |* 11169 |* 10471 |* 9773 |* 9075 |* 8377 |* 7679 |* 6981 |* 6282 |* 5584 |* 4886 |* 4188 |* 3490 |* 2792 |* 2094 |* 1396 |* 698 |** 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
13962 | * 13263 | * 12565 | * 11867 | * 11169 | * 10471 | * 9773 | * 9075 | * 8377 | * 7679 | * 6981 | * 6282 | * 5584 | * 4886 | * 4188 | * 3490 | * 2792 | * 2094 | * 1396 | * 698 | ** 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
But we could instead keep picking random page_id values until we hit a valid page (matching the namespace and not-a-redirect criteria). This gives us a much more even distribution:
84 | * 79 | * * * 75 | * * * * * * * * * * * * 71 | * * * * * * * * ** * * * * * ** * * * * * * * 67 | * * ** ** *** * ** *** * * ** * * *** * * *** * ** * * * * * * * * **** * * ** * * * * ** 63 | * ** ** * * * * ***** * * ***** * ** *** *** * * ** ******** * * * * **** ** ***** ** * * * * * * * * **** * * * ** * * * * *** ** * *** 58 |** ********* ****** ******** ******** *** ** *** ****** * ************ *** * ************* ******* ** ******* ***** *** * **** **** ** *** ** ** **** * * *** ** ** ***** 54 |************ ****** ********************* ****** ****** ****************** *********************** ** ************************** ***************** ****** * ******* ***** ***** 50 |************************************************************************** *********************************************************************** ***************** ***** ***** 46 |************************************************************************************************************************************************************************** ****** 42 |********************************************************************************************************************************************************************************* 37 |********************************************************************************************************************************************************************************* 33 |********************************************************************************************************************************************************************************* 29 |********************************************************************************************************************************************************************************* 25 |********************************************************************************************************************************************************************************* 21 |********************************************************************************************************************************************************************************* 16 |********************************************************************************************************************************************************************************* 12 |********************************************************************************************************************************************************************************* 8 |********************************************************************************************************************************************************************************* 4 |********************************************************************************************************************************************************************************* 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
84 | * 79 | *** 75 | ************ 71 | ************************ 67 | ****************************************************** 63 | ***************************************************************************************** 58 | ***************************************************************************************************************************************** 54 | ***************************************************************************************************************************************************************** 50 | **************************************************************************************************************************************************************************** 46 | ******************************************************************************************************************************************************************************** 42 |********************************************************************************************************************************************************************************* 37 |********************************************************************************************************************************************************************************* 33 |********************************************************************************************************************************************************************************* 29 |********************************************************************************************************************************************************************************* 25 |********************************************************************************************************************************************************************************* 21 |********************************************************************************************************************************************************************************* 16 |********************************************************************************************************************************************************************************* 12 |********************************************************************************************************************************************************************************* 8 |********************************************************************************************************************************************************************************* 4 |********************************************************************************************************************************************************************************* 0 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The downside is that we might have to make a *lot* of dips into the pool, especially if we’re searching a minority namespace.
So how are the values in the random table assigned? Wouldn’t it be easy, just to assign new values (rebuild random table), and then all would be fine?
The values of page_random are assigned, well, randomly. ;) The problem isn’t that the *values* are bad, but that the technique will inherently give this kind of biased result.
You can visualize the problem by considering the pages not as randomly sorted points, but as randomly sized *spaces*. It’s really the gaps in front of the points, not the points, which are being selected.
The gaps between the points vary from large to small, depending on how close other page points are. Pages with larger spaces in front are more likely to be selected than those with smaller spaces in front of them… this leads to a biased selection, when what we really wanted was for each page to have an equal chance of selection.
An improved algo might grab a small list of pages over a random range, big enough to reduce the variance in those gap sizes, then pick evenly from that number.
The rebuild of the table column could be: value[row i] = i/N, where N is the total count of rows. How long would that update take? Is it at all practical for the larger wikis?
Very long, and no, not very practical. Ideally we want a system that doesn’t require a long, very expensive rebuild to be run periodically. :)