Mailman sucks

GNU Mailman kinda sucks, but like democracy I’ve yet to come across something better. ;)

A few things I’d really like to see improved:

  • Web archive links in footer

    Fairly often I read something on a list and then want to discuss it with other folks in chat. To point them at the same message I was reading, I have to pull up the web archives, then poke around to find it, then copy the link.

    In an ideal world, the message footer could include a link to the same message on the web archives, and I could just copy and paste.

  • Thread moderation

    It should be easy to place an out-of-control thread on moderation. I’ll be honest, I can’t figure out how to do it right now. There’s _spam_ filtering, but we discard that. There’s whole-list moderation. There’s per-user moderation. But how do you moderate a particular thread?

  • Throttling

    In some cases, a simple time-delay throttle can help calm things down without actually forcing a moderator to sit there and approve messages. It can feel “fairer” too, since you’re not singling out That One Guy Who Keeps Posting In That One Thread.

  • Easy archive excision

    On public mailing lists, people sometimes accidentally post private information (phone numbers in a signature, a private follow-up sent to the whole list, etc.), which they then ask to have removed from the archives. People can understandably get a bit worked up over privacy issues, particularly when the Google-juice of a wikimedia.org domain bumps the message to the first Google hit for their name. ;)

    Unfortunately it’s a huge pain in the ass to excise a message from the archives in Mailman. You have to shut down the mailing list service, edit a multi-megabyte text file containing all of that month’s messages (carefully, so you don’t disturb the message numbering), rebuild the HTML archive pages from the raw files, and then, finally, restart the Mailman service. That’s a lot of manual work and some service outage, balanced against having people scream at you, arguably justifiably. For now we’ve ended up simply disabling crawler access to the archives to keep them out of high-ranked global search indexes. (I know you disagree with this, Timwi. It’s okay, we still love you!)

    If we could strike a message from the archives in a one-touch operation, the way we can unsubscribe someone who can’t figure out the unsubscribe directions, we could switch those crawlers back on and make it easier to search the list. (A rough sketch of what such a tool might look like follows this list.)

  • Integrated archive search

    We’ve experimented with the htdig integration patch, but the search results are not terribly good and the indexing performance is too slow on our large archives. Even if we get Google etc going again, it’d be nice to have an integrated search that’s a little more polished.
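
As an aside, the core of a one-touch excision tool wouldn’t have to be complicated. Here’s a very rough sketch in PHP (not a real Mailman feature: the mbox handling is naive, and the path and rebuild step are my assumptions about a stock Mailman 2 setup):

<?php
// Hypothetical helper: drop one message from a list's raw mbox archive.
// The hard parts this glosses over: stopping/locking the list service,
// keeping the pipermail message numbering stable, and regenerating the
// HTML archive afterwards (if I remember right, Mailman's bin/arch tool
// can rebuild it from the mbox).
function exciseMessage( $mboxPath, $messageId ) {
    $text = file_get_contents( $mboxPath );

    // mbox messages each begin with a "From " line at the start of a line.
    $messages = preg_split( '/^(?=From )/m', $text );

    $kept = array();
    foreach ( $messages as $msg ) {
        // Naive header match; good enough for a sketch.
        if ( stripos( $msg, "Message-ID: <$messageId>" ) !== false ) {
            continue; // drop the offending message
        }
        $kept[] = $msg;
    }
    file_put_contents( $mboxPath, implode( '', $kept ) );
}

// e.g. exciseMessage( '/var/lib/mailman/archives/private/foo-l.mbox/foo-l.mbox',
//                     'some-message-id@example.com' );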

Distributed time zones

Working with a distributed team, such as Wikimedia’s tech team, has its advantages and disadvantages. One aspect that’s both irksome and useful is the range of time zones people live in.

In the early days, our time zone distribution looked roughly like this:

Classic Wikipedia admin timezones

With Tim in Australia, Mark and others in Europe, and me in California, our time zones were nearly evenly spaced. If we all worked the same hours (local 9-to-5s for June are marked above), we’d almost never be online at the same time. Of course we all worked irregular hours, so there tended to be some overlap.

For most of 2007, though, we’ve had something more like this:

Compressed timezones

Tim moved to England, I moved to Florida, and suddenly our time zones are much more compressed, with a much larger overlap.

On the one hand this is nice — we have more “face time” for real-time interaction in the chat channels.

On the other hand, this leaves a big portion of the day when none of the core tech team is “on duty”, which reduces our ability to respond quickly to crises. Luckily we’ve had far fewer problems this year, since we’ve fixed up a lot of old issues and our hardware capacity has generally stayed at or ahead of the growth curve.

For 2008 it looks like we’ll be going back to a more spread out team:

New timezones

Tim’s moving back to Australia, and I’ll be heading back to California when the Wikimedia Foundation sets up its new offices in the San Francisco Bay Area. We’ll also have Rob still active with the servers in Tampa, filling in some of the holes in coverage in the middle.

There’s some concern that this’ll reduce our ability to work directly with each other by IRC, but that’s not necessarily a bad thing. Relying too much on chat introduces problems of its own:

  • Those who aren’t constantly available online get marginalized…

    When important decisions are made in chat, you don’t get to participate if you dare to sleep, have a day job, go to class, have a life… :)

  • Records are poorer compared with a mailing list or wiki — not only did you miss the boat, you don’t get to see what the boat looked like. You may not even know there was a boat…

    We try to combat this by keeping a detailed server admin log and announcing details of big outages or updates on the lists.

Putting more emphasis on mailing list and wiki communication could make it easier to embrace new developers who can’t all be online at the same time… and paying more attention to our own wikis might help with dogfooding. ;)

Updated: Corrected Melbourne to Sydney in 2008 time zone map.

So you wanna be a MediaWiki coder?

Some easy bugs to cut your teeth on…

  • Bug 1600 – clean up accidental == header markup == in new sections. (Note — there’s an unrelated patch which got posted on this bug by mistake ages ago, just ignore it. :)
  • Bug 11389 – current diff views probably should clear watchlist update notifications generally, as they do for talk page notifications.
  • Bug 11380 – the ‘Go’ search shortcut needs some namespace option lovin’…

Or maybe you’d prefer to clean up an old patch and get it ready to go?

  • Bug 900 – Fix category column spacing. Since letter headers take up more space than individual lines, we get oddly balanced columns if some letters are better represented than others…
    Age: 2 years, 7 months. Ouch! :)
    Patch status: Applied with only minor cleanup; this function hasn’t changed much! There seems to be something wrong with the algorithm, though: while the columns balance a bit better, I sometimes see items dropped off the end of the list. Needs more work.
  • Bug 1433 – HTML meta info for interlanguage links.
    Age: Two years, seven months.
    Patch status: Applied after minor changes, but it doesn’t seem compatible; I’ve provided an alternate version which seems to work with SeaMonkey and Lynx. Is this an appropriate thing to add, and how do we i18nize the link text?

rsync 3.0 crashy :( [fixed!]

rsync 3.0 may rock, but it’s also kinda crashy. :(

It’s died a couple of times while syncing up our ~2TB file upload backup. I’ve attached gdb to the running processes to try to at least get some backtraces the next time it goes.

Update: Got a backtrace…

#0  0x00002b145663886b in free () from /lib/libc.so.6
#1  0x000000000043277b in pool_free_old (p=0x578730, addr=) at lib/pool_alloc.c:266
#2  0x0000000000404374 in flist_free (flist=0x89e960) at flist.c:2239
#3  0x000000000040ddbc in check_for_finished_files (itemizing=1, code=FLOG, check_redo=0) at generator.c:1851
#4  0x000000000040e189 in generate_files (f_out=4, local_name=) at generator.c:1960
#5  0x000000000041753c in do_recv (f_in=5, f_out=4, local_name=0x0) at main.c:774
#6  0x00000000004177ac in client_run (f_in=5, f_out=, pid=25539, argc=, 
    argv=0x56cf98) at main.c:1021
#7  0x0000000000418706 in main (argc=2, argv=0x56cf90) at main.c:1185

Looks like others may have seen it in the wild but a fix doesn’t seem to be around yet. Some sort of bug in the extent allocation pool freeing changes done in May 2007, I think.

Found and patched a probably unrelated bug in pool_alloc.c.

Further update, the next day:

The crashy bug should be fixed now. Yay!

PHP 5.2.4 error reporting changes

Noticed a couple of neat bits while combing through the changelogs for the PHP 5.2.4 release candidate…

  • Changed “display_errors” php.ini option to accept “stderr” as value which makes the error messages to be outputted to STDERR instead of STDOUT with CGI and CLI SAPIs (FR #22839). (Jani)

This warms the cockles of my heart! We do a lot of command-line maintenance scripts for MediaWiki, and it’s rather annoying to have error output spew to stdout by default. Being able to direct it to stderr, where it won’t interfere with the main output stream, should be very nice.
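
If I’m reading it right, a maintenance script can keep its real data on stdout and its complaints on stderr with something like this (an untested sketch; the stderr value only applies under the CLI and CGI SAPIs, per the changelog entry, and the script name is made up):

<?php
// Sketch: direct displayed errors to STDERR so that redirecting stdout
// (e.g. `php dumpSomething.php > dump.txt`) captures only the real output.
// Requires PHP 5.2.4+ with the CLI or CGI SAPI; could equally be set via
// php.ini or the -d command-line switch.
ini_set( 'display_errors', 'stderr' );
error_reporting( E_ALL );

echo "useful output line\n";                                // goes to stdout
trigger_error( 'something odd happened', E_USER_WARNING );  // goes to stderr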

  • Changed error handler to send HTTP 500 instead of blank page on PHP errors. (Dmitry, Andrei Nigmatulin)

This in theory should give nicer results for when the software appears to *just die* for no reason — with display_errors off, if you hit a PHP fatal error the code just stops and nothing else gets output. In an app that does its processing before any output, the result is a blank page with no cues to the user as to what happened.

Unfortunately it looks like it’s only going to be a help to machine processing, and even then only for the blank-page case. :(

In my quick testing, I only get the 500 error when no output has been done yet… and it *still* returns a blank body; it just comes with the 500 result code.

The plus side is this should keep blank errors out of Google and other search indexes; the minus side is it won’t help with fatal errors that come in the middle of output, or the rather common case of sites which leave display_errors on… because then the error message gets output, so you don’t get a 500 result code.
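
For reference, this is roughly the kind of quick test I mean: a hypothetical test.php with display_errors off and a fatal error triggered before any output.

<?php
// With display_errors off and a fatal error hit before any output,
// PHP 5.2.4 now sends a 500 status code... but the body is still empty.
ini_set( 'display_errors', '0' );

// Uncomment the next line to see the other case: once output has started
// the headers are already sent, so the partial page goes out with a 200.
//echo "Some page content...\n";

no_such_function();  // fatal error: call to undefined function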

rsync 3.0 rocks!

Wikimedia’s public image and media file uploads archive has been growing and growing and growing over the years, nowadays easily eating 1.5 TB or so.

This has made it harder to provide publicly downloadable copies, as well as to maintain our own internal backup copies — and not having backups in a hurricane zone is considered bad form.

In the terabyte range, making a giant tar archive is kind of… difficult. Not only is it insanely hard to download the whole thing if you want it, but it multiplies our space requirements — you need space for every complete and variant archive as well as all the original files. Plus it just takes forever to make them.

rsync seems like a natural fit for updating and then synchronizing a large set of medium-size files, but as we’ve grown it became slower and slower and slower.

The problem was that rsync worked in two steps:

First, it scanned the directory trees to list all the files it might have to transfer.

Then, once that was done, it transferred the ones which needed to be updated.

Great for a few thousand files — rotten for a few million! On an internal test, I found that after a full day the rsync process was still transferring nothing but filenames — and its in-memory file list was using 2.6 *gigabytes* of RAM.

Not ideal. :)

Searching around, I stumbled upon the interesting fact that the upcoming rsync 3.0 does “incremental recursion” — that is, it does that little “list, then transfer” cycle for each individual directory instead of for the entire file set at once.
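
To illustrate the difference, here’s a quick sketch of the two strategies (in PHP purely for familiarity; this is not rsync’s actual code):

<?php
// Old approach: recurse the whole tree up front and keep every path in
// memory before transferring anything. For millions of files the list
// itself eats gigabytes.
function buildFullFileList( $dir, &$list ) {
    foreach ( scandir( $dir ) as $name ) {
        if ( $name == '.' || $name == '..' ) {
            continue;
        }
        $path = "$dir/$name";
        if ( is_dir( $path ) ) {
            buildFullFileList( $path, $list );
        } else {
            $list[] = $path;
        }
    }
}

// Incremental recursion: list one directory, hand its files off for
// transfer, then descend. Memory use stays proportional to a single
// directory (plus recursion depth), not to the whole tree.
function syncIncrementally( $dir, $transfer ) {
    $subdirs = array();
    foreach ( scandir( $dir ) as $name ) {
        if ( $name == '.' || $name == '..' ) {
            continue;
        }
        $path = "$dir/$name";
        if ( is_dir( $path ) ) {
            $subdirs[] = $path;
        } else {
            call_user_func( $transfer, $path ); // transfer this file now
        }
    }
    foreach ( $subdirs as $sub ) {
        syncIncrementally( $sub, $transfer );
    }
}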

I grabbed a development tree from CVS, compiled it up, and gave it a try — within seconds I started to see files actually being transferred.

Hallelujah! rsync works again…

We’re now in the process of getting our internal backup synced up again, and we’ll see about getting an off-site backup and maybe a public rsync 3.0 server set up.

Some installer tasks

Been poking about at MediaWiki, but not sure what to do? Here are a few tasks that would help with some common problems for third-party users:

  • 1379: Install can’t find config/index.php
    Some hosting services put a “control panel” of some sort at the “/config” URL, making it difficult to get at the MediaWiki installer. Renaming this to something more unique, and providing a compatibility link for convenience, would not be very hard but would help people stuck on this sort of host.
  • 9954: Detect “extra whitespace” / BOM conditions
    PHP is very picky about extra whitespace at the start and end of source files. Unfortunately it’s not uncommon for people to end up with extra blank lines or a hidden Unicode BOM sequence at the start or end of files they’ve customized. This leads to weird, hard-to-diagnose problems like cookies not getting set or RSS feeds that break with little explanation. Some software support to detect this and report which file is broken (and how to fix it!) would be very helpful; there’s a rough sketch of such a check after this list.
  • 10387: Detect and handle ‘.php5’ extension environments
    More and more hosting services are providing PHP 5.x, but some are putting it alongside existing PHP 4.x services, requiring that files be named with a .php5 extension. With a little care, the installer could detect this out of the box and set things up to work on such systems with few problems.
    Update 2007-06-28: Edward Z. Yang whipped up a good patch for this, which I’ve committed to trunk.
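
On the whitespace/BOM front (bug 9954 above), the check itself doesn’t need to be fancy. Here’s a rough sketch of the sort of thing I have in mind; it’s not actual MediaWiki code, and the messages are just for illustration:

<?php
// Look for the classic "mystery output" culprits in a customized file
// such as LocalSettings.php: a UTF-8 BOM, whitespace before the opening
// <?php tag, or stray text and blank lines after the final closing tag.
function checkForStrayOutput( $filename ) {
    $text = file_get_contents( $filename );
    $problems = array();

    if ( substr( $text, 0, 3 ) === "\xef\xbb\xbf" ) {
        $problems[] = 'Unicode BOM at the start of the file';
    }
    if ( preg_match( '/^\s/', $text ) ) {
        $problems[] = 'whitespace before the opening <?php tag';
    }
    $pos = strrpos( $text, '?>' );
    if ( $pos !== false ) {
        $tail = substr( $text, $pos + 2 );
        // Rough heuristic: anything beyond a single trailing newline.
        if ( trim( $tail ) !== '' || strlen( $tail ) > 2 ) {
            $problems[] = 'extra text or blank lines after the closing ?> tag';
        }
    }
    return $problems; // report these with the filename and a how-to-fix hint
}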

Wikimedia page views

For the curious, here are some statistics I whipped up the other day from our sampled logs. These are counts of plain article page views on all of Wikimedia’s wikis, from about three days’ worth of a 1/1000-sampled log. Images, style sheets, edit actions, diffs, special pages, etc. are excluded.

I’ve broken it down by referrer, with some rough groupings. It’s not very scientific, but some might find it interesting. :)

We should be able to process similar statistics on an automated basis from the log server now that we’ve got it set up, including breakdowns by site and by language.

Referrer          Samples  Percentage  Daily extrapolation  Monthly extrapolation
total               71873    100.0000          233,610,232          7,008,306,975
in-wiki link        31512     43.8440          102,424,076          3,072,722,293
Google              17102     23.7947           55,586,969          1,667,609,059
no referrer         11873     16.5194           38,591,047          1,157,731,397
IE7 gadget           3862      5.3734           12,552,735            376,582,048
other Wikimedia      2174      3.0248            7,066,195            211,985,855
in-wiki search       2140      2.9775            6,955,684            208,670,529
Yahoo                1507      2.0968            4,898,232            146,946,957
other                1162      1.6167            3,776,872            113,306,147
MSN                   208      0.2894              676,067             20,281,995
Live                  189      0.2630              614,310             18,429,313
AOL                    84      0.1169              273,027              8,190,806
Ask                    39      0.0543              126,762              3,802,874
AltaVista              21      0.0292               68,257              2,047,701

Wikimedia in Google Summer of Code

Wikimedia’s been accepted as a mentoring organization for the 2007 Google Summer of Code program.

Here’s our organization page, and I put up an initial project list on meta.

The list is semi-protected so it won’t be too vandalized ;) but additional suggestions are welcome. I’d like to ask that people who aren’t directly involved in development not add too much to the main page directly, though; last year we ended up with lots of project submissions for things that weren’t really considered high priority, so I’d like to keep the list a little more ordered this time.

We don’t know for sure how many projects we’ll get assigned, so we’ll see. :) At least Tim and I will serve as mentors for the student projects; if a couple more experienced developers would like to help out with that too that would be super.

Last year’s projects went really well up to the public demo stage but never quite got integrated into the mainline; I’m hoping that this year we can stick with projects that will be easier to slip in and take live much earlier in the process, which should help keep the students interested and the projects active.