Wikipedia local search?

[Screenshot: iphone-search.png]

So, the other day Apple finally got round to releasing the iPhone SDK… While there may still be issues with the licensing and distribution, I’m at least intrigued by the possibilities of location-based searches — using the device’s knowledge of its physical location to search for articles about nearby places.

For kicks, I spent a few hours of my spare time prototyping a location-based Wikipedia search, with the interface optimized for a device like the iPhone.

I’m using a copy of the Wikipedia-World database, which has pulled coordinate data out of articles and linked them up via interwiki links. Results within roughly a 20–30 km range are sorted by distance and optionally filtered by text, then spit out in a list with little thumbnails.
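The core of a search like this can be sketched in a few lines of Python. This is purely illustrative: the radius cutoff, the in-memory article list, and the function names are all my own stand-ins, not the actual Wikipedia-World query, which would do the distance math in SQL.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(articles, lat, lon, radius_km=25.0, text=None):
    """Return (distance, title) pairs within radius_km of (lat, lon),
    optionally filtered by a text substring, sorted nearest-first."""
    hits = []
    for title, alat, alon in articles:
        if text and text.lower() not in title.lower():
            continue
        d = haversine_km(lat, lon, alat, alon)
        if d <= radius_km:
            hits.append((d, title))
    return sorted(hits)
```

A real deployment would of course push the bounding-box filter into the database index rather than scanning every row.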

Links are currently to the experimental WAP gateway, which tends to load rather faster over a slow connection and has nice big text. ;)

A small iPhone native app could pull up the device’s location, then pass it off to a web application like this (or present a similar list in native UI).

This does raise some questions to consider…

  • What are the privacy implications of getting people’s physical locations in addition to just their IP addresses? Would we have to update our privacy policy? How should we log or aggregate data for debugging and statistics purposes?
  • Do Wikimedia’s open-source software policies conflict with the various restrictions on software distribution for locked-down devices like the iPhone?

Update 2008-03-21: While we’re still waiting for Apple to let us into the beta program for official development, some enterprising individual has made an unofficial app for this which should install on jailbroken phones.

Google Summer of Code comin’ up!

It’s that time of year again… organization applications for Google’s Summer of Code are open.

The last couple of years we’ve had limited success with the SoC, in part because we’ve been so shorthanded on mentors that we can’t support more than one or two students. I’m looking for a few MediaWiki hackers who’d like to help out this time around…

You’ll need to be reasonably available by e-mail and IRC, and able to help answer the student’s questions and review their progress.

Part of the fun of Summer of Code projects is that we can get somebody excited and involved by working on something that’s big enough that it hasn’t been done yet, but small enough that they can make real progress and hopefully get something into production over the course of a couple of months.

The really important part is making sure they feel welcome, and are excited about continuing their involvement in MediaWiki development after they’re done… so let’s make everybody feel at home!

Wikipedia WAP portal updated

We’ve got a semi-experimental mobile portal for Wikipedia, based on the Hawpedia code (which uses Hawhaw), that’s been up for a while.

I’ve updated it to the current version of the code, which seems to handle some templates better, as well as producing proper output for iPhone viewing. :)

Today’s fancy phones with their fancy browsers (the iPhone, Opera Mini, etc) can do a pretty good job handling the “real web” in addition to the stripped-down, limited “mobile web” of yesteryear, but there are different pressures to take into account when targeting mobile devices.

Screens are small, bandwidth is low. Wikipedia articles tend to be very long and thorough, but often all you need for an off-the-cuff lookup is the first couple paragraphs. The WAP gateway splits pages into shorter chunks, so you don’t have to wait to download the entire rest of the page (or wait for the slow phone CPU to lay it out).
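The chunking idea can be sketched like this. It’s a toy: the real gateway splits rendered page content, and the 4 KB chunk size here is a number I made up for illustration.

```python
def split_into_chunks(paragraphs, max_bytes=4000):
    """Group paragraphs into chunks no larger than max_bytes each,
    so a slow phone only has to download (and lay out) one chunk
    at a time. A single oversized paragraph is kept whole."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        plen = len(para.encode("utf-8"))
        if current and size + plen > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += plen
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk would then be served as its own page, with a “next” link to fetch the following one.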

Even on an iPhone capable of rendering the whole article and the MonoBook skin in all its glory, I find there are some strong benefits to a shorter, cleaner page for quick lookups on the go. (Especially if I’m not on Wifi!)

The biggest problem with the Hawpedia gateway today is that it tries to implement its own hacky little wikitext parser, which dies horribly at times. Many pages look fine, but others end up with massive template breakage and become unreadable.

Long-term it may be better to do this translation at a higher level, working from the output XHTML… or else in an intermediate stage of MediaWiki’s own parser, with more semantic information still available.

Case-insensitive OpenSearch

I did some refactoring yesterday on the title prefix search suggestion backend, and added case-insensitive support as an extension.

The prefix search suggestions are currently used in a couple of less-visible places: the OpenSearch API interface, and the (disabled) AJAX search option.

The OpenSearch API can be used by various third-party tools, including the search bar in Firefox — in fact Wikipedia will be included by default as a search engine option in Firefox 3.0.

I’m also now using it to power the Wikipedia search backend for Apple’s Dictionary application in Mac OS X 10.5.

We currently have the built-in AJAX search disabled on Wikimedia sites, in part because the UI is a bit unusual, but it’d be great to have it more nicely integrated as a drop-down in the various places where you might be inputting page titles.

The new default backend code is in the PrefixIndex class, which is now shared between the OpenSearch and AJAX search front-ends. Like the previous code, it’s case-sensitive, using the existing title indexes. I’ve also now got both front-ends handling the Special: namespace (which only the AJAX search did previously) and returning results from the start of a namespace once you’ve typed as far as “User:” or “Image:” etc.

More excitingly, it’s now easy to swap out this backend with an extension by handling the PrefixSearchBackend hook.

I’ve made an implementation of this in the TitleKey extension, which maintains a table with a case-folded index to allow case-insensitive lookups. This lets you type in for instance “mother ther” and get results for “Mother Theresa”.
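The idea behind the case-folded index can be sketched with an in-memory stand-in for the TitleKey table. The class and method names here are mine, not the extension’s, and real case folding for all of Unicode is hairier than a simple `lower()`; this just shows the shape of the trick: store a folded copy of each title as the sort key, then do an ordinary range scan on a folded prefix.

```python
import bisect

class TitleKeyIndex:
    """Case-insensitive prefix lookup: keep (folded_key, title) pairs
    sorted by the folded key, then binary-search on a folded prefix —
    the same range-scan a database would do on the case-folded column."""

    def __init__(self, titles):
        self.rows = sorted((t.lower(), t) for t in titles)
        self.keys = [k for k, _ in self.rows]

    def prefix_search(self, prefix, limit=10):
        folded = prefix.lower()
        i = bisect.bisect_left(self.keys, folded)
        out = []
        while (i < len(self.rows)
               and self.keys[i].startswith(folded)
               and len(out) < limit):
            out.append(self.rows[i][1])  # return the original-case title
            i += 1
        return out
```

So typing “mother ther” folds to “mother ther”, which range-matches the stored key “mother theresa” and yields the original title back.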

In the future we’ll probably want to power this backend at Wikimedia sites from the Lucene search server, which I believe is getting prefix support re-added in enhanced form.

We might also consider merging the case-insensitive key field directly into the page table, but the separate table was quicker to deploy, and will be easier to scrap if/when we change it. :)

MediaWiki security bump

Did a security release yesterday: MediaWiki 1.11.1, 1.10.3, and 1.9.5. I noticed that some of the output formats for api.php are susceptible to HTML injection through a longstanding problem with Internet Explorer’s content-type autodetection.

We’ve had protection against this in action=raw mode for years, but it didn’t make it into the API as nobody had quite noticed that some output formats were vulnerable. JSON and XML-based formats aren’t, but PHP serialization and YAML don’t escape strings as much as we might like.

If the format lets you pass some raw HTML tags, and you can stick an additional fake path after the script name in the URL (as allowed by most configurations), MSIE opens up a big XSS hole on your site.

Path components in URLs are supposed to be opaque; the HTTP Content-Type header is the only thing that’s supposed to specify what kind of resource you’re loading. Microsoft thinks it knows better, though — if it recognizes one of several predefined “extensions” at the end of the “filename” on the URL, it sniffs the file’s actual content to try to determine the file type. If it sees certain HTML tags, it’ll interpret the file as HTML — even for valid GIF and PNG files!

(Rumor has it that last hole has finally been fixed in recent Windows security updates; GIF and PNG headers will override the HTML detection. I haven’t tested to confirm this, though.)

For “raw” and “API” type stuff where we have to pass through user data, we can protect against the autodetection by ensuring the URL hasn’t been tampered with. Having both an unrecognized URL and an unrecognized content-type keeps the content sniffer away… That’s why you currently get a ‘403 Forbidden’ if you just toss ?action=raw on the end of a page URL.
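The URL check can be sketched roughly like this — a toy version in Python, not MediaWiki’s actual PHP code, with a made-up script path:

```python
def allow_raw_output(request_uri, script_path="/w/api.php"):
    """Serve raw/API output only when the URL hasn't been padded with a
    fake filename after the script path. A legit client requests e.g.
    /w/api.php?action=query...; an attacker requests
    /w/api.php/evil.html?action=query... so that MSIE sees a ".html"
    "extension" and content-sniffs the response. If the path portion
    isn't exactly the script path, answer 403 instead."""
    path = request_uri.split("?", 1)[0]
    return path == script_path
```

Anything failing the check gets the ‘403 Forbidden’ treatment rather than a response MSIE might happily sniff as HTML.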

User-to-user mail SPF and privacy borkage

Per bug 12655

On our newer, Ubuntu-based Apache configuration we’ve been using sSMTP as a minimal local SMTP sending agent. This emulates the ‘sendmail’ binary and simply passes messages off to a hub server with no local queuing… but it’s not without its problems.

sSMTP forces the message’s ‘From’ header and the SMTP envelope sender address to be the same, which causes some problems for us when that ‘From’ address is a user’s offsite e-mail:

  • Servers that validate SPF records may reject the messages outright.
  • In case of delivery problems, bounce messages will be sent back to the user, possibly including the recipient’s address, which is supposed to be kept private.

As a workaround for such configurations I’ve introduced a config var $wgUserEmailUseReplyTo. When set, a wiki-specific address is used as ‘From’, and the user’s address is put in ‘Reply-To’.

This is uglier — you don’t see a clean ‘Sender’ column in your mail client — but mails will get through and private data won’t get tossed around inappropriately.
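The header juggling can be sketched with Python’s standard email library — illustrative only, since the real mailer is MediaWiki PHP code, and the function name and addresses below are made up:

```python
from email.message import EmailMessage

def build_user_mail(use_reply_to, user_addr, wiki_addr,
                    to_addr, subject, body):
    """Compose a user-to-user mail. With use_reply_to set, 'From' is
    the wiki's own address (matching the SMTP envelope sender, so SPF
    checks pass and bounces come back to the wiki rather than leaking
    the recipient's address), and the sending user's address moves to
    'Reply-To' so replies still reach them."""
    msg = EmailMessage()
    if use_reply_to:
        msg["From"] = wiki_addr
        msg["Reply-To"] = user_addr
    else:
        msg["From"] = user_addr
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

The Reply-To trick is a standard workaround when the envelope sender and the author address can’t be set independently.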

In the long term I’d like to see us either dump sSMTP (a local-only postfix or something would work fine) or patch it to let the envelope sender be set separately.

Mobile MediaWiki

I’d like to revive some interest in improving support for mobile browsers.

Extremely limited WAP-based browsers are at least sort of served by the experimental WAP gateway, but there are a lot of smartphones and other handheld devices that get on the “real” web with greater or lesser degrees of success, and I’d like to see us improve the default look & feel of MediaWiki on them.

At the moment I think we can roughly divide the mobile browsers into two categories:

  • Those that render much like a full desktop browser and let you zoom as necessary (iPhone/Mobile Safari, Opera Mini, …?)
  • Those that have very limited CSS and JavaScript support or strip a lot of stuff down (Opera Mini in “mobile view” mode, most others?)

At the moment, all I’ve got access to is an iPhone and the Opera Mini simulator applet, so that’s where I’ll be putting the occasional bit of time. These already pretty much “just work”, but the UI can be very awkward due to the desktop-size layout; I’d like a cleaner handheld stylesheet that makes most pages legible when you get to them.

If you’ve got another device and you’d like to help testing and developing for it, please stake your claim.

Alternatively if you’ve got a spare device you can donate to us, that’d be great too! (Especially if it doesn’t need a service subscription to get on the net…)

Wiki data dumps restarted

Maintenance is still pending on the old dump server… I’ve moved the files over to storage2, one of our backup servers, and restarted a couple of dump worker threads. Currently one of those is running on the old server, but it won’t be too fatal if it dies for now. :)