Vaporware on vaporware… Wikipedia local search on Android?

While Google’s Android mobile platform is still vaporware insofar as there are no products in people’s hands yet, there is an SDK already out, and under much freer terms than the iPhone’s. ;)

Like the iPhone, Android includes a very capable WebKit-based browser. I’ve updated our Hawpedia-based mobile gateway to recognize both the iPhone and the Android SDK emulator’s browser, so you get properly ‘mobile-sized’ output on them instead of the gateway thinking they’re “desktop” browsers and wrapping the page with a simulated cell phone image…
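
As a rough illustration, the User-Agent sniffing for this kind of gateway boils down to a couple of substring checks. This is a hypothetical sketch, not the gateway’s actual detection code; the patterns and sample strings are illustrative only:

```python
import re

# Illustrative mobile-browser detection; the real gateway's
# patterns and logic may well differ.
MOBILE_PATTERNS = [r"iPhone", r"Android"]

def is_mobile_browser(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a mobile WebKit browser."""
    return any(re.search(p, user_agent) for p in MOBILE_PATTERNS)

# Abbreviated example User-Agent strings:
print(is_mobile_browser("Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420"))
print(is_mobile_browser("Mozilla/5.0 (Linux; U; Android 1.0) AppleWebKit/525"))
print(is_mobile_browser("Mozilla/5.0 (Windows NT 5.1) Firefox/2.0"))
```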

Unlike the iPhone, Android apps will potentially run on a wide variety of devices with different capabilities… but for those able to determine their physical location, there will be a standard API for location-based services, so it should be possible to make an Android version of our yet-to-be-finalized location-based Wikipedia search as well. Neat!

Thumbs in Wikimedia Commons search

I’ve added thumbs for ‘Image:’ page hits on MediaWiki’s search results output.

This is now live on Wikimedia Commons, testing migration from our old LuceneSearch extension (which replaced the entire Special:Search page) to the MWSearch extension I hacked together a long time ago but never quite deployed… this uses MediaWiki’s own Special:Search front-end, providing a plugin to replace only the backend.

Using a common front-end will let us maintain and improve the UI more easily, without having to replicate functionality across at least two different implementations; and using the same front-end for Wikimedia sites and the default MediaWiki installation will improve the experience for all those third-party users.

Everybody wins!

Wikipedia local search?

So, the other day Apple finally got round to releasing the iPhone SDK… While there may still be issues with the licensing and distribution, I’m at least intrigued about the possibilities of location-based searches — using the device’s knowledge of its physical location to search for articles about nearby places.

For kicks, I spent a few hours of my spare time prototyping a location-based Wikipedia search, with the interface optimized for a device like the iPhone.

I’m using a copy of the Wikipedia-World database, which has pulled coordinate data out of articles and linked them up via interwiki links. Results within roughly a 20-30 km range are sorted by distance and optionally filtered by text, then spit out in a list with little thumbnails.
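
The distance-sorted lookup is conceptually simple. Here’s a toy sketch of the idea (not the actual prototype code), using the haversine formula over an in-memory list of coordinate-tagged titles; the article list and query point are made up for the example:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby_articles(articles, lat, lon, radius_km=25, text=None):
    """Titles within radius_km of (lat, lon), optionally filtered by a
    text substring, sorted nearest-first."""
    hits = []
    for title, alat, alon in articles:
        if text and text.lower() not in title.lower():
            continue
        d = haversine_km(lat, lon, alat, alon)
        if d <= radius_km:
            hits.append((d, title))
    return [t for _, t in sorted(hits)]

articles = [
    ("Golden Gate Bridge", 37.8199, -122.4783),
    ("Alcatraz Island", 37.8267, -122.4230),
    ("Los Angeles City Hall", 34.0536, -118.2430),
]
print(nearby_articles(articles, 37.7793, -122.4192))
```

In practice the real query would run against the database with an index on the coordinates rather than scanning everything, but the sorting and filtering work out the same way.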

Links are currently to the experimental WAP gateway, which tends to load rather faster over a slow connection and has nice big text. ;)

A small iPhone native app could pull up the device’s location, then pass it off to a web application like this (or present a similar list in native UI).

This does raise some questions to consider…

  • What are the privacy implications of getting people’s physical locations in addition to just their IP addresses? Would we have to update our privacy policy? How should we log or aggregate data for debugging and statistics purposes?
  • Do Wikimedia’s open-source software policies conflict with the various restrictions on software distribution for locked-down devices like the iPhone?

Update 2008-03-21: While we’re still waiting for Apple to let us into the beta program for official development, some enterprising individual has made an unofficial app for this which should install on jailbroken phones.

Google Summer of Code comin’ up!

It’s that time of year again… organization applications for Google’s Summer of Code are open.

The last couple of years we’ve had limited success with the SoC, in part because we’ve been so shorthanded on mentors that we can’t support more than one or two students. I’m looking for a few MediaWiki hackers who’d like to help out this time around…

You’ll need to be reasonably available by e-mail and IRC, and able to help answer the student’s questions and review their progress.

Part of the fun of Summer of Code projects is that we can get somebody excited and involved by working on something big enough that it hasn’t gotten done yet, but small enough that they can make real progress and hopefully get something into production over the course of a couple of months.

The really important part is making sure they feel welcome, and are excited about continuing their involvement in MediaWiki development after they’re done… so let’s make everybody feel at home!

Super Tuesday!

California’s primary election comes up in the morning, as do those of a buttload of other states. These combine selection of the various per-party presidential candidates in preparation for the November election with various vital local and state ballot measures — parks, cops, and of course Indian gaming agreements.

Unlike everybody else with a blog, I’m not going to presume to tell y’all who to vote for. :)

But I have to admit I’ve been pleasantly surprised poking about Obama’s web site. I stumbled on this speech he gave on religion in politics, which is probably the first thing a mainstream American politician’s said about religion that hasn’t made me cringe and want to run away to Canada.

Fun election fact: California has a “modified open primary”, allowing voters who haven’t registered a party affiliation to cast their votes in the primary nomination process for a party of their choice… but only among those parties which have opted into it. We briefly had a completely open primary (so you could pick *any* party), but this got shut down on constitutional issues. Currently only the American Independent and Democratic parties are opted in to the system.

Wikipedia WAP portal updated

We’ve got a semi-experimental mobile portal for Wikipedia, based on the Hawpedia code using Hawhaw, that’s been up for a while.

I’ve updated it to the current version of the code, which seems to handle some templates better, as well as producing proper output for iPhone viewing. :)

Today’s fancy phones with their fancy browsers (the iPhone, Opera Mini, etc.) can do a pretty good job handling the “real web” in addition to the stripped-down, limited “mobile web” of yesteryear, but there are different pressures one should take into account when targeting mobile devices.

Screens are small, bandwidth is low. Wikipedia articles tend to be very long and thorough, but often all you need for an off-the-cuff lookup is the first couple paragraphs. The WAP gateway splits pages into shorter chunks, so you don’t have to wait to download the entire rest of the page (or wait for the slow phone CPU to lay it out).
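
The chunking idea itself is simple; here’s a toy sketch of the general approach, packing paragraphs greedily into size-limited chunks. (The actual gateway splits on page structure rather than raw character counts; this is just to show the shape of it.)

```python
def split_into_chunks(paragraphs, max_chars=2000):
    """Greedily pack paragraphs into chunks of roughly max_chars,
    so a slow phone only downloads one chunk at a time."""
    chunks, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Five 900-character "paragraphs" pack into three chunks:
print(len(split_into_chunks(["x" * 900] * 5)))
```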

Even on an iPhone capable of rendering the whole article and the MonoBook skin in all its glory, I find there are some strong benefits to a shorter, cleaner page for quick lookups on the go. (Especially if I’m not on Wi-Fi!)

The biggest problem with the Hawpedia gateway today is that it tries to do its own hacky little wiki text parser, which dies horribly at times. Many pages look fine, but others end up with massive template breakage and become unreadable.

Long-term it may be better to do this translation at a higher level, working from the output XHTML… or else in an intermediate stage of MediaWiki’s own parser, with more semantic information still available.

Case-insensitive OpenSearch

I did some refactoring yesterday on the title prefix search suggestion backend, and added case-insensitive support as an extension.

The prefix search suggestions are currently used in a couple of less-visible places: the OpenSearch API interface, and the (disabled) AJAX search option.

The OpenSearch API can be used by various third-party tools, including the search bar in Firefox — in fact Wikipedia will be included by default as a search engine option in Firefox 3.0.
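
For the curious, the OpenSearch suggestions wire format is just a small JSON array: the first element echoes the query, the second lists completions (richer responses can add description and URL arrays as well). A minimal sketch of producing one, with made-up example data:

```python
import json

def opensearch_response(query, titles):
    """Build a minimal OpenSearch suggestions response:
    [query, [title, title, ...]]."""
    return json.dumps([query, titles])

print(opensearch_response("san fr", ["San Francisco", "San Francisco Bay"]))
```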

I’m also now using it to power the Wikipedia search backend for Apple’s Dictionary application in Mac OS X 10.5.

We currently have the built-in AJAX search disabled on Wikimedia sites in part because the UI is a bit unusual, but it’d be great to have it more nicely integrated as a drop-down in the various places where you might be inputting page titles.

The new default backend code is in the PrefixIndex class, which is now shared between the OpenSearch and AJAX search front-ends. This, like the previous code, is case-sensitive, using the existing title indexes. I’ve also got both of them handling the Special: namespace (which only AJAX search did previously) and returning results from the start of a namespace once you’ve typed as far as “User:” or “Image:”, etc.
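
The namespace handling is mostly a matter of splitting the typed text before matching. Here’s a hypothetical illustration of the idea (the function name and trimmed-down namespace list are made up; this is not the actual PrefixIndex code):

```python
NAMESPACES = {"User", "Image", "Special", "Talk"}

def split_prefix(typed):
    """Split typed input into (namespace, remainder). An unrecognized
    prefix falls through to the main namespace; an empty remainder
    (the user has typed just "User:") matches every title in that
    namespace from the start."""
    if ":" in typed:
        ns, rest = typed.split(":", 1)
        if ns in NAMESPACES:
            return ns, rest
    return "", typed

print(split_prefix("User:Bri"))
print(split_prefix("User:"))
print(split_prefix("Main Page"))
```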

More excitingly, it’s now easy to swap out this backend with an extension by handling the PrefixSearchBackend hook.

I’ve made an implementation of this in the TitleKey extension, which maintains a table with a case-folded index to allow case-insensitive lookups. This lets you type in for instance “mother ther” and get results for “Mother Theresa”.
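
The core trick is to index a case-folded copy of each title alongside the original, then fold the query the same way before doing the range scan. A minimal in-memory sketch of the idea (not the extension’s actual schema or code):

```python
import bisect

def fold_case(title):
    """Case-fold a title to build the lookup key."""
    return title.lower()

class TitleKeyIndex:
    def __init__(self, titles):
        # (folded key, original title) pairs, sorted by key --
        # analogous to a database index on the folded column.
        self.entries = sorted((fold_case(t), t) for t in titles)

    def prefix_search(self, query, limit=10):
        key = fold_case(query)
        i = bisect.bisect_left(self.entries, (key,))
        out = []
        while (i < len(self.entries)
               and self.entries[i][0].startswith(key)
               and len(out) < limit):
            out.append(self.entries[i][1])
            i += 1
        return out

idx = TitleKeyIndex(["Mother Theresa", "Mothra", "Moth"])
print(idx.prefix_search("mother ther"))
```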

In the future we’ll probably want to power this backend at Wikimedia sites from the Lucene search server, which I believe is getting prefix support re-added in enhanced form.

We might also consider merging the case-insensitive key field directly into the page table, but the separate table was quicker to deploy, and will be easier to scrap if/when we change it. :)

MediaWiki security bump

Did a security release yesterday: MediaWiki 1.11.1, 1.10.3, and 1.9.5. I noticed that some of the output formats for api.php are susceptible to HTML injection through a longstanding problem with Internet Explorer’s content-type autodetection.

We’ve had protection against this in action=raw mode for years, but it didn’t make it into the API as nobody had quite noticed that some output formats were vulnerable. JSON and XML-based formats aren’t, but PHP serialization and YAML don’t escape strings as much as we might like.

If the format lets you pass some raw HTML tags, and you can stick an additional fake path after the script name in the URL (as allowed by most configurations), MSIE opens up a big XSS hole on your site.

Path components in URLs are supposed to be opaque; the HTTP content-type header is the only thing that’s supposed to specify what kind of resource you’re loading. Microsoft thinks it knows better, though — if it recognizes one of several pre-defined “extensions” at the end of the “filename” in the URL, it sniffs the file’s actual content to try to determine the file type. If it sees certain HTML tags, it’ll interpret the file as HTML — even for valid GIF and PNG files!

(Rumor is that last hole has finally been fixed in recent Windows security updates; GIF and PNG headers will override the HTML detection. I haven’t tested to confirm this though.)

For “raw” and “API” type stuff where we have to pass through user data, we can protect against the autodetection by ensuring the URL hasn’t been tampered with. Having both an unrecognized URL and an unrecognized content-type keeps the content sniffer at bay… That’s why you currently get a ‘403 – Forbidden’ if you just toss ?action=raw on the end of a page URL.
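
The check itself is about as simple as defenses get: refuse to serve the sensitive response unless the request path is exactly the script path, with nothing bolted on after it. A hypothetical sketch of the idea (paths invented; not MediaWiki’s actual code):

```python
def is_safe_raw_request(script_path, request_path):
    """Reject requests with extra path components appended after the
    script name (e.g. /w/index.php/evil.html?action=raw), since MSIE
    may sniff the fake "extension" and render the response as HTML."""
    return request_path == script_path

print(is_safe_raw_request("/w/index.php", "/w/index.php"))
print(is_safe_raw_request("/w/index.php", "/w/index.php/evil.html"))
```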