HTML to PDF, why so hard?

I’ve been testing out MediaWiki PDF export using PediaPress’s mwlib & mwlib.rl. This system uses a custom MediaWiki parser written in Python, which then calls out to a PDF generator library to assemble a pretty, printable PDF output file.

The PediaPress folks are responsive to bug reports, but in the long run I worry that this would be a difficult system to maintain. The alternate parser/renderer needs to reimplement not only MediaWiki’s core markup syntax, but support for every current and future parser or media format extension we roll out into production usage.

Something based on the XHTML we already generate would be the most future-proof export system. This could of course be HTML that’s geared specifically for print, say by including higher-resolution images and making use of vector versions of math and SVG more readily, among other things.

Ideally, we’d be able to use common open-source browser engines like Gecko or WebKit for this — engines we already know render our sites pretty well. Unfortunately there doesn’t yet seem to be a standard kit for using them to do headless print export.

I did some scouring around and found a few other HTML-to-PDF options, starting with those used by some MediaWiki extensions…

HTMLDoc

  • GPL/commercial dual-licence; C
  • Used by Pdf Book and Pdf Export extensions.
  • Seems to have absolutely ancient HTML support… no style sheets, no Asian text, etc…
  • Verdict: NO

dompdf

  • LGPL; PHP
  • Used by Pdf Export Dompdf extension.
  • DOM-based HTML & CSS to PDF converter written in PHP… Sounds relatively cute, but development seems to have fallen off in 2006 and support remains incomplete.
  • Verdict: NO

Googling about I stumbled upon some other fun…

Dynalivery Gecko

  • Commercial? Demo?
  • Online demo of an actual use of Gecko as an HTML-to-PDF print server! Seems to be some commercial thing, and the output quality indicates it’s a very old Gecko, with lots of printing bugs.
  • Neat to see it, though!
  • Verdict: NO

PrinceXML

  • Proprietary; server license $3800
  • Great quality and flexibility; this would be a great choice in the commercial world. :) They have some Wikipedia samples done with a custom two-column stylesheet which are quite attractive.
  • Not being open source, alas, is a killer here.
  • Verdict: NO

CSSToXSLFO

  • Public domain; Java
  • Converts XHTML+CSS2 to XSL-FO, which can then be rendered out to PDF using more open-source components. Seems under active development, last release in December 2007.
  • Might be pretty nice, but my last experience playing with XSL-FO via Apache FOP in 2005 or so was very painful, with lots of unsupported layout features.
  • Verdict: try me and see
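
For a sense of what that intermediate format looks like, here's a minimal sketch that builds a bare-bones XSL-FO document from a list of paragraphs using Python's standard library. The function name, page size, and spacing are my own assumptions for illustration; this isn't how CSSToXSLFO works internally, and a real conversion would also have to map the CSS styling onto FO properties.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return "{%s}%s" % (FO_NS, tag)


def paragraphs_to_fo(paragraphs):
    """Build a minimal XSL-FO document: one A4 page master, one block per paragraph."""
    root = ET.Element(fo("root"))

    # Page geometry lives in the layout master set (A4 is an assumption here).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-height": "297mm",
        "page-width": "210mm",
    })
    ET.SubElement(page, fo("region-body"))

    # Content flows into the body region of pages built from that master.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = text

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(paragraphs_to_fo(["HTML to PDF, why so hard?", "Because layout is hard."]))
```

The resulting XML is what you'd hand to an FO renderer such as Apache FOP to get the actual PDF.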

LuceneSearch is dead, long live MWSearch!

I’ve made a couple more cleanup fixes to the core search UI behavior, so namespace selections work more consistently, and have gone ahead and switched it in as the sole search interface on all Wikimedia wikis.

This means the LuceneSearch extension is officially obsolete. The MWSearch extension provides a back-end plugin for MediaWiki’s core search user interface, and all further front-end work should be done in core where it’ll benefit everybody.

Note that many Wikimedia sites have put in local JavaScript hacks to add extra external search options to the form; unfortunately they have used particular form IDs specific to the old, obsolete extension.

I took the liberty of adapting the English Wikipedia’s JS to work with either case.

Please feel free to pass that fix on to other wikis.

First production <video> tag support lands… without Ogg support

So, Apple pushed out Safari 3.1 for Mac and Windows today, which adds support for the HTML 5 <video> tag… unfortunately, without native Ogg support. :(

Fortunately, it uses QuickTime as the backend, so if you have the XiphQT plugins installed, it will play Ogg Vorbis and Theora files. Yay!

Filed three bugs for our video plugin detection on Safari…

Wikipedia search front-end updates

Ok, first things first: I’ve lifted the old restriction that disabled search context display for visitors who aren’t logged in. This makes searching much nicer!

The restriction was put in place in November 2005 as a temporary emergency measure to relieve pressure on the database servers. In the intervening time our capacity has caught up considerably, and switching context display back on caused only a modest bump in visible server load.

Looking forward, I’ve set it up so you can test how the new search plugin will look on Wikipedia once we’ve fully enabled it:

The new plugin uses MediaWiki’s core search UI. Right now the only “cool” thing in it is the image thumbnail display, but as we continue to improve it, it’s going to become a lot prettier for all MediaWiki users, not just for Wikipedia.

Update: The core search UI has been partially updated, and now more or less matches the style of the LuceneSearch plugin. Still some tweaks to do, and plenty of further improvements to make!

TODO: MediaWiki’s MySQL search backend

Some problems and solutions…

Problem 0: Wildcard searches don’t work

  • This was fixed in 1.12! You can search for “releas*” and match both “release” and “releasing” etc.
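
To make the intended behavior concrete, here's a tiny sketch of trailing-wildcard matching on single words. It only emulates the semantics described above; it is not how MySQL or MediaWiki implement it, and the function name is made up for illustration.

```python
def term_matches(term, word):
    """Return True if `word` matches `term`, where a trailing '*' means 'starts with'."""
    term = term.lower()
    word = word.lower()
    if term.endswith("*"):
        # "releas*" matches any word beginning with "releas".
        return word.startswith(term[:-1])
    # Without a wildcard, only the exact word matches.
    return word == term


if __name__ == "__main__":
    for candidate in ["release", "releasing", "relent"]:
        print(candidate, term_matches("releas*", candidate))
```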

Problem 1: Minimum length and stopwords

  • People don’t like when their searches can’t turn up their favorite acronyms and such
  • You can tweak the MySQL configuration… server-wide… if you have enough permissions on the server…

  • We can hack a transformation like we do for Unicode: append x00 or such to small words to force them to be indexed.
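
A rough sketch of that padding hack, assuming a minimum indexed word length of 4 (MySQL's default for `ft_min_word_len`) and using the literal marker "x00" from above. The function and constant names are hypothetical, and this only covers the length problem, not the stopword list.

```python
import re

MIN_WORD_LEN = 4   # assumed minimum word length the server will index
MARKER = "x00"     # padding suffix; any word-character string this long would do


def pad_short_words(text, min_len=MIN_WORD_LEN, marker=MARKER):
    """Append a marker to words shorter than min_len so the indexer keeps them."""
    def pad(match):
        word = match.group(0)
        return word + marker if len(word) < min_len else word

    return re.sub(r"\w+", pad, text)


if __name__ == "__main__":
    print(pad_short_words("the BBC and NASA"))
```

The same transformation would have to be applied to both the text going into the index and the user's query terms, otherwise the padded words would never match.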

Problem 2: The table crashes sometimes

  • People often get mystified when the searchindex table is marked crashed.
  • Catch the error: try a REPAIR TABLE transparently, and display a friendlier error if that fails.
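
The control flow I have in mind looks roughly like this. It's a sketch only: the exception classes and the two callables are stand-ins I've made up, and real code would need to detect the specific "table is marked as crashed" error from the database layer and issue the actual REPAIR TABLE.

```python
class TableCrashedError(Exception):
    """Stand-in for the database error raised when searchindex is marked crashed."""


class SearchUnavailableError(Exception):
    """Friendlier error intended for display to the user."""


def search_with_repair(run_search, repair_table):
    """Run a search; on a crashed table, try one repair and one retry."""
    try:
        return run_search()
    except TableCrashedError:
        pass  # fall through to the repair attempt

    try:
        repair_table()       # e.g. would issue REPAIR TABLE searchindex
        return run_search()  # retry once after the repair
    except Exception as exc:
        raise SearchUnavailableError(
            "Search is temporarily unavailable; please try again later."
        ) from exc
```

Only one repair attempt is made, so a table that keeps crashing ends up at the friendly error rather than in a loop.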

Problem 3: Separate title and text search results are ugly and hard to manage

  • People are used to Google-style searches where you just get one set of results which takes both title and body text into account.
  • Merge the title into the text index and return one set of results only.

Problem 4: Needs to join to ‘page’ table

  • The search does joins to the ‘page’ table to do namespace & redirect filtering and to return the original page title for result display. These joins can cause ugly slow locks, mixing up the InnoDB world with the MyISAM world.
  • Denormalize: add fields for namespace, original title, and redirect status to ‘searchindex’ table.
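
Here's a toy version of the denormalized layout, using an in-memory SQLite database so it runs anywhere. Everything about it is an assumption for illustration: the extra column names are invented, SQLite stands in for MySQL, and a plain LIKE stands in for the real full-text match. The point is just that filtering and display need no join once the fields live in the search table.

```python
import sqlite3


def build_demo_index():
    """Create a toy searchindex table carrying its own namespace/redirect/title fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE searchindex (
               si_page INTEGER PRIMARY KEY,
               si_namespace INTEGER NOT NULL,     -- hypothetical column
               si_is_redirect INTEGER NOT NULL,   -- hypothetical column
               si_display_title TEXT NOT NULL,    -- hypothetical column
               si_text TEXT NOT NULL              -- normalized title + body text
           )"""
    )
    rows = [
        (1, 0, 0, "PDF export", "pdf export generating printable pdf files"),
        (2, 0, 1, "PDF", "pdf redirect to pdf export"),
        (3, 1, 0, "Talk:PDF export", "pdf export discussion"),
    ]
    conn.executemany("INSERT INTO searchindex VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, word, namespace=0):
    """Find non-redirect pages in one namespace without touching a page table."""
    cursor = conn.execute(
        "SELECT si_display_title FROM searchindex "
        "WHERE si_namespace = ? AND si_is_redirect = 0 AND si_text LIKE ? "
        "ORDER BY si_display_title",
        (namespace, "%" + word + "%"),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(search(build_demo_index(), "pdf"))
```

The cost is that these fields have to be kept in sync whenever a page is moved or turned into a redirect.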

Vaporware on vaporware… Wikipedia local search on Android?

While Google’s Android mobile platform is still vaporware insofar as there are no products in people’s hands yet, there is an SDK already out, and under much freer terms than the iPhone’s. ;)

Like the iPhone, Android includes a very capable WebKit-based browser. I’ve updated our HawPedia-based mobile gateway to recognize both the iPhone SDK emulator and Android’s browser, so you get properly ‘mobile-sized’ output on them instead of it thinking they’re “desktop” browsers and wrapping the page with a simulated cell phone image….
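
The detection boils down to sniffing the User-Agent header. A minimal sketch of the idea follows; this is not the gateway's actual code, the function name is made up, and the sample User-Agent strings are illustrative rather than an exhaustive or authoritative list. It assumes both browsers mention "AppleWebKit" plus a platform token.

```python
# Tokens assumed to appear in the User-Agent of the mobile WebKit browsers.
MOBILE_WEBKIT_MARKERS = ("iPhone", "Android")

# Illustrative sample strings, not guaranteed to match any shipping build.
IPHONE_UA = ("Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ "
             "(KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3")
ANDROID_UA = ("Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ "
              "(KHTML, like Gecko) Safari/419.3")
DESKTOP_UA = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_2; en-us) "
              "AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13")


def wants_mobile_layout(user_agent):
    """Return True for WebKit browsers that identify as iPhone or Android."""
    if "AppleWebKit" not in user_agent:
        return False
    return any(marker in user_agent for marker in MOBILE_WEBKIT_MARKERS)


if __name__ == "__main__":
    for ua in (IPHONE_UA, ANDROID_UA, DESKTOP_UA):
        print(wants_mobile_layout(ua), ua[:40])
```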

Unlike the iPhone, Android apps will potentially run on a wide variety of devices with different capabilities… but for those able to determine their physical location, there will be a standard API for location-based services, so it should be possible to make an Android version of our yet-to-be-finalized location-based Wikipedia search as well. Neat!

Thumbs in Wikimedia Commons search

I’ve added thumbs for ‘Image:’ page hits on MediaWiki’s search results output.

This is now live on Wikimedia Commons as a test of migrating from our old LuceneSearch extension (which replaced the entire Special:Search page) to the MWSearch extension, which I hacked together a long time ago but never quite deployed. MWSearch uses MediaWiki’s own Special:Search front-end and provides a plugin to replace only the backend.

Using a common front-end will allow us to more easily maintain and improve the UI on the front-end without having to replicate functionality between at least two different implementations; and using the same front-end for Wikimedia sites and the default MediaWiki installation will improve the experience for all those third-party users.

Everybody wins!

Wikipedia local search?

So, the other day Apple finally got round to releasing the iPhone SDK… While there may still be issues with the licensing and distribution, I’m at least intrigued about the possibilities of location-based searches — using the device’s knowledge of its physical location to search for articles about nearby places.

For kicks, I spent a few hours of my spare time prototyping a location-based Wikipedia search, with the interface optimized for a device like the iPhone.

I’m using a copy of the Wikipedia-World database, which has pulled coordinate data out of articles and linked them up via interwiki links. Results within a range of roughly 20-30 km are sorted by distance and optionally filtered by text, then spit out in a list with little thumbnails.
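
The core of the lookup can be sketched in a few lines. This is not the prototype's actual code: I'm assuming a great-circle (haversine) distance, a 25 km default radius, a plain list of (title, latitude, longitude) tuples in place of the database, and approximate coordinates for the sample places.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(places, lat, lon, radius_km=25.0, text=None):
    """Return (title, distance) pairs within radius_km, nearest first."""
    hits = []
    for title, place_lat, place_lon in places:
        # Optional text filter: case-insensitive substring match on the title.
        if text and text.lower() not in title.lower():
            continue
        d = distance_km(lat, lon, place_lat, place_lon)
        if d <= radius_km:
            hits.append((d, title))
    hits.sort()
    return [(title, round(d, 1)) for d, title in hits]


if __name__ == "__main__":
    sample = [
        ("Golden Gate Bridge", 37.8199, -122.4783),
        ("Alcatraz Island", 37.8267, -122.4233),
        ("Los Angeles", 34.0522, -118.2437),
    ]
    print(nearby(sample, 37.7749, -122.4194))
```

A linear scan like this is fine for a demo; against the full coordinate table you'd want to narrow candidates with a bounding box on latitude and longitude first.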

Links are currently to the experimental WAP gateway, which tends to load rather faster over a slow connection and has nice big text. ;)

A small iPhone native app could pull up the device’s location, then pass it off to a web application like this (or present a similar list in native UI).

This does raise some questions to consider…

  • What are the privacy implications of getting people’s physical locations in addition to just their IP addresses? Would we have to update our privacy policy? How should we log or aggregate data for debugging and statistics purposes?
  • Do Wikimedia’s open-source software policies conflict with the various restrictions on software distribution for locked-down devices like the iPhone?

Update 2008-03-21: While we’re still waiting for Apple to let us into the beta program for official development, some enterprising individual has made an unofficial app for this which should install on jailbroken phones.