HTML to PDF, why so hard?

I’ve been testing out MediaWiki PDF export using PediaPress’s mwlib & mwlib.rl. This system uses a custom MediaWiki parser written in Python, which then calls out to a PDF generator library to assemble a pretty, printable PDF output file.

The PediaPress folks are responsive to bug reports, but in the long run I worry that this would be a difficult system to maintain. The alternate parser/renderer needs to reimplement not only MediaWiki’s core markup syntax, but also every current and future parser or media-format extension we roll out into production.

Something based on the XHTML we already generate would be the most future-proof export system. This could of course be HTML that’s geared specifically for print, say by including higher-resolution images and by using vector versions of math and SVG graphics where they’re available, among other things.
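
To make the "higher-resolution images" idea a bit more concrete, here is a rough Python sketch of one possible post-processing step on the generated HTML. It assumes thumbnail URLs follow the usual MediaWiki shape of .../thumb/<dirs>/<name>/<width>px-<name>, and it assumes the server will happily render the larger size on request. The function name upscale_thumbnails and the default factor of 3 are made up for illustration; this is not something any of the tools below actually do.

```python
import re

# Assumption: thumbnail URLs look like
#   /images/thumb/a/ab/Foo.jpg/220px-Foo.jpg
# i.e. a "/thumb/" path segment followed later by "<width>px-<name>".
THUMB_WIDTH = re.compile(r"(/thumb/[^\"'\s]*/)(\d+)px-")


def upscale_thumbnails(html: str, factor: int = 3) -> str:
    """Multiply the pixel width embedded in each thumbnail URL by `factor`.

    Only the URL is rewritten; width/height attributes are left alone so
    the image keeps its layout size and simply carries more pixels.
    """
    def bump(match: re.Match) -> str:
        prefix, width = match.group(1), int(match.group(2))
        return f"{prefix}{width * factor}px-"

    return THUMB_WIDTH.sub(bump, html)


if __name__ == "__main__":
    sample = '<img src="/images/thumb/a/ab/Foo.jpg/220px-Foo.jpg" width="220">'
    print(upscale_thumbnails(sample))
```

A real version would want to cap the requested width at the original image's size and handle srcset-style attributes, but the point is that this kind of tweak works on the HTML we already have, with no second parser involved.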

Ideally, we’d be able to use common open-source browser engines like Gecko or WebKit for this — engines we already know render our sites pretty well. Unfortunately there doesn’t yet seem to be a standard kit for using them to do headless print export.

I did some scouring around and found a few other HTML-to-PDF options, starting with those used by some MediaWiki extensions…

HTMLDoc

  • GPL/commercial dual-license; C
  • Used by Pdf Book and Pdf Export extensions.
  • Seems to have absolutely ancient HTML support… no style sheets, no Asian text, etc…
  • Verdict: NO

dompdf

  • LGPL; PHP
  • Used by Pdf Export Dompdf extension.
  • DOM-based HTML & CSS to PDF converter written in PHP… Sounds relatively cute, but development seems to have fallen off in 2006 and its HTML and CSS support remains incomplete.
  • Verdict: NO

Googling about, I stumbled upon some other fun options…

Dynalivery Gecko

  • Commercial? Demo?
  • Online demo of an actual use of Gecko as an HTML-to-PDF print server! Seems to be some commercial thing, and the output quality indicates it’s a very old Gecko, with lots of printing bugs.
  • Neat to see it, though!
  • Verdict: NO

PrinceXML

  • Proprietary; server license $3800
  • Great quality and flexibility; this would be a great choice in the commercial world. :) They have some Wikipedia samples done with a custom two-column stylesheet which are quite attractive.
  • Not being open source, alas, is a killer here.
  • Verdict: NO

CSSToXSLFO

  • Public domain; Java
  • Converts XHTML+CSS2 to XSL-FO, which can then be rendered out to PDF using more open-source components. Seems to be under active development; the last release was in December 2007.
  • Might be pretty nice, but my last experience playing with XSL-FO via Apache FOP in 2005 or so was very painful, with lots of unsupported layout features.
  • Verdict: try me and see