HTML to PDF, why so hard?

I’ve been testing out MediaWiki PDF export using PediaPress’s mwlib & mwlib.rl. This system uses a custom MediaWiki parser written in Python, which then calls out to a PDF generator library to assemble a pretty, printable PDF output file.

The PediaPress folks are responsive to bug reports, but in the long run I worry that this would be a difficult system to maintain. The alternate parser/renderer needs to reimplement not only MediaWiki’s core markup syntax, but also support for every current and future parser or media-format extension we roll out into production.

Something based on the XHTML we already generate would be the most future-proof export system. This could of course be HTML that’s geared specifically for print, say by including higher-resolution images and making use of vector versions of math and SVG more readily, among other things.
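As a rough sketch of the "higher-resolution images" idea: a print-oriented pass over the HTML could swap each thumbnail URL for the URL of the original file. The snippet below assumes the usual MediaWiki thumbnail path layout (`.../thumb/<x>/<xy>/<Name>/<width>px-<Name>`); the function name `full_resolution_url` and the example host are hypothetical, and any URL that does not match the assumed layout is returned unchanged.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed thumbnail file-name prefix, e.g. "180px-Foo.png".
_THUMB_NAME = re.compile(r"^\d+px-")


def full_resolution_url(thumb_url: str) -> str:
    """Map a thumbnail URL to the URL of the original upload.

    Assumes a path shaped like /images/thumb/a/ab/Foo.png/180px-Foo.png.
    URLs that do not look like that are returned untouched.
    """
    parts = urlsplit(thumb_url)
    segments = parts.path.split("/")
    if "thumb" not in segments:
        return thumb_url
    i = segments.index("thumb")
    # Need at least one directory/file segment between "thumb" and the
    # trailing "<width>px-..." segment.
    if i >= len(segments) - 2 or not _THUMB_NAME.match(segments[-1]):
        return thumb_url
    # Drop the "thumb" segment and the trailing scaled file name.
    original_path = "/".join(segments[:i] + segments[i + 1:-1])
    return urlunsplit(parts._replace(path=original_path))


if __name__ == "__main__":
    print(full_resolution_url(
        "http://example.org/images/thumb/a/ab/Foo.png/180px-Foo.png"))
```

The same rule also covers rasterized SVG thumbnails (`300px-Map.svg.png`), since the original file name is taken from the directory segment rather than from the thumbnail name.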

Ideally, we’d be able to use common open-source browser engines like Gecko or WebKit for this — engines we already know render our sites pretty well. Unfortunately there doesn’t yet seem to be a standard kit for using them to do headless print export.

I did some scouring around and found a few other HTML-to-PDF options, starting with those used by some MediaWiki extensions…

HTMLDoc

  • GPL/commercial dual-licence; C
  • Used by Pdf Book and Pdf Export extensions.
  • Seems to have absolutely ancient HTML support… no style sheets, no Asian text, etc…
  • Verdict: NO

dompdf

  • LGPL; PHP
  • Used by Pdf Export Dompdf extension.
  • DOM-based HTML & CSS to PDF converter written in PHP… Sounds relatively cute, but development seems to have fallen off in 2006 and support remains incomplete.
  • Verdict: NO

Googling about I stumbled upon some other fun…

Dynalivery Gecko

  • Commercial? Demo?
  • Online demo of an actual use of Gecko as an HTML-to-PDF print server! Seems to be some commercial thing, and the output quality indicates it’s a very old Gecko, with lots of printing bugs.
  • Neat to see it, though!
  • Verdict: NO

PrinceXML

  • Proprietary; server license $3800
  • Great quality and flexibility; this would be a great choice in the commercial world. :) They have some Wikipedia samples done with a custom two-column stylesheet which are quite attractive.
  • Not being open source, alas, is a killer here.
  • Verdict: NO

CSSToXSLFO

  • Public domain; Java
  • Converts XHTML+CSS2 to XSL-FO, which can then be rendered out to PDF using more open-source components. Seems under active development, last release in December 2007.
  • Might be pretty nice, but my last experience playing with XSL-FO via Apache FOP in 2005 or so was very painful, with lots of unsupported layout features.
  • Verdict: try me and see
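
For a sense of what the intermediate format in an XHTML → XSL-FO → PDF pipeline looks like, here is a minimal sketch that builds a one-page-master XSL-FO document from a title and some paragraphs. This is not how CSSToXSLFO works internally; the function name `paragraphs_to_fo`, the page size, and the font settings are all illustrative assumptions. The output would then be handed to an FO renderer such as Apache FOP.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag: str) -> str:
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def paragraphs_to_fo(title: str, paragraphs: list[str]) -> str:
    """Build a minimal XSL-FO document: one A4 page master, one flow."""
    root = ET.Element(fo("root"))

    # Page geometry (illustrative values).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4",
        "page-width": "210mm",
        "page-height": "297mm",
        "margin-top": "20mm",
        "margin-bottom": "20mm",
        "margin-left": "20mm",
        "margin-right": "20mm",
    })
    ET.SubElement(page, fo("region-body"))

    # Content flows into the body region of the page master above.
    seq = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(seq, fo("flow"), {"flow-name": "xsl-region-body"})

    heading = ET.SubElement(flow, fo("block"), {
        "font-size": "18pt",
        "font-weight": "bold",
        "space-after": "12pt",
    })
    heading.text = title

    for text in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = text

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(paragraphs_to_fo("HTML to PDF", ["First paragraph.", "Second one."]))
```

Even this trivial case shows why the approach gets painful: every CSS layout feature has to be mapped onto FO properties that the downstream renderer actually supports.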

11 thoughts on “HTML to PDF, why so hard?”

  1. The Cairo PDF backend used by Gecko is currently being improved.
    The Mozilla Foundation wanted to have PDF export in Firefox 3; unfortunately this feature seems to have been postponed. IMHO it’s only a question of time before Gecko has a good HTML2PDF feature.

    This way of generating PDF files offers the following advantages:
    * simplicity
    * it’s free
    * it guarantees good rendering (close to the HTML rendering)
    * it does not need maintenance from a dedicated team

    For these reasons… it seems to me to be the most promising solution.

  2. Ideally, you would want to generate an XML document from the wikitext. The XML could then be converted to PDF, OpenOffice or XHTML.

    Have fun with XSL-FO ;o)

  3. I’d love to see a good solution for this. I’m stuck using HTMLDoc right now, and it is fairly painful due to the lack of CSS support.

    I’ll be watching with anticipation ;).

  4. I’ve tried a new one:
    PDFCreator. It’s open source, and the rendering carries over 100% to the PDF.
    It just adds a new driver to the print dialog box, and you have to print through PDFCreator.
    The main thing is that it supports every application, whether it is IE, Mozilla Firefox, Word, Excel, blah blah blah; it doesn’t matter, it just renders a perfect PDF file.
    But it is very slow at rendering, painfully slow.

    http://sourceforge.net/projects/pdfcreator/
