HttpOnly cookies

Thanks to Werdna’s implementation of HttpOnly support, and Tim’s mass upgrade of our older PHP installations, I’ve today enabled HttpOnly cookies on the Wikimedia wikis for our login session data.

“What’s that,” I hear you say, “and why do I want it?”

The HttpOnly marker on cookies tells a supporting browser that the cookie is meant to be used only by the web server directly (sent along with the HTTP request for each page), so the browser will hide the cookie from any client-side JavaScript code that asks for it.
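
For what it’s worth, on PHP 5.2 and later the flag is just an extra argument to setcookie(); a minimal sketch, where the cookie name, token value, and domain are purely illustrative:

    <?php
    // setcookie( name, value, expire, path, domain, secure, httponly )
    // The final true marks the cookie HttpOnly, so a supporting browser
    // won't expose it to document.cookie.
    $sessionToken = 'opaque-session-token';  // placeholder value
    setcookie(
        'wikiSession',     // hypothetical cookie name
        $sessionToken,
        0,                 // expire when the browser session ends
        '/',               // path
        '.example.org',    // domain
        false,             // secure (HTTPS-only) flag, not needed for this example
        true               // HttpOnly
    );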

This provides protection against certain kinds of security vulnerabilities — namely, XSS attacks which steal authenticated session and long-term login token cookies.

HttpOnly doesn’t fix XSS, not by a long shot, but it does reduce what an attacker can do. That’s particularly nice since we’re soon going to start using global login cookies, which will allow a unified account to continue a login session across multiple wikis on different domains.

The same-origin policy prevents JavaScript on one subdomain from directly accessing another domain, so keeping the cross-domain session cookies away from compromised JavaScript will help prevent a hypothetical attack on one domain from jumping to other subdomains that don’t share the vulnerability.

Unfortunately, this marker isn’t standard; it’s an extension which Microsoft added for Internet Explorer in 6.0 SP1, but support has been slowly creeping into other browsers, finally hitting Firefox somewhere in the 2.0 patch cycle while nobody was looking.

Browsers I tested that currently support HttpOnly cookies:

  • IE/Win 6 SP1 or 7
  • Firefox 2.0.0.5 or later
  • Opera 9.50 beta
  • Konqueror (3.4?)

Other browsers will still expose the cookies to JavaScript, as they always have:

  • Safari 3.1
  • Opera 9.27 (current non-Beta release)
  • Old scary browsers like IE for Mac and Netscape 4 ;)

There’s a rumor that some versions of WebTV fail altogether when the cookies are marked this way, but I have no way to confirm or deny that yet.

Update 2008-05-01: Mac IE turns out to eat HttpOnly cookies… sometimes… when the moon is just right. :) Added a browser blacklist, so we feed Mac IE regular cookies. Other browsers are still given the benefit of the doubt.
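
The blacklist check itself is nothing fancy; something along these lines, where the function name and user-agent patterns are my own illustration rather than the actual MediaWiki code:

    <?php
    /**
     * Should we mark cookies HttpOnly for this client?
     * Mac IE reportedly drops HttpOnly cookies outright, so it's
     * blacklisted; everyone else gets the benefit of the doubt.
     */
    function canUseHttpOnly( $userAgent ) {
        $blacklist = array(
            '/MSIE.*Mac_PowerPC/',   // IE for Mac
            '/Mac_PowerPC.*MSIE/',
        );
        foreach ( $blacklist as $pattern ) {
            if ( preg_match( $pattern, $userAgent ) ) {
                return false;
            }
        }
        return true;
    }

    $httpOnly = canUseHttpOnly( $_SERVER['HTTP_USER_AGENT'] );
    setcookie( 'wikiSession', 'opaque-session-token', 0, '/', '', false, $httpOnly );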

SUL status update…

CentralAuth global logins are still restricted to the sysop beta, but Werdna and Tim have been doing some good work on cleaning things up…

  • Tim’s done a lot of code refactoring to clean up User object behavior
  • Werdna’s added support for global sessions based on Tim’s suggested model. Tim and I have helped with some cleanup on it…
  • I put together a threat assessment of the security impact of global session cookies and some mitigation strategies
  • One of my suggestions was to use HttpOnly mode for session and token cookies, where browsers support them. This will largely block XSS attacks from jumping between subdomains or stealing cookies for reuse by an attacker. Werdna’s added support for HttpOnly cookies under PHP 5.2; currently we can’t deploy this until we finish upgrading some of our machines.
  • I’ve enabled global sessions on secure.wikimedia.org, where there’s a single domain and few other services to increase the attack surface. It _seems_ to mostly work so far. ;)

    Logging out doesn’t quite clear all sessions correctly yet, but so far so good. :)

HTML to PDF, why so hard?

I’ve been testing out MediaWiki PDF export using PediaPress’s mwlib & mwlib.rl. This system uses a custom MediaWiki parser written in Python, which then calls out to a PDF generator library to assemble a pretty, printable PDF output file.

The PediaPress folks are responsive to bug reports, but in the long run I worry that this would be a difficult system to maintain: the alternate parser/renderer has to reimplement not only MediaWiki’s core markup syntax, but also every current and future parser or media-format extension we roll out into production use.

Something based on the XHTML we already generate would be the most future-proof export system. This could of course be HTML geared specifically for print, say by including higher-resolution images and making readier use of vector versions of math and SVG, among other things.

Ideally, we’d be able to use common open-source browser engines like Gecko or WebKit for this — engines we already know render our sites pretty well. Unfortunately there doesn’t yet seem to be a standard kit for using them to do headless print export.

I did some scouring around and found a few other HTML-to-PDF options, starting with those used by some MediaWiki extensions…

HTMLDoc

  • GPL/commercial dual-license; C
  • Used by Pdf Book and Pdf Export extensions.
  • Seems to have absolutely ancient HTML support… no style sheets, no Asian text, etc…
  • Verdict: NO

dompdf

  • LGPL; PHP
  • Used by Pdf Export Dompdf extension.
  • DOM-based HTML & CSS to PDF converter written in PHP… Sounds relatively cute, but development seems to have fallen off in 2006 and support remains incomplete.
  • Verdict: NO

Googling about I stumbled upon some other fun…

Dynalivery Gecko

  • Commercial? Demo?
  • Online demo of an actual use of Gecko as an HTML-to-PDF print server! Seems to be some commercial thing, and the output quality indicates it’s a very old Gecko, with lots of printing bugs.
  • Neat to see it, though!
  • Verdict: NO

PrinceXML

  • Proprietary; server license $3800
  • Great quality and flexibility; this would be a great choice in the commercial world. :) They have some Wikipedia samples done with a custom two-column stylesheet which are quite attractive.
  • Not being open source, alas, is a killer here.
  • Verdict: NO

CSSToXSLFO

  • Public domain; Java
  • Converts XHTML+CSS2 to XSL-FO, which can then be rendered out to PDF using more open-source components. Seems under active development, last release in December 2007.
  • Might be pretty nice, but my last experience playing with XSL-FO via Apache FOP in 2005 or so was very painful, with lots of unsupported layout features.
  • Verdict: try me and see

LuceneSearch is dead, long live MWSearch!

I’ve made a couple more cleanup fixes to the core search UI behavior, so namespace selections work more consistently, and have gone ahead and switched it in as the sole search interface on all Wikimedia wikis.

This means the LuceneSearch extension is officially obsolete. The MWSearch extension provides a back-end plugin for MediaWiki’s core search user interface, and all further front-end work should be done in core where it’ll benefit everybody.

Note that many Wikimedia sites have put in local JavaScript hacks to add extra external search options to the form; unfortunately these rely on form IDs specific to the old, now-obsolete extension.

I took the liberty of adapting the English Wikipedia’s JS to work with either case.

Please feel free to pass that fix on to other wikis.

First production <video> tag support lands… without Ogg support

So, Apple pushed out Safari 3.1 for Mac and Windows today, which adds support for the HTML 5 <video> tag… unfortunately, without native Ogg support. :(

Fortunately, it uses QuickTime as the backend, so if you have the XiphQT plugins installed, it will play Ogg Vorbis and Theora files. Yay!

Filed three bugs for our video plugin detection on Safari…

Wikipedia search front-end updates

Ok, first things first: I’ve lifted the old restriction that disabled search context display for visitors who aren’t logged in. This makes searching much nicer!

The restriction was put in place in November 2005 as a temporary emergency measure to relieve pressure on the database servers. In the intervening time our capacity has caught up considerably, and switching the feature back on caused only a modest bump in visible server load.

Looking forward, I’ve set it up so you can test how the new search plugin will look on Wikipedia once we’ve fully enabled it:

The new plugin uses MediaWiki’s core search UI. Right now the only “cool” thing in it is the image thumbnail display, but as we continue to improve it, it’s going to become a lot prettier for all MediaWiki users, not just for Wikipedia.

Update: The core search UI has been partially updated, and now more or less matches the style of the LuceneSearch plugin. Still some tweaks to do, and plenty of further improvements to make!

TODO: MediaWiki’s MySQL search backend

Some problems and solutions…

Problem 0: Wildcard searches don’t work

  • This was fixed in 1.12! You can search for “releas*” and match “release”, “releasing”, etc.
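
For reference, the trailing asterisk maps straight onto MySQL’s boolean-mode fulltext syntax; a rough sketch against the searchindex table (connection details are placeholders, and this isn’t the exact production query):

    <?php
    // "releas*" is passed through as a boolean-mode prefix wildcard,
    // matching any indexed word starting with "releas".
    $db  = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
    $res = $db->query(
        "SELECT si_page
           FROM searchindex
          WHERE MATCH(si_text) AGAINST('releas*' IN BOOLEAN MODE)"
    );
    while ( $row = $res->fetch_assoc() ) {
        echo "Matched page ID {$row['si_page']}\n";
    }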

Problem 1: Minimum length and stopwords

  • People don’t like it when their searches can’t turn up their favorite acronyms and such.
  • You can tweak the MySQL configuration… but only server-wide… and only if you have enough permissions on the server…
  • We can hack a transformation like we do for Unicode: append x00 or such to small words to force them to be indexed.
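
That padding hack could look roughly like this; a sketch of the idea rather than the eventual implementation, and the same transform has to run over both the indexed text and the query terms so the padded forms match:

    <?php
    /**
     * Pad words shorter than MySQL's default ft_min_word_len (4) with a
     * filler suffix so they survive the length cutoff (and dodge the
     * stopword list): "php" becomes "phpx00", which gets indexed fine.
     */
    function padShortWordsCallback( $matches ) {
        return $matches[0] . 'x00';
    }

    function padShortWords( $text ) {
        return preg_replace_callback( '/\b\w{1,3}\b/u', 'padShortWordsCallback', $text );
    }

    // Apply to page text at index time and to query terms at search time.
    echo padShortWords( 'the gnu faq for php' ), "\n";
    // -> "thex00 gnux00 faqx00 forx00 phpx00"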

Problem 2: The table crashes sometimes

  • People often get mystified when the searchindex table is marked crashed.
  • Catch the error: try a REPAIR TABLE transparently, and display a friendlier error if that fails.
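
The transparent repair could be wired in roughly like this (the error number is MySQL’s ER_CRASHED_ON_USAGE; the rest is illustrative):

    <?php
    $db  = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
    $sql = "SELECT si_page FROM searchindex
             WHERE MATCH(si_text) AGAINST('something')";

    $res = $db->query( $sql );
    if ( !$res && $db->errno == 1194 ) {
        // 1194 = ER_CRASHED_ON_USAGE: "Table ... is marked as crashed
        // and should be repaired". Repair quietly and retry once.
        $db->query( 'REPAIR TABLE searchindex' );
        $res = $db->query( $sql );
    }
    if ( !$res ) {
        // Still broken: show something friendlier than a raw DB error.
        die( 'The search index is temporarily unavailable; please try again in a few minutes.' );
    }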

Problem 3: Separate title and text search results are ugly and hard to manage

  • People are used to Google-style searches where you just get one set of results which takes both title and body text into account.
  • Merge the title into the text index and return one set of results only.
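
Concretely, the index update just folds the title words into the body column so a single MATCH covers both; a hedged sketch reusing the existing searchindex columns:

    <?php
    // When (re)indexing a page, prepend the title to the body text so
    // one fulltext match against si_text scores hits in either.
    function updateSearchIndex( mysqli $db, $pageId, $title, $text ) {
        $pageId   = intval( $pageId );
        $combined = $db->real_escape_string( $title . ' ' . $text );
        $db->query(
            "REPLACE INTO searchindex (si_page, si_text)
             VALUES ($pageId, '$combined')"
        );
    }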

Problem 4: Needs to join to ‘page’ table

  • The search joins against the ‘page’ table to do namespace & redirect filtering and to return the original page title for result display. These joins can cause ugly slow locks, mixing up the InnoDB world with the MyISAM world.
  • Denormalize: add fields for namespace, original title, and redirect status to ‘searchindex’ table.
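
Schema-wise that’s just a few extra columns on searchindex plus a backfill from page; the column names here are hypothetical, not a committed schema change:

    <?php
    $db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );

    // Add denormalized copies of the page fields the search needs, so
    // filtering and result display never touch the InnoDB 'page' table.
    $db->query( "
        ALTER TABLE searchindex
            ADD si_namespace   INT NOT NULL DEFAULT 0,
            ADD si_orig_title  VARCHAR(255) NOT NULL DEFAULT '',
            ADD si_is_redirect TINYINT(1) NOT NULL DEFAULT 0
    " );

    // Backfill from the existing page table; after this, the copies have
    // to be kept in sync whenever a page is edited, moved, or redirected.
    $db->query( "
        UPDATE searchindex, page
           SET si_namespace   = page_namespace,
               si_orig_title  = page_title,
               si_is_redirect = page_is_redirect
         WHERE si_page = page_id
    " );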