Every once in a while we get a rash of complaints about how Wikipedia renders on somebody’s mobile phone or PDA web browser.

Unfortunately we don’t have many such devices ourselves to test with, and they all behave differently, so we haven’t really had the resources to seriously test tweaks for better phone/handheld support.

My crappy little phone just has some kind of WAP browser, I think, which I’ve never really been able to get to do anything productive. For kicks I took a peek to see if Opera Mini would run on my phone, since I *think* it has some kind of Java… alas, once I finally got connected to the site it claimed my phone isn’t supported.

Opera Mini actually handles Wikipedia pretty decently. The really cool thing is that since the client side is Java, they have an applet version so you can test the actual Opera Mini rendering in any desktop web browser. I LOVE YOU OPERA! YOU MAKE IT POSSIBLE TO TEST MY WEB SITE IN YOUR PRODUCT!!!! IIII LLLLLOOOOVVEEE YYOOOUUUUU!!!!!!!ONE

Of course currently our fundraising drive notice covers the entire screen, but… hey… ;)

Update 2007-01-07: I’m collecting links to mobile testing resources at [[mw:Mobile browser testing]].

There is no Brion, only SUL

In the run-up to Wikimania, I plan to get the single-user-login code base sufficiently up to snuff to at least demo it at the conference, and to be able to snap it into action probably shortly after.

The pieces that need to work:
1) logging in on local DBs
2) new account creation on local DBs
3) migration-on-first-login of matching local accounts on local DBs
4) migration-on-first-login of non-matching local accounts on local DBs
5) renaming-on-first-login of non-matching local accounts on local DBs
6) provision for forced rename on local DBs
7) basic login for remote DBs
8) new account for remote DBs
9) migration for remote DBs
10) profit!

additional goodies:
11) secure login form
12) multiple-domain cookies to allow site-hopping
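
As a rough illustration of the migration logic in steps 3–6, here’s a hypothetical sketch of the first-login decision. The function name and return values are made up for illustration; this is not the actual SUL code:

```python
def first_login_action(has_global_account, local_password_matches):
    """Hypothetical decision for what happens the first time a user
    logs in after the single-user-login switchover."""
    if not has_global_account:
        # Steps 1-2: no global account exists yet, so this local
        # login can found one.
        return "create-global"
    if local_password_matches:
        # Step 3: the local account matches, so attach (migrate) it
        # to the global account.
        return "migrate"
    # Steps 4-6: a clashing local account has to be renamed,
    # voluntarily at first, by force eventually.
    return "rename"
```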

dbzip2 production testing

The English Wikipedia full-history data dump (my arch-nemesis) died again while building due to a database disconnection. I’ve taken the opportunity to clean up dbzip2 a little more and restart the dump build using it.

The client now handles server connections dropping out, and can even reconnect when they come back, so it should be relatively safe for a long-running process. The remote daemon also daemonizes properly, instead of leaving zombies and breaking your terminal.
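
The retry policy can be sketched generically. This is a simplified illustration, not dbzip2’s actual code; the real client would also hand the failed block to another thread or fall back to compressing it locally:

```python
import time

def call_with_reconnect(operation, reconnect, retries=3, delay=1.0):
    # Try the remote operation; on a dropped connection, re-establish
    # it and retry, giving up after `retries` reconnect attempts.
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)
            reconnect()
```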

Using six remote dbzip2d threads, and the faster 7zip decompression for the data prefetch, I’m getting about 6.5 MB/sec of (pre-compression XML) throughput on average, peaking around 11 MB/sec. That’s a factor-of-5 or so improvement over what I was measuring with local threads alone. If this holds up, it should actually complete in “just” two or three days…

Of course that’s assuming the database connection doesn’t drop again! Another thing to improve…
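
A rough sanity check on that estimate, assuming the 6.5 MB/sec average holds (the total dump size implied here is back-of-the-envelope, not a measured figure):

```python
avg_rate_mb = 6.5                        # measured average, MB/sec
per_day_gb = avg_rate_mb * 86400 / 1024  # seconds per day, MB -> GB
print(round(per_day_gb))                 # roughly 548 GB of XML per day
# So "two or three days" corresponds to roughly 1.1-1.6 TB of
# pre-compression XML passing through.
```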

dbzip2 vincit

I’ve managed to bang my dbzip2 prototype into a pretty decent state now, rewriting some of the lower-level bitstream code as a C module while keeping the high-level bits in Python.

It divides the input into proper-sized blocks and combines the output blocks into a single output stream, achieving bit-for-bit compatibility with single-threaded standard bzip2. While it’s still slower than bzip2smp for local threads, I was quite pleased to find it scales to multiple remote threads well enough to really be worth it.

This was using Wikimedia’s database servers: beefy Opteron boxes with gigabit ethernet and usually a lot of idle CPU cycles while they wait on local disk I/O.

The peak throughput on my initial multiple-server tests was about 24 MB/sec with 10 remote threads, and I was able to get over 19 MB/sec on my full gigabyte test file, compressing it in under a minute. With some further work and better stability, this could be really helpful in getting the big data dumps going faster.

Next step: parallel decompression…?

dbzip2 continues

Still fiddling with distributed bzip2 compression in the hopes of speeding up data dump generation.

My first proof of concept worked similarly to the pbzip2 I came across: input was broken into blocks, each block sent out to local (and remote) threads to be separately compressed as its own little bzip2 stream. The streams were then concatenated together in order.

This works surprisingly well, but has serious compatibility problems. While the standard bzip2 utility happily decompresses all the streams you care to place into one input file, other users of the library functions don’t expect this: the tools I needed to work with the dumps would end with the first stream.
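
That first approach is easy to sketch with Python’s bz2 module (modern Python, so treat this as an illustration rather than the prototype itself). Each block becomes its own complete bzip2 stream:

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def pbzip2_style_compress(data, block_size=900 * 1000):
    # Cut the input into fixed-size blocks, compress each block as an
    # independent bzip2 stream (potentially on another machine), and
    # concatenate the finished streams in order.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(bz2.compress, blocks))
```

The bzip2 command-line tool reads all the concatenated streams, but a library consumer that expects a single stream stops after the first one: exactly the compatibility problem above.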

A more compatible implementation should produce a single stream with multiple data blocks in the expected sequence, but the file format doesn’t seem to be well documented.

In my research I came across another parallel bzip2 implementation, bzip2smp. Its author had also found pbzip2 and rejected it, preferring instead to hack the core bzip2 library enough to parallelize the slowest part of it, the Burrows-Wheeler transform. The author claims the output is bit-for-bit identical to the output of regular bzip2, which is obviously attractive.

I’m not 100% sure how easy or practical it would be to extend that to distributed use; the library code is scary and lots of parts are mixed together. Before diving in, I decided to start poking around at the file format and the compression steps to get a better idea of what happened where.

I’ve been putting together some notes on the format. The basic stream itself is slightly complicated by being a true bitstream — blocks in output are not aligned to byte boundaries!
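
Here’s a minimal MSB-first bit packer of the sort that combining blocks requires. It’s an illustration, far simpler and slower than the C module dbzip2 actually needs:

```python
class BitWriter:
    """Pack values MSB-first into a byte buffer. Because bzip2 blocks
    are not byte-aligned, splicing a block into a combined stream means
    feeding its bits through something like this at the current offset."""

    def __init__(self):
        self.buf = bytearray()
        self.acc = 0    # bits accumulated so far, not yet a full byte
        self.nbits = 0

    def write(self, value, width):
        for i in reversed(range(width)):
            self.acc = (self.acc << 1) | ((value >> i) & 1)
            self.nbits += 1
            if self.nbits == 8:
                self.buf.append(self.acc)
                self.acc = 0
                self.nbits = 0

    def flush(self):
        # Pad the final partial byte with zero bits, as bzip2 does at
        # the very end of a stream.
        if self.nbits:
            self.buf.append(self.acc << (8 - self.nbits))
            self.acc = 0
            self.nbits = 0
        return bytes(self.buf)
```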

As a further proof of concept I’ve hacked up dbzip2 to combine the separately compressed blocks into a single stream, complete with a combined CRC at the end so bzip2 doesn’t barf and cut off the end of the data. This should be more compatible, but it’s not suitable for use right now: the bit-shifting is done in Python and way too slow. Additionally the input block cutting is probably not totally reliable, and certainly doesn’t produce the same size blocks as real bzip2.

Replicating the run-length encoding (RLE) that bzip2 performs on input could get the blocks the same size, but at that point it might make more sense to start using the actual library (or a serious hack of it) instead of the scary Python wrapper I’ve got now.
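
For reference, the combined CRC in the stream footer folds the per-block CRCs together with a rotate-then-XOR rule (gleaned from the bzip2 source, since the format itself is underdocumented):

```python
def combined_stream_crc(block_crcs):
    # bzip2's stream-footer CRC: for each block in order, rotate the
    # running 32-bit value left by one bit, then XOR in that block's CRC.
    crc = 0
    for block_crc in block_crcs:
        crc = (((crc << 1) & 0xFFFFFFFF) | (crc >> 31)) ^ block_crc
    return crc
```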

Mindless, safe inline XML generation in PHP

Well, almost…

<?php
class XmlFragment {
	public $text;
	function __construct($text) {
		$this->text = $text;
	}
	function toString() {
		return $this->text;
	}
}

function squash_attribs($attribs) {
	return implode(' ', array_map('squash_attrib',
		array_keys($attribs), array_values($attribs)));
}

function squash_attrib($key, $value) {
	return htmlspecialchars($key) . '="' . htmlspecialchars($value) . '"';
}

function xx() {
	$args = func_get_args();
	$name = array_shift($args);
	$next = array_shift($args);
	if (is_array($next)) {
		$attribs = ' ' . squash_attribs($next);
		$next = array_shift($args);
	} else {
		$attribs = '';
	}
	$contents = null;
	while (null !== $next) {
		if (is_object($next)) {
			// Already-built fragments pass through unescaped...
			$contents .= $next->toString();
		} else {
			// ...while raw strings always get escaped.
			$contents .= htmlspecialchars($next);
		}
		$next = array_shift($args);
	}
	if ($contents === null) {
		// No children: emit a self-closing tag.
		return new XmlFragment("<$name$attribs />");
	} else {
		return new XmlFragment("<$name$attribs>$contents</$name>");
	}
}
$page = xx("html", array("lang" => "en"),
		xx("h1", "Some thingy"),
		xx("p", "This & that. Nothing important..."),
		xx("h2", "Log in"),
		xx("form", array("method" => "post", "action" => "/submit.php"),
			xx("label", array("for" => "user"), "Username:"),
			" ",
			xx("input", array("name" => "user")),
			xx("label", array("for" => "password"), "Password:"),
			" ",
			xx("input", array("name" => "password", "type" => "password")),
			xx("input", array("type" => "submit"))),
		xx("p", "That's , folks!")));
echo $page->toString();