Node.js and web workers: safe multiprocess CLI JS

I’ve been fiddling with the Node.js CLI/server-side JavaScript environment for a number of experiments, and have started using it to build batch tests for the new MediaWiki parser work.

Our parsing experiments are being started in JavaScript so we can bundle them with the in-progress editing tools and run them on existing MediaWiki sites as a gadget for testing; as things mature, a PHP version will be written to take over from the older parser classes in MediaWiki proper.

For a simple batch test, I’ve got a JavaScript module for Node.js that processes a Wikipedia XML data dump, runs each page’s text through the parser, and serializes the result back to text to check for any round-tripping problems.
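The round-trip check itself can be sketched roughly like this. It is only an illustration: checkRoundTrip, toyParse and toySerialize are made-up names, and the toy parser just splits on newlines rather than doing anything like the real MediaWiki parsing.

```javascript
// Hypothetical round-trip check: parse the text, serialize the result back,
// and report whether the output matches the input exactly.
function checkRoundTrip(title, text, parse, serialize) {
    var output;
    try {
        output = serialize(parse(text));
    } catch (e) {
        // A parser crash counts as a failure for this page.
        return { title: title, ok: false, error: String(e) };
    }
    return { title: title, ok: output === text, output: output };
}

// Toy stand-ins for the real parser and serializer (assumptions for the demo).
function toyParse(text) {
    return text.split('\n');
}
function toySerialize(lines) {
    return lines.join('\n');
}

var result = checkRoundTrip('Sandbox', "''Hello''\n* item", toyParse, toySerialize);
console.log(result.title, result.ok ? 'round-trips' : 'MISMATCH');
```

Passing the parse and serialize functions in as arguments keeps the check independent of whichever parser is under test.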

There are a few gotchas working with Node.js, but it’s generally a pretty nice way to get started!

Getting started: running browser code in Node

The parsing code so far has been designed to run in-browser, loaded by MediaWiki’s ResourceLoader system.

Some of that code uses jQuery helper functions like $.each(); jQuery can be provided easily via an npm module, but there’s still a trick.

In the browser, global vars from every script go implicitly into the shared global namespace (usually the ‘window’ object) and get picked up by other scripts, so they need only reference ‘$’ or ‘PegParser’. But in Node.js’s module system, implicit globals end up being private to each module! Only things exposed through module.exports can be accessed from your other scripts.

The wrapped module for jQuery already handles the export, so to make it available to my parser scripts I can copy it into Node’s explicit ‘global’ namespace object, which makes things available from any module:

// For now most modules only need this for $.extend and $.each :)
global.$ = require('jquery');

For my own parser scripts, it was easy to have them export their public functions/classes through module.exports when it’s present, letting me use the same code in the browser and in Node:

if (typeof module == "object") {
    module.exports.PegParser = PegParser;
}

Conveniently the PEG.js library already did that for me. :)

Web workers: safe multithreading

Recent web browser standards have introduced the Web Workers system for doing multi-threaded JavaScript safely. Unlike traditional threading, the workers don’t share any direct state or context with the parent; they’re essentially separate JavaScript programs that can communicate only through JSON message-passing.
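To make the "no shared state" point concrete, here is a tiny sketch that assumes messages are serialized as JSON, as described above. sendAsJson is a made-up helper that only imitates what happens to a message in transit; it is not part of any worker API.

```javascript
// Imitate a message crossing the worker boundary: the receiver gets a copy
// rebuilt from JSON text, never the sender's own object.
function sendAsJson(message) {
    return JSON.parse(JSON.stringify(message));
}

var page = { title: 'Main Page', text: "''Hello''", onDone: function () {} };
var received = sendAsJson(page);

// Changes on the "worker" side never reach the parent's object.
received.title = 'changed in worker';
console.log(page.title);             // still 'Main Page'

// Functions can't be represented in JSON, so they are silently dropped.
console.log(typeof received.onDone); // 'undefined'
```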

While this is in some ways limited, it does mean you can skip most of the horrible synchronization primitives and confusing explody things from low-level threading that you might be used to in Java or C++!

It’s also been implemented for Node.js, as a module that spawns workers as subprocesses communicating over Unix domain sockets. It may not sound much fancier than just spawning a subprocess and piping explicitly, but you get a pre-defined message-passing architecture that’s more suitable to structured data than a raw byte stream pipe.

To scale automated batch tests over multiple CPU cores, I moved the parsing portion of the test runner into a worker. Spawn a few of those, and as revisions come in from the wiki dump we pass them down to whichever workers are free, pausing the input when several are queued up so the backlog doesn’t pile up in the main process.
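A minimal sketch of that dispatch-and-pause logic might look like the following. Dispatcher, FakeWorker and the maxQueued threshold are all invented for illustration. The only things assumed of a worker are the Web Worker style postMessage(), onmessage and onerror, and the input is assumed to be anything with pause() and resume(), such as a readable stream.

```javascript
// Hand queued pages to idle workers; pause the input when too many pile up.
function Dispatcher(workers, input, maxQueued) {
    var self = this;
    this.input = input;          // anything with pause() and resume()
    this.maxQueued = maxQueued;  // queue length at which the input is paused
    this.queue = [];             // pages waiting for a free worker
    this.idle = workers.slice(); // workers with nothing to do right now
    this.results = [];
    this.paused = false;
    workers.forEach(function (worker) {
        worker.onmessage = function (event) {
            // A reply means this worker is free again.
            self.results.push(event.data);
            self.idle.push(worker);
            self.pump();
        };
        worker.onerror = function (err) {
            console.error('worker error:', err);
        };
    });
}

Dispatcher.prototype.add = function (page) {
    this.queue.push(page);
    this.pump();
};

Dispatcher.prototype.pump = function () {
    // Give every idle worker something to do while pages are waiting.
    while (this.idle.length && this.queue.length) {
        this.idle.shift().postMessage(this.queue.shift());
    }
    if (!this.paused && this.queue.length >= this.maxQueued) {
        this.paused = true;
        this.input.pause();
    } else if (this.paused && this.queue.length < this.maxQueued) {
        this.paused = false;
        this.input.resume();
    }
};

// In-process stand-in for a real worker, so the sketch runs on its own.
function FakeWorker() {}
FakeWorker.prototype.postMessage = function (page) {
    var self = this;
    setImmediate(function () {
        self.onmessage({ data: { title: page.title, ok: true } });
    });
};

// Stand-in for the dump reader; it just counts pause/resume calls.
var fakeInput = {
    pauses: 0,
    resumes: 0,
    pause: function () { this.pauses++; },
    resume: function () { this.resumes++; }
};

var dispatcher = new Dispatcher([new FakeWorker(), new FakeWorker()], fakeInput, 2);
['A', 'B', 'C', 'D', 'E'].forEach(function (title) {
    dispatcher.add({ title: title, text: 'some wikitext' });
});
```

With two workers and a threshold of two, the first two pages go straight out, and the input is paused once two more are waiting; it resumes when replies drain the queue below the threshold.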

Another nice thing is that this same model, in theory, should work for a browser-based version of the tests. (oooh!)

Web workers in Node: mostly sane

I hit a few gotchas which took me a while to figure out…

First, you might not see all errors from the worker script. Your life will be made much simpler if you define an onerror handler in your parent thread! Most of the explosive confusing errors below will at least show up there…

Another early one: the local path doesn’t seem to get set properly for the worker script, so you can’t, for instance, do a require() module load with a relative path. I’m working around this by sending the path as an ‘init’ message to the worker, which can then start loading up its other modules and files.

The most insidious though… is that you can’t modify the ‘global’ namespace object from your worker. Whaaaa? Since I need to do that to put jQuery and my own browser-oriented modules into the global namespace, for now I just split all the actual work bits into a second module that I require() from the worker itself. Amazingly this works just fine; I just have to pass a couple of functions from the worker context itself to the other module. :D
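Here is a rough sketch of how the ‘init’ message and the split into a second module could fit together on the worker side. Everything named here is an assumption for illustration: makeWorkerHandler, the message shapes, the file name parser-work.js, and the setup()/handlePage() functions that second module is imagined to export. The load parameter exists only so the sketch can run without real files on disk.

```javascript
var path = require('path');

// Worker-side sketch. The worker script stays tiny: it waits for an 'init'
// message carrying the base path, then loads a second module (hypothetically
// 'parser-work.js') which is free to set up its own globals.
function makeWorkerHandler(send, load) {
    load = load || require; // injectable so the sketch runs without real files
    var work = null;        // the second module, loaded on 'init'

    return function onMessage(msg) {
        if (msg.type === 'init') {
            // Build an absolute path, since relative require() fails here.
            work = load(path.resolve(msg.basePath, 'parser-work.js'));
            // Pass the worker's own send function down to the other module.
            work.setup(send);
            return;
        }
        if (work === null) {
            throw new Error('received "' + msg.type + '" before "init"');
        }
        work.handlePage(msg.title, msg.text);
    };
}

// Demo with stand-ins for the worker's send function and the second module.
var sent = [];
var loadedPaths = [];
var fakeWork = {
    setup: function (send) { this.send = send; },
    handlePage: function (title, text) {
        this.send({ title: title, length: text.length });
    }
};
var handler = makeWorkerHandler(
    function (message) { sent.push(message); },
    function (modulePath) { loadedPaths.push(modulePath); return fakeWork; }
);

handler({ type: 'init', basePath: '/tmp/tests' });
handler({ type: 'page', title: 'Sandbox', text: 'abc' });
console.log(sent.length + ' result(s) sent back');
```

In a real worker script the handler would be hooked up along the lines of onmessage = function (e) { handler(e.data); }, and the parent would send something like { type: 'init', basePath: __dirname } as its first message.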