emscripten fun: porting libffi to WebAssembly part 1

I have a silly dream of seeing graphical Linux/FOSS programs running portably on any browser-based system outside of app store constraints. Two of my weekend side projects are working in this direction: an x86 emulator core to load and run ELF binaries, and better emscripten cross-compilation support for the GTK+ stack.

Emulation ideas

An x86 emulator written in WebAssembly could run pre-built Linux binaries, meaning in theory you could make an automated packager for anything in a popular software repository.

But even if you do all the hard work of making a process-level emulator, and hook up the Linux-level libraries to emulated devices for i/o and rendering, there are some big performance implications, and you’re probably also bundling lots of library code you don’t need at runtime.

Instruction decoding and dispatch will be slow, much slower than native. And it looks pretty complex to do JIT-ing of traces. While I think it could be made to work in principle, I don’t think it’ll ever give a satisfactory user experience.

Cross-compilation

Since we’ve got the source of Linux/FOSS programs by definition, cross-compiling them directly to WebAssembly will give far better performance!

In theory even something from the GNOME stack would work, given an emscripten-specific GDK backend rendering to a WebGL canvas, just like games use SDL2 or similar to wrap i/o.

But long before we can get to that, there are low-level library dependencies.

Let’s start with GLib, which implements a bunch of runtime functions and the GObject type system, used throughout the stack.

GLib needs libffi, a library for calling functions whose call signatures are determined at runtime, and for creating closure functions which enclose a state variable around a callback.

In other words, libffi needs to do things that you cannot do in standard C, because it needs system-specific information about how function arguments and return values are transferred (part of the ABI, application binary interface). And to top it off, in many cases (emscripten included) you still can’t do it in C, because asm.js and WebAssembly provide no way to make a call with an arbitrary argument list. So, like a new binary platform, libffi must be ported…

It seems to be doable by bumping up to JavaScript, where you can construct an array of mixed-type arguments and use Function.prototype.apply to call the target function. Using an EM_ASM_ block in my shiny new wasm32/ffi.c I was able to write a JavaScript implementation of the guts of ffi_call which works for int, float, and double parameters (still have to implement 64-bit ints and structs).
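
Something like this, sketched and simplified (not the actual wasm32/ffi.c; getValue/setValue are emscripten’s heap accessors, and the table lookup varies by emscripten version: older builds go through the generated dynCall_* helpers, newer ones expose a WebAssembly.Table):

function ffi_call_js(fn, returnType, argTypes, rvalue, avalue) {
    // Build a mixed-type argument array from libffi's array of pointers.
    var args = [];
    for (var i = 0; i < argTypes.length; i++) {
        var ptr = getValue(avalue + i * 4, 'i32'); // pointer to the i'th argument
        switch (argTypes[i]) {
            case 'int':    args.push(getValue(ptr, 'i32'));    break;
            case 'float':  args.push(getValue(ptr, 'float'));  break;
            case 'double': args.push(getValue(ptr, 'double')); break;
            // 64-bit ints and structs still to do, as noted above
        }
    }
    // The "function pointer" is an index into the module's function table.
    var target = wasmTable.get(fn);
    var result = target.apply(null, args);
    if (returnType !== 'void') {
        setValue(rvalue, result, returnType === 'int' ? 'i32' : returnType);
    }
}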

The second part of libffi is the closure creation API, which I think requires creating a closure function in the JavaScript side, inserting it into the module’s function tables, and then returning the appropriate index as its address. This should be doable, but I haven’t started yet.

Emscripten function pointers

There are two output targets for the emscripten compiler: asm.js JavaScript and WebAssembly. They have similar capabilities and the wrapper JS is much the same in both, but there are some differences in implementation and internals as well as the code format.

One is in function tables for indirect calls. In both cases, the low-level asm.js/WASM code can’t store the actual pointer address of a function, so they use an index into a table of functions. Any function whose address is taken at compile time is added to the table, and its index used as the address. Then, when an indirect call through a “function pointer” is made, the pointer is used as the index into the function table, and an actual function call is made on it. Voila!

In asm.js, there are lots of Weird Things done about type consistency to make the JavaScript compilers fast. One is that the JS compiler gets confused if you have an array of functions that _contain different signatures_, making indirect calls run through slower paths in final compiled code. So for each distinct function signature (“returns void” or “returns int, called with float32” etc) there was a separate array. This also means that function pointers have meaning only with their exact function signature — if you take a pointer to a function with a parameter and call it without one, it could end up calling an entirely different function at runtime because that index has a different meaning in that signature!
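
For illustration, the generated asm.js comes out roughly like this (names invented here; real output is minified):

// One table per signature, padded to a power-of-two size,
// with the index masked to the table length:
var FUNCTION_TABLE_vi = [nullFunc_vi, _handleEvent, nullFunc_vi, nullFunc_vi];
var FUNCTION_TABLE_ii = [nullFunc_ii, _getLength, nullFunc_ii, nullFunc_ii];

function dynCall_vi(index, a0) {
    index = index | 0;
    a0 = a0 | 0;
    FUNCTION_TABLE_vi[index & 3](a0 | 0);
}

So “pointer” 1 is _handleEvent if called as void(int), but _getLength if called as int(int): the same index means entirely different functions in different signature tables.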

In WebAssembly, this is handled differently. Signatures are encoded at the call site in the call_indirect opcode, so no type inference needs to be done.

But.

At least currently, the asm.js-style table separation is still being used, with the multiple tables encoded into the single WebAssembly table with a compile-time-known constant offset.

In both cases, the JS side can do indirect calls by calculating the signature string (“vf” for “void return, float32 arg” etc) and calling the appropriate “dynCall_vf” etc method, passing first the pointer and then the rest of the argument list. On asm.js this will look up in the signature’s table directly; on WASM it’ll add the signature’s constant offset and index into the single table.


It’s possible that emscripten will change the WASM mode to use a single array without the constant offset indirection. This will simplify lookups, and I think make it easier to add more functions at runtime.

Because you see, if you want to add a callback at runtime, like libffi’s closure API wants to, then you need to add another entry to that table. And in asm.js the table sizes are fixed by asm.js validation rules, while in the current WASM mode the sub-tables are definitely fixed at compile time, since those constant offsets are used throughout.

So currently there’s an option you can use at build time to reserve room for runtime function pointers. I think I’ll have to use it, but it only reserves *fixed space* for a given number of pointers.
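
That’s the RESERVED_FUNCTION_POINTERS setting, used together with the runtime addFunction() helper. A rough sketch (whether a signature string is required, and where the helper lives, varies by emscripten version):

// At build time:
//   emcc ... -s RESERVED_FUNCTION_POINTERS=16
// At runtime, from the JavaScript side:
var fp = Module.addFunction(function (a, b) {
    return (a + b) | 0;
}, 'iii'); // signature: returns int, takes two ints
// fp can now be handed to C code as an int (*)(int, int) "pointer".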

Next

Coming up next time: int64 and struct in the emscripten ABI, and does the closure API work as expected?

String concatenation garbage collection madness!

We got a report of a bug with the new 3D model viewing extension on Wikimedia Commons, where a particular file wasn’t rendering a thumbnail due to an out-of-memory condition. The file was kind-of big (73 MiB) but not super huge, and should have been well within the memory limits of the renderer process.

On investigation, it turned out to be a problem with how three.js’s STLLoader class was parsing the ASCII variant of the file format:

  • First, the file is loaded as a binary ArrayBuffer
  • Then, the buffer is checked to see whether it contains binary or text-format data
  • If it’s text, the entire buffer is converted to a string for further processing

That conversion step had code that looked roughly like this:

var str = '';
for (var i = 0; i < arr.length; i++) {
    str += String.fromCharCode(arr[i]);
}
return str;

Pretty straightforward code, right? It appends one character to the string until the input binary array is exhausted, then returns the result.

Well, JavaScript strings are actually immutable — the “+=” operator is just shorthand for “str = str + …”. This means that on every pass through the loop, we create two new strings: one for the character, and a second for the concatenation of the previous string with the new character.

The JavaScript virtual machine’s automatic garbage collection is supposed to magically de-allocate the intermediate strings once they’re no longer referenced (at some point after the next run through the loop) but for some reason this isn’t happening in Node.js. So when we run through this loop 70-some million times, we get a LOT of intermediate strings still in memory and eventually the VM just dies.

Remember this is before any of the 3d processing — we’re just copying bytes from a binary array to a string, and killing the VM with that. What!?

Newer versions of the STLLoader use a more efficient path through the browser’s TextDecoder API, which we can polyfill in Node using Buffer, making it blazing fast and memory-efficient. This seems to fix the thumbnailing for this file in my local testing.
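
The conversion then boils down to something like this (a hedged sketch; STL text is ASCII, so a latin1 decode of the bytes is safe):

function arrayToString(arr) { // arr is a Uint8Array
    if (typeof TextDecoder !== 'undefined') {
        // Browsers (and newer Node): decode natively in one shot.
        return new TextDecoder('latin1').decode(arr);
    }
    // Node fallback: Buffer can decode the same bytes natively too.
    return Buffer.from(arr.buffer, arr.byteOffset, arr.byteLength)
                 .toString('latin1');
}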

Just for fun though I thought, what would it take to get it working in Node or Chrome without the fancy native API helpers? Turns out you can significantly reduce the memory usage of this conversion just by switching the order of operations….

The original append code results in operations like: (((a + b) + c) + d) which increases the size of the left operand linearly as we go along.

If we instead do it like ((a + b) + (c + d)) we’ll increase _both_ sides more slowly, leading to much smaller intermediate strings clogging up the heap.

Something like this, with a sort of binary bisection thingy:

function do_clever(arr, start, end) {
    if (start === end) {
        return '';
    } else if (start + 1 === end) {
        return String.fromCharCode(arr[start]);
    } else {
        var mid = start + Math.floor((end - start) / 2);
        return do_clever(arr, start, mid) +
               do_clever(arr, mid, end);
    }
}

return do_clever(arr, 0, arr.length);

Compared to the naive linear append, this version actually makes it through the 73 MiB file in Node without blowing up memory, and it’s a bit faster too.

But it turns out there’s not much reason to use that code — most browsers have native TextDecoder (even faster) and Node can fake it with another native API, and those that don’t are Edge and IE, which have a special optimization for appending to strings.

Yes that’s right, Edge 16 and IE 11 actually handle the linear append case significantly faster than the clever version! It’s still not _fast_, with a noticeable delay of a couple seconds on IE especially, but it works.

So once the thumbnail fix goes live, that file should work both in the Node thumbnailer service *and* in browsers with native TextDecoder *and* in Edge and IE 11. Yay!

emscripten versus IE 11: arithmetic optimization for ogv.js

ogv.js is a web video & audio playback engine for supporting the free & open Ogg and WebM formats in browsers that don’t support them natively, such as Safari, Edge, and Internet Explorer. We use it at Wikipedia and Wikimedia Commons, where we don’t currently allow the more common MP4 family of file formats due to patent concerns.

IE 11, that old nemesis, still isn’t quite gone, and it’s definitely the hardest to support. It’s several years old now, with all new improvements going only into Edge on Windows 10… so no WebAssembly, no asm.js optimizations, and in general it’s just kind of ….. vveerryy ssllooww compared to any more current browser.

But for ogv.js I still want to support it as much as possible. I found that for WebM videos using the VP8 or VP9 codecs, there was a *huge* slowdown in IE compared to other browsers, and wanted to see if I could pick off some low-hanging fruit to at least reduce the gap a bit and improve playback for devices right on the edge of running smoothly at low resolutions…

Profiling in IE is a bit tough since the dev tools often skew JS performance in weird directions… but profiling always showed that a large bottleneck was the Math.imul polyfill.

Math.imul, on supporting browsers, is a native function that implements 32-bit integer multiplication correctly and very very quickly, including all the weird overflow conditions that can result from multiplying large numbers — this is used in the asm.js code produced by the emscripten compiler to make sure that multiplication is both fast and correct.

But on IE 11 it’s not present, so a replacement function (“polyfill”) is used by emscripten instead. This does several bit shifts and a couple of multiplications; even when the JIT compiler inlines the function, it’s slower than necessary.
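
The polyfill is essentially the standard one: split each operand into 16-bit halves so the partial products stay exact in floating point, then recombine modulo 2^32.

Math.imul = Math.imul || function (a, b) {
    var ah = (a >>> 16) & 0xffff,
        al = a & 0xffff,
        bh = (b >>> 16) & 0xffff,
        bl = b & 0xffff;
    // The ah*bh product would overflow past 32 bits entirely, so only
    // al*bl and the two cross terms (shifted up 16 bits) survive.
    return ((al * bl) + (((ah * bl + al * bh) << 16) >>> 0)) | 0;
};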

I hacked together a quick test to search the generated asm.js code for calls to the minified reference to Math.imul, and replace them with direct multiplication… and found significant performance improvements!

I also found it broke some of the multiplications, though, because the naive text substitution got the order of operations wrong. So I replaced the regex on the code with a corrected transformation that uses a proper JS parser, walks the tree for call sites, and replaces them with direct multiplication. After some more confusion with my benchmarking, I confirmed that the updated code was still faster.
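
The precedence trap looks like this (a made-up minified call, not from the actual codebase):

Math.imul(x + 1, y)    // original call in the asm.js output
x + 1 * y              // naive textual substitution: multiplies first. wrong!
(((x + 1) * y) | 0)    // parenthesized and truncated: a safe replacement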

The result was about a 15-20% improvement, plus or minus, which seems a pretty significant bump!

Of course more modern browsers like current versions of Safari and Edge will use the WebAssembly version of ogv.js anyway, and are several times faster…

 

Brain dump: JavaScript sandboxing

Another thing I’ve been researching is safe, sandboxed embedding of user-created JavaScript widgets… my last attempt in this direction was the EmbedScript extension (examples currently down, but code is still around).

User-level problems to solve:

  • “Content”
    • Diagrams, graphs, and maps would be more fun and educational if you could manipulate them more
    • What if you could graph those equations on all those math & physics articles?
  • Interactive programming sandboxes
  • Customizations to editor & reading UI features
    • Gadgets, site JS, shared user JS are potentially dangerous right now, requiring either admin review or review-it-yourself
    • Narrower interfaces and APIs could allow for easier sharing of tools that don’t require full script access to the root UI
  • Make scriptable extensions safer
    • Use same techniques to isolate scripts used for existing video, graphs/maps, etc?
    • Frame-based tool embedding + data injection could make export of rich interactive stuff as easy as InstantCommons…

Low-level problems to solve:

  • Isolating user-provided script from main web context
  • Isolating user-provided script from outside world
    • loading off-site resources is a security issue
    • want to ensure that wiki resources are self-contained and won’t break if off-site dependencies change or are unavailable
  • Providing a consistent execution environment
    • browsers shift and change over time…
  • Communicating between safe and sandboxed environments
    • injecting parameters in safely?
    • two-way comms for allowing privileged operations like navigating page?
    • two-way comms for gadget/extension-like behavior?
    • how to arrange things like fullscreen zoom?
  • Potential offline issues
    • offline cacheability in browser?
    • how to use in Wikipedia mobile apps?
  • Third-party site issues
    • making our scripts usable on third-party wikis like InstantCommons
    • making it easy for third-party wikis to use these techniques internally

Meta-level problems to solve:

  • How & how much to review code before letting it loose?
  • What new problems do we create in misuse/abuse vectors?

Isolating user-provided scripts

One way to isolate user-provided scripts is to run them in an interpreter! This is potentially very slow, but allows for all kinds of extra tricks.

JS-Interpreter

I stumbled on JS-Interpreter, used sometimes with the Blockly project to step through code generated from visual blocks. JS-Interpreter implements a rough ES5 interpreter in native JS; it’s quite a bit slower than native (though some speedups are possible; the author and I have made some recent tweaks improving the interpreter loop) but is interesting because it allows single-stepping the interpreter, which opens up to a potential for an in-browser debugger. The project is under active development and could use a good regression test suite, if anyone wants to send some PRs. :)

The interpreter is also fairly small, weighing in around 24kb minified and gzipped.

The single-stepping interpreter design protects against infinite loops, as you can implement your own time limit around the step loop.
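
A minimal sketch of that, using JS-Interpreter’s actual step API (new Interpreter(code), then step(), which returns true while there’s more code to run):

var interp = new Interpreter(userCode);
var deadline = Date.now() + 1000; // give the script a one-second budget

while (interp.step()) {
    if (Date.now() > deadline) {
        throw new Error('user script timed out');
    }
}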

For pure-computation exercises and interactive prompts this might be really awesome, but the limited performance and lack of any built-in graphical display means it’s probably not great for hooking it up to an SVG to make it interactive. (Any APIs you add are your own responsibility, and security might be a concern for API design that does anything sensitive.)

Caja

An old project that’s still around is Google Caja, a heavyweight solution for embedding foreign HTML+JS using a server-side Java-based transpiler for the JS and JavaScript-side proxy objects that let you manipulate a subset of the DOM safely.

There are a number of security advisories in Caja’s history; some of them are transpiler failures which allow sandboxed code to directly access the raw DOM, others are failures in injected APIs that allow sandboxed code to directly access the raw DOM. Either way, it’s not something I’d want to inject directly into my main environment.

There’s no protection against loops or simple resource usage like exhausting memory.

Iframe isolation and CSP

I’ve looked at using cross-origin <iframe>s to isolate user code for some time, but was never quite happy with the results. Yes, the “same-origin policy” of HTML/JS means your code running in a cross-origin frame can’t touch your main site’s code or data, but that code is still able to load images, scripts, and other resources from other sites. That creates problems ranging from easy spamming to user information disclosure to simply breaking if required offsite resources change or disappear.

Content-Security-Policy to the rescue! Modern browsers can lock down things like network access using CSP directives on the iframe page.

CSP’s restrictions on loading resources still leave an information disclosure in navigation — links or document.location can be used to navigate the frame to a URL on a third domain. This can be locked down with CSP’s child-src directive on the parent document — or an intermediate “double” iframe — to only allow the desired target domain (say, “*.wikipedia-embed.org” or even “item12345678.wikipedia-embed.org”). Then attempts to navigate the frame to a different domain from the inside are blocked.
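
For example, the pieces might fit together like this (real CSP directive names; the domains are the placeholders from above):

# Sent with the embedded page on the isolated domain: no off-site loads at all.
Content-Security-Policy: default-src 'none'; script-src 'self'; style-src 'self'; img-src 'self'

# Sent with the parent document: frames may only load or navigate within the embed domain.
Content-Security-Policy: child-src https://*.wikipedia-embed.org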

So in principle we can have a rectangular region of the page with its own isolated HTML or SVG user interface, with its own isolated JavaScript running its own private DOM, with only the ability to access data and resources granted to it by being hosted on its private domain.

Further interactivity with the host page can be created by building on the postMessage API, including injecting additional resources or data sets. Note that postMessage is asynchronous, so you’re limited in simulating function calls to the host environment.
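
A hedged sketch of that channel (the origins are placeholders):

// Host page: inject a data set into the sandboxed frame.
var frame = document.getElementById('widget-frame');
frame.contentWindow.postMessage(
    { type: 'dataset', payload: dataset },
    'https://item12345678.wikipedia-embed.org' // deliver only to this origin
);

// Inside the frame: accept messages only from the host origin.
window.addEventListener('message', function (event) {
    if (event.origin !== 'https://en.wikipedia.org') {
        return;
    }
    if (event.data.type === 'dataset') {
        render(event.data.payload); // render() is the widget's own code
    }
});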

There is one big remaining security issue, which is that JS in an iframe can still block the UI for the whole page (or consume memory and other resources), either accidentally with an infinite loop or on purpose. The browser will eventually time out from a long loop and give you the chance to kill it, but it’s not pleasant (and might just be followed by another super-long loop!)

This means denial of service attacks against readers and editors are possible. “Autoplay” of unreviewed embedded widgets is still a bad idea for this reason.

Additionally, older browser versions don’t always support CSP — IE is a mess for instance. So defenses against cross-origin loads either need to somehow prevent loading in older browsers (poorer compatibility) or risk the information exposure (poorer security). However the most popular browsers do enforce it, so applications aren’t likely to be built that rely on off-site materials just to function, which is one of the things we want to prevent.

Worker isolation

There’s one more trick, just for fun, which is to run the isolated code in a Web Worker background thread. This would still allow resource consumption but would prevent infinite loops from blocking the parent page.
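
A quick sketch of the watchdog this enables (terminate() hard-stops the thread no matter what it’s doing):

var worker = new Worker('sandboxed-widget.js');
var watchdog = setTimeout(function () {
    worker.terminate(); // kills even an infinite loop
}, 5000);

worker.onmessage = function (event) {
    clearTimeout(watchdog);
    handleResult(event.data); // handleResult() is host-page code
};
worker.postMessage({ input: dataset });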

However you’re back to the interpreter’s problem of having no DOM or user interface, and must build a UI proxy of some kind.

Additionally, there are complications with running Workers in iframes: if you apply sandbox=allow-scripts to the frame, you may not be able to load JS into a Worker at all.

Non-JavaScript languages

Note that if you can run JavaScript, you can run just about anything thanks to emscripten. ;) A cross-compiled Lua interpreter weighs in around 150-180kb gzipped (depending on library inclusion).

Big chart

Here, have a big chart I made for reference:

Offline considerations

In principle the embedding sites can be offline-cached… bears consideration.

App considerations

The iframes could be loaded in a webview in apps, though consider the offline + app issues!

Data model

A widget (or whatever you call it) would have one or more subresources, like a Gadget does today, plus more:

  • HTML or SVG backing document
  • JS/CSS module(s), probably with a dependency-loading system
  • possibly registration for images and other resources?
    • depending on implementation it may be necessary to inject images as blobs or some weird thing
  • for non-content stuff, some kind of registry for menu/tab setup, trigger events, etc

Widgets likely should be instantiable with input parameters like templates and Lua modules are; this would be useful for things like reusing common code with different input data, like showing a physics demo with different constant values.

There should be a human-manageable UI for editing and testing these things. :) See jsfiddle etc for prior art.

How to build the iframe target site

Possibilities:

  • Subdomain per instance
    • actually serve out the target resources on a second domain, each ‘widget instance’ living in a separate random subdomain ideally for best isolation
    • base HTML or SVG can load even if no JS. Is that good or bad, if interactivity was the goal?
    • If browser has no CSP support, the base HTML/CSS/JS might violate constraints.
    • can right-click and open frame in new window
    • …but now you have another out of context view of data, with new URLs. Consider legal, copyright, fairuse, blah blah
    • have to maintain and run that second domain and hook it up to your main wiki
    • how to deal with per-instance data input? Pre-publish? postMessage just that in?
      • injecting data over postMessage maybe best for the InstantCommons-style scenario, since sites can use our scripts but inject data
    • probably easier debugging based on URLs
  • Subdomain per service provider, inject resources and instance data
    • Inject all HTML/SVG/JS/CSS at runtime via postMessage (trusting the parent site origin). Images/media could either be injected as blobs or whitelisted by URL.
    • The service provider could potentially be just a static HTML file served with certain strict CSP headers.
    • If injecting all resources, then could use a common provider for third-party wikis.
      • third-party wikis could host their own scripts using this technique using our frame broker. not sure if this is good idea or not!
    • No separate content files to host, nothing to take down in case of legal issues.
    • Downside: right-clicking a frame to open it in new window won’t give useful resources. Possible workarounds with providing a link-back in a location hash.
    • Script can check against a user-agent blacklist before offering to load stuff.
    • Downside: CSP header may need to be ‘loose’ to allow script injection, so could open you back up to XSS on parameters. But you’re not able to access outside the frame so pssssh!

Abuse and evil possibilities

Even with the security guarantees of origin restrictions and CSP, there are new and exciting threat models…

Simple denial of service is easy — looping scripts in an iframe can lock up the main UI thread for the tab (or the whole browser, depending on the browser) until it eventually times out with an error. At which point it can potentially go right back into a loop. Or you can allocate tons of memory, slowing down and eventually perhaps crashing the browser. Even tiny programs can have huge performance impact, and it’s hard to predict what will be problematic. A script on a page could thus make it hard for other editors and admins to get back in to fix the page… For this reason I would recommend against autoplay of arbitrary unreviewed code in Wikipedia articles.

There’s also possible trolling patterns: hide a shock image in a data set or inside a seemingly safe image file, then display it in a scriptable widget bypassing existing image review.

Advanced widgets could do all kinds of fun and educational things like run emulators for old computer and game systems. That brings with it the potential for copyright issues with the software being run, or for newer systems patent issues with the system being emulated.

For that matter you could run programs that are covered under software patents, such as decoding or encoding certain video file formats. I guess you could try that in Lua modules too, but JS would allow you to play or save result files to disk directly from the browser.

WP:BEANS may apply to further thoughts on this road, beware. ;)

Ideas from Jupyter: frontend/backend separation

Going back to Jupyter/IPython as an inspiration source; Jupyter has a separation between a frontend that takes interactive input and displays output, and a backend kernel which runs the actual computation server-side. To make for fancier interactive displays, the output can have a widget which runs some sort of JavaScript component in the frontend notebook page’s environment, and can interact with the user (via HTML controls), with other widgets (via declared linkages) and with the kernel code (via events).

We could use a model like this which distinguishes between trusted (or semi-trusted) frontend widget code which can do anything it can do in its iframe, but must be either pre-reviewed, or opted into. Frontend widgets that pass review should have well-understood behavior, good documentation, stable interfaces for injecting data, etc.

The frontend widget can and should still be origin-isolated & CSP-restricted for policy enforcement even if code is reviewed — defense in depth is important!

Such widgets could either be invoked from a template or lua module with a fixed data set, or could be connected to untrusted backend code running in an even more restricted sandbox.

The two main ‘more restricted sandbox’ possibilities are to run an interpreter that handles loops safely and applies resource limits, or to run in a worker thread that doesn’t block the main UI and can be terminated after a timeout…. but even that may be able to exhaust system resources via memory allocation.

I think it would be very interesting to extend Jupyter in two specific ways:

  • iframe-sandboxing the widget implementations to make loading foreign-defined widgets safer
  • implementing a client-side kernel that runs JS or Lua code in an interpreter, or JS in a sandboxed Worker, instead of maintaining a server connection to a Python/etc kernel

It might actually be interesting to adopt, or at least learn from, the communication & linkage model for the Jupyter widgets (which is backbone.js-based, I believe) and consider the possibilities for declarative linkage of widgets to create controllable diagrams/visualizations from common parts.

An interpreter-based Jupyter/IPython kernel that works with the notebooks model could be interesting for code examples on Wikipedia, Wikibooks etc. Math potential as well.

Short-term takeaways

  • Interpreters look useful in niche areas, but native JS in iframe+CSP probably main target for interactive things.
  • “Content widgets” imply new abuse vectors & thus review mechanisms. Consider short-term concentration on other areas of use:
    • sandboxing big JS libraries already used in things like Maps/Graphs/TimedMediaHandler that have to handle user-provided input
    • opt-in Gadget/user-script tools that can adapt to a “plugin”-like model
    • making those things invocable cross-wiki, including to third-party sites
  • Start a conversation about content widgets.
    • Consider starting with strict-review-required.
    • Get someone to make the next generation ‘Graphs’ or whatever cool tool as one of these instead of a raw MW extension…?
    • …slowly plan world domination.

Brain dump: x86 emulation in WebAssembly

This is a quick brain dump of my recent musings on feasibility of a WebAssembly-based in-browser emulator for x86 and x86-64 processors… partially in the hopes of freeing up my brain for main project work. ;)

My big side project for some time has been ogv.js, an in-browser video player framework which uses emscripten to cross-compile C libraries for the codecs into JavaScript or, experimentally, the new WebAssembly target. That got me interested in how WebAssembly works at the low level, and how C/C++ programs work, and how we can mishmash them together in ways never intended by gods or humans.

Specifically, I’m thinking it would be fun to make an x86-64 Linux process-level emulator built around a WebAssembly implementation. This would let you load a native Linux executable into a web browser and run it, say, on your iPad. Slowly. :)

System vs process emulation

System emulators provide the functioning of an entire computer system, with emulated software-hardware interfaces: you load up a full kernel-mode operating system image which talks to the emulated hardware. This is what you use for playing old video games, or running an old or experimental operating system. This can require emulating lots of detail behavior of a system, which might be tricky or slow, and programs may not integrate with a surrounding environment well because they live in a tiny computer within a computer.

Process emulators work at the level of a single user-mode process, which means you only have to emulate up to the system call layer. Older Mac users may remember their shiny new Intel Macs running old PowerPC applications through the “Rosetta” emulator for instance. QEMU on Linux can be set up to handle similar cross-arch emulated execution, for testing or to make some cross-compilation scenarios easier.

A process emulator has some attraction because the model is simpler inside the process… If you don’t have to handle interrupts and task switches, you can run more instructions together in a row; elide some state changes; all kinds of fun things. You might not have to implement indirect page tables for memory access. You might even be able to get away with modeling some function calls as function calls, and loops as loops!

WebAssembly instances and Linux processes

There are many similarities, which is no coincidence as WebAssembly is designed to run C/C++ programs similarly to how they work in Linux/Unix or Windows while being shoehornable into a JavaScript virtual machine. :)

An instantiated WebAssembly module has a “linear memory” (a contiguous block of memory addressable via byte indexing), analogous to the address space of a Linux process. You can read and write int and float values of various sizes anywhere you like, and interpretation of bytewise data is up to you.

Like a native process, the module can request more memory from the environment, which will be placed at the end. (“grow_memory” operator somewhat analogous to Linux “brk” syscall, or some usages of “mmap”.) Unlike a native process, usable memory always starts at 0 (so you can dereference a NULL pointer!) and there’s no way to have a “sparse” address space by mapping things to arbitrary locations.

The module can also have “global variables” which live outside this address space — but they cannot be dynamically indexed, so you cannot have arrays or any dynamic structures there. In WebAssembly built via emscripten, globals are used only for some special linking structures because they don’t quite map to any C/C++ construct, but hand-written code can use them freely.

The biggest difference from native processes is that WebAssembly code doesn’t live in the linear memory space. Function definitions have their own linear index space (which can’t be dynamically indexed: references are fixed at compile time), plus there’s a “table” of indirect function references (which can be dynamically indexed into). Function pointers in WebAssembly thus aren’t actually pointers to the instructions in linear memory like on native — they’re indexes into the table of dynamic function references.

Likewise, the call stack and local variables live outside linear memory. (Note that C/C++ code built with emscripten will maintain its own parallel stack in linear memory in order to provide arrays, variables that have pointers taken to them, etc.)

WebAssembly’s actual opcodes are oriented as a stack machine, which is meant to be easy to verify and compile into more efficient register-based code at runtime.

Branching and control flow

In WebAssembly control flow is limited, with one-way branches possible only to a containing block (i.e. breaking out of a loop). Subroutine calls are only to defined functions (either directly by compile-time reference, or indirectly via the function table).

Control flow is probably the hardest thing to make really match up from native code — which lets you jump to any instruction in memory from any other — to compiled WebAssembly.

It’s easy enough to handle craaaazy native branching in an interpreter loop. Pseudocode:

loop {
    instruction = decode_instruction(ip)
    instruction.execute() // update ip and any registers, etc
}

In that case, a JMP or CALL or whatever just updates the instruction pointer when you execute it, and you continue on your merry way from the new position.

But what if we wanted to eke more performance out of it by compiling multiple instructions into a single function? That lets us elide unnecessary state changes (updating instruction pointers, registers, flags, etc when they’re immediately overridden) and may even give opportunity to let the compiler re-optimize things further.

A start is to combine runs of instructions that end in a branch or system call (QEMU calls them “translation units”) into a compiled function, then call those in the loop instead of individual instructions:

loop {
    tu = cached_or_compiled_tu(ip)
    tu.execute() // update registers, ip, etc as we go
}

So instead of decoding and executing an instruction at a time, we’re decoding several instructions, compiling a new function that runs them, and then running that. Nice, if we have to run it multiple times! But…. possibly not worth as much as we want, since a lot of those instruction runs will be really short, and there’ll be function call overhead on every run. And, it seems like it would kill CPU branch prediction and such, by essentially moving all branches to a single place (the tu.execute()).

QEMU goes further in its dynamic translation emulators, modifying the TUs to branch directly to each other in runtime discovery. It’s all very funky and scary looking…

But QEMU’s technique of modifying trampolines in the live code won’t work as we can’t modify running code to insert jump instructions… and even if we could, there are no one-way jumps, and using call instructions risks exploding the call stack on what’s actually a loop (there’s no proper tail call optimization in WebAssembly).

Relooper

What can be done, though, is to compile bigger, better, badder functions.

When emscripten is generating JavaScript or WebAssembly from your C/C++ program’s LLVM intermediate language, it tries to reconstruct high-level control structures within each function from a more limited soup of local branches. These then get re-compiled back into branch soup by the JIT compiler, but efficiently. ;)

The binaryen WebAssembly code gen library provides this “relooper” algorithm too: you pass in blocks of instructions, possible branches, and the conditions around them, and it’ll spit out some nicer branch structure if possible, or an ugly one if not.

I’m pretty sure it should be possible to take a detected loop cycle of separate TUs and create a combined TU that’s been “relooped” in a way that it is more efficient.

BBBBuuuuutttttt all this sounds expensive in terms of setup. Might want to hold off on any compilation until a loop cycle is detected, for instance, and just let the interpreter roll on one-off code.

Modifying runtime code in WebAssembly

Code is not addressable or modifiable within a live module instance; unlike in native code you can’t just write instructions into memory and jump to the pointer.

In fact, you can’t actually add code to a WebAssembly module. So how are we going to add our functions at runtime? There are two tricks:

First, multiple module instances can use the same linear memory buffer.

Second, the tables for indirect function calls can list “foreign” functions, such as JavaScript functions or WebAssembly functions from a totally unrelated module. And those tables are modifiable at runtime (from the JavaScript side of the border).

These can be used to do full-on dynamic linking of libraries, but all we really need is to be able to add a new function that can be indirect-called, which will run the compiled version of some number of instructions (perhaps even looping natively!) and then return back to the main emulator runtime when it reaches a branch it doesn’t contain.
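
From the JavaScript side that looks something like this (WebAssembly.Table’s grow() and set() are the real JS API; the module layout is illustrative):

// A shared table that the emulator core and each freshly compiled
// translation-unit module are instantiated against:
var table = new WebAssembly.Table({ element: 'anyfunc', initial: 256 });

function registerTU(tuInstance) {
    var index = table.length;
    table.grow(1);
    table.set(index, tuInstance.exports.run);
    return index; // emulated code can now call_indirect through this index
}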

Function calls

Since x86 has a nice handy CALL instruction, and doesn’t just rely on convention, it could be possible to model calls to already-cached TUs as indirect function calls, which may perform better than exiting out to the loop and coming back in. But they’d probably need to be guarded for early exit, for several reasons… if we haven’t compiled the entirety of the relooped code path from start to exit of the function, then we have to exit back out. A guard check on IP and early-return should be able to do that in a fairly sane way.

function tu_1234() {
    // loop
    do {
        // calc loop condition -> set zero_flag
        ip = 1235
        if !zero_flag {
            break
        }
        ip = 1236
        // CALL 4567
        tu = cached_or_compiled_tu(4567)
        tu.execute()
        if ip != 1236 {
            // only partway through. back to emulator loop,
            // possibly unwinding a long stack :)
            return
        }
        // more code, updating ip as it goes
    }
}

I think this makes some kind of sense. But if we’re decoding instructions + creating output on the fly, it could take a few iterations through to produce a full compiled set, and exiting a loop early might be … ugly.

It’s possible that all this is a horrible pipe dream, or would perform too badly for JIT compilation anyway.

But it could still be fun for ahead-of-time compilation. ;) Which is complicated… a lot … by the fact that you don’t have the positions of all functions known ahead of time. Plus, if there’s dynamic linking or JIT compilation inside the process, well, none of that’s even present ahead of time.

Prior art: v86

I’ve been looking a lot at v86, a JavaScript-based x86 system emulator. v86 is a straight-up interpreter, with instruction decoding and execution mixed together a bit, but it feels fairly straightforwardly written and easy to follow when I look at things in the code.

v86 uses a set of aliased typed arrays for the system memory, another set for the register file, and then some variables/properties for misc flags and things.

Some quick notes:

  • a register file in an array means accesses at different sizes are easy (al vs ax vs eax), and you can easily index into it from the operand selector bits of the instruction, as opposed to using a variable per register (sketched just after this list)
  • is there overhead from all the object property accesses etc? would it be more efficient to do everything within a big linear memory?
  • as a system emulator there’s some extra overhead to things like protected mode memory accesses (page tables! who knows what!) that could be avoided on a per-process model
  • 64-bit emulation would be hard in JavaScript due to lack of 64-bit integers (argh!)
  • as an interpreter, instruction decode overhead is repeated during loops!
  • to avoid expensive calculations of the flags register bits, most arithmetic operations that would change the flags instead save the inputs for the flag calculations, which get done on demand. This still is often redundant because flags may get immediately rewritten by the next instruction, but is cheaper than actually calculating them.
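
A sketch of that register-file trick (little-endian layout assumed, which matches x86):

var regBuffer = new ArrayBuffer(8 * 4);  // 8 general-purpose 32-bit registers
var reg32 = new Int32Array(regBuffer);   // eax is reg32[0]
var reg16 = new Uint16Array(regBuffer);  // ax  is reg16[0]
var reg8  = new Uint8Array(regBuffer);   // al  is reg8[0], ah is reg8[1]

reg32[0] = 0x12345678; // write eax...
reg16[0];              // 0x5678 (ax)  ...and the aliased views see it
reg8[1];               // 0x34   (ah)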

WebAssembly possibilities

First, since WebAssembly supports only one linear memory buffer at a time, the register file and perhaps some other data would need to live there. Most likely want a layout with the register file and other data at the beginning of memory, with the rest of memory after a fixed point belonging to the emulated process.

Putting all the emulator’s non-RAM state in the beginning means a process emulator can request more memory on demand via Linux ‘brk’ syscall, which would be implemented via the ‘grow_memory’ operator.
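
A hedged sketch of that syscall from the JS side (WebAssembly.Memory’s grow() is real and works in 64 KiB pages; initialBrk stands in for wherever the loaded executable’s data segment ends):

var PAGE = 65536;
var brk = initialBrk; // current end of the emulated process's data segment

function sys_brk(memory, addr) {
    if (addr === 0) {
        return brk; // brk(0) queries the current break
    }
    var shortfall = addr - memory.buffer.byteLength;
    if (shortfall > 0) {
        memory.grow(Math.ceil(shortfall / PAGE)); // extend linear memory
    }
    brk = addr;
    return brk;
}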

64-bit math

WebAssembly supports 64-bit integer memory accesses and arithmetic, unlike JavaScript! The only limitation is that you can’t (yet) export a function that returns or accepts an i64 to or from JavaScript-land. That means if we keep our opcode implementations in WebAssembly functions, they can efficiently handle 64-bit ops.

However WebAssembly’s initial version allows only 32-bit memory addressing. This may not be a huge problem for emulating 64-bit processes that don’t grow that large, though, as long as the executable doesn’t need to be loaded at a specific address (which would mean a sparse address space).

Sparse address spaces could be emulated with indirection into a “real” memory that’s in a sub-4GB space, which would be needed for a system emulator anyway.

Linux details

Statically linked ELF binaries would be easiest to model. More complex to do dynamic linking, need to pass a bundle of files in and do fix-ups etc.

Questions: are executables normally PIC as well as libraries, or do they want a default load address? (Which would break the direct-memory-access model and require some indirection for sparse address space.)

Answer: normally Linux x86_64 executables are not PIE, and want to be loaded at 0x400000 or maybe some other random place. D’oh! But… in the common case, you could simplify that as a single offset.

Syscall is ‘int $0x80’ on 32-bit, or the ‘syscall’ instruction on 64-bit. Syscalls would probably mostly need to be implemented on the JS side, poking at the memory and registers of the emulated process state and then returning.

Network i/o would probably need to be able to block and return to the emulator… so, like a function call bowing out early because an uncompiled branch was taken, we’d potentially need an “early exit” from the middle of a combined TU if it makes a syscall that ends up being async. On the other hand, if a syscall can be done synchronously, it might be nice not to pay that penalty.

Could also need async syscalls for multi-process stuff via web workers… anything that must call back to main thread would need to do async.

For 64-bit, JS code would have to …. painfully … deal with 32-bit half-words. Awesome. ;)

Multiprocessing

WebAssembly’s initial version has no facility for multiple threads accessing the same memory, which means no threads. However this is planned to come in a future version…

Processes with separate address spaces could be implemented by putting each process emulator in a Web Worker, and having them communicate via messages sent to the main thread through syscalls. This forces any syscall that might need global state to be async.

Prior art: Browsix

Browsix provides a POSIX-like environment based around web techs, with processes modeled in Web Workers and syscalls done via async messages. (C/C++ programs can be compiled to work in Browsix with a modified emscripten.) Pretty sweet ideas. :)

I know they’re working on WebAssembly processes as well, and were looking into synchronous syscalls via SharedArrayBuffer/Atomics too, so this might be an interesting area to watch.

Could it be possible to make a Linux binary loader for the Browsix kernel? Maybe!

Would it be possible to make graphical Linux binaries work, with some kind of JS X11 or Wayland server? …mmmmmmaaaaybe? :D

Closing thoughts

This all sounds like tons of fun, but may have no use other than learning a lot about some low-level tech that’s interesting.

ogv.js 1.4.0 released

ogv.js 1.4.0 is now released, with a .zip build or via npm. Will try to push it to Wikimedia next week or so.

Live demo available as always.

New A/V sync

The main improvement is much smoother performance on slower machines, mainly from changing the A/V sync method to prioritize audio smoothness. This is based on recommendations I’d received from video engineers at conferences: choppy audio is noticed by users much more strongly than choppy or out-of-sync video.

Previously, when ogv.js playback detected that video was getting behind audio, it would halt audio until the video caught up. This played all audio, and showed all frames, but could be very choppy if performance wasn’t good (such as in Internet Explorer 11 on an old PC!)

The new sync method instead keeps audio rock-solid, and allows video to get behind a little… if the video catches back up within a few frames, chances are the user won’t even notice. If it stays behind, we look ahead for the next keyframe… when the audio reaches that point, any remaining late frames are dropped. Suddenly we find ourselves back in sync, usually with not a single discontinuity in the audio track.

fastSeek()

The HTMLMediaElement API supports a fastSeek() method which is supposed to seek to the nearest keyframe before the request time, thus getting back to playback faster than a precise seek via setting the currentTime property.

Previously this was stubbed out with a slow precise seek; now it is actually fast. This enables a much better “scrubbing” experience given a suitable control widget, as can be seen in the demo by grabbing the progress thumb and moving it around the bar.
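
A sketch of how a scrubbing control can take advantage (fastSeek() is the real HTMLMediaElement method; the fallback covers players without it):

slider.addEventListener('input', function () {
    var target = (slider.value / slider.max) * player.duration;
    if (typeof player.fastSeek === 'function') {
        player.fastSeek(target); // snaps to the nearest earlier keyframe
    } else {
        player.currentTime = target; // precise, but slower to resume
    }
});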

VP9 playback

WebM videos using the newer, more advanced VP9 codec can use a lot less bandwidth than VP8 or Theora videos, making it attractive for streaming uses. A VP9 decoder is now included for WebM, initially supporting profile 0 only (other profiles may or may not explode) — that means 8-bit, 4:2:0 subsampling.

Other subsampling formats will be supported in future, can probably eventually figure out something to do with 10-bit, but don’t expect those to perform well. :)

The VP9 decoder is moderately slower than the VP8 decoder for equivalent files.

Note that WebM is still slightly experimental; the next version of ogv.js will make further improvements and enable it by default.

WebAssembly

Firefox and Chrome have recently shipped support for code modules in the WebAssembly format, which provides a more efficient binary encoding for cross-compiled code than JavaScript. Experimental wasm versions are now included, but not yet used by default.

Multithreaded video decoding

Safari 10.1 has shipped support for the SharedArrayBuffer and Atomics APIs which allows for fully multithreaded code to be produced from the emscripten cross-compiler.

Experimental multithreaded versions of the VP8 and VP9 decoders are included, which can use up to 4 CPU cores to significantly increase speed on suitably encoded files (using the -slices option in ffmpeg for VP8, or -tile_columns for VP9). This works reasonably well in Safari and Chrome on Mac or Windows desktops; there are performance problems in Firefox due to deoptimization of the multithreaded code.

This actually works in iOS 10.3 as well — however Safari on iOS seems to aggressively limit how much code can be compiled in a single web page, and the multithreading means there’s more code and it’s copied across multiple threads, leading to often much worse behavior as the code can end up running without optimization.

Future versions of WebAssembly should bring multithreading there as well, and likely with better performance characteristics regarding code compilation.

Note that existing WebM transcodes on Wikimedia Commons do not include the suitable options for multithreading, but this will be enabled on future builds.

Misc fixes

Various bits. Check out the readme and stuff. :)

What’s next for ogv.js?

Plans for future include:

  • replace the emscripten’d nestegg demuxer with Brian Parra’s jswebm
  • fix the scaling of non-exact display dimensions on Windows w/ WebGL
  • enable WebM by default
  • use wasm by default when available
  • clean up internal interfaces to…
  • …create official plugin API for demuxers & decoders
  • split the demo harness & control bar to separate packages
  • split the decoder modules out to separate packages
  • Media Source Extensions-alike API for DASH support…

Those’ll take some time to get all done and I’ve got plenty else on my plate, so it’ll probably come in several smaller versions over the next months. :)

I really want to get a plugin interface so people who want/need them and worry less about the licensing than me can make plugins for other codecs! And to make it easier to test Brian Parra’s jsvpx hand-ported VP8 decoder.

An MSE API will be the final ‘holy grail’ piece of the puzzle toward moving Wikimedia Commons’ video playback to adaptive streaming using WebM VP8 and/or VP9, with full native support in most browsers but still working with ogv.js in Safari, IE, and Edge.

Testing in-browser video transcoding with MediaRecorder

A few months ago I made a quick test transcoding video from MP4 (or whatever else the browser can play) into WebM using the in-browser MediaRecorder API.

I’ve updated it to work in Chrome, using a <canvas> element as an intermediary recording surface as captureStream() isn’t available on <video> elements yet there.

Live demo: https://brionv.com/misc/browser-transcode-test/capture.html
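
The pipeline, roughly (a hedged sketch; the element ids and the 30 fps capture rate are my own choices):

var video = document.getElementById('source-video');
var canvas = document.getElementById('capture-surface');
var ctx = canvas.getContext('2d');

// Copy each frame to the canvas, since captureStream() isn't
// available on <video> elements in Chrome yet.
function drawFrame() {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    if (!video.ended) {
        requestAnimationFrame(drawFrame);
    }
}

var recorder = new MediaRecorder(canvas.captureStream(30), {
    mimeType: 'video/webm'
});
var chunks = [];
recorder.ondataavailable = function (e) {
    chunks.push(e.data);
};
recorder.onstop = function () {
    var blob = new Blob(chunks, { type: 'video/webm' });
    // blob now holds the re-encoded WebM output
};

video.play();
drawFrame();
recorder.start();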

There are a couple advantages of re-encoding a file this way versus trying to do all the encoding in JavaScript, but also some disadvantages…

Pros

  • actual encoding should use much less CPU than JavaScript cross-compile
  • less code to maintain!
  • don’t have to jump through hoops to get at raw video or audio data

Cons

  • MediaRecorder is realtime-oriented:
    • will never decode or encode faster than realtime
    • if encoding is slower than realtime, lots of frames are dropped
    • on my MacBook Pro, realtime encoding tops out around 720p30, but eg phone camera videos will often be 1080p30 these days.
  • browser must actually support WebM encoding or it won’t work (eg, won’t work in Edge unless they add it in future, and no support at all in Safari)
  • Firefox and Chrome both seem to be missing Vorbis audio recording needed for base-level WebM (but do let you mix Opus with VP8, which works…)

So to get frame-rate-accurate transcoding, and to support higher resolutions, it may be necessary to jump through further hoops and try JS encoding.

I know this can be done — there are some projects compiling the entire ffmpeg package in emscripten and wrapping it in a converter tool — but we’d have to avoid shipping an H.264 or AAC decoder for patent reasons.

So we’d have to draw the source <video> to a <canvas>, pull the RGB bits out, convert to YUV, and run through lower-level encoding and muxing… oh did I forget to mention audio? Audio data can be pulled via Web Audio, but only in realtime.

So it may be necessary to do separate audio (realtime) and video (non-realtime) capture/encode passes, then combine into a muxed stream.

Canvas, Web Audio, MediaStream oh my!

I’ve often wished that for ogv.js I could send my raw video and audio output directly to a “real” <video> element for rendering instead of drawing on a <canvas> and playing sound separately to a Web Audio context.

In particular, things I want:

  • Not having to convert YUV to RGB myself
  • Not having to replicate the behavior of a <video> element’s sizing!
  • The warm fuzzy feeling of semantic correctness
  • Making use of browser extensions like control buttons for an active video element
  • Being able to use browser extensions like sending output to ChromeCast or AirPlay
  • Disabling screen dimming/lock during playback

This last is especially important for videos of non-trivial length, especially on phones which often have very aggressive screen dimming timeouts.

Well, in some browsers (Chrome and Firefox) now you can do at least some of this. :)

I’ve done a quick experiment using the <canvas> element’s captureStream() method to capture the video output — plus a capture node on the Web Audio graph — combining the two separate streams into a single MediaStream, and then piping that into a <video> for playback. Still have to do YUV to RGB conversion myself, but final output goes into an honest-to-gosh <video> element.
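
The wiring looks roughly like this (a hedged sketch; audioSource stands in for wherever the decoded audio is fed into the Web Audio graph):

var videoStream = canvas.captureStream(30); // canvas shows the decoded frames

var audioContext = new AudioContext();
var streamDest = audioContext.createMediaStreamDestination();
audioSource.connect(streamDest); // route decoded audio into a capturable node

// Merge the two single-track streams into one...
var combined = new MediaStream([
    videoStream.getVideoTracks()[0],
    streamDest.stream.getAudioTracks()[0]
]);

// ...and hand it to an honest-to-gosh <video> element.
videoElement.srcObject = combined;
videoElement.play();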

To my great pleasure it works! Though in Firefox I have some flickering that may be a bug, I’ll have to track it down.

Some issues:

  • Flickering on Firefox. Might just be my GPU, might be something else.
  • The <video> doesn’t have insight to things like duration, seeking, etc, so can’t rely on native controls or API of the <video> alone acting like a native <video> with a file source.
  • Pretty sure there are inefficiencies. Have not tested performance or checked if there’s double YUV->RGB->YUV->RGB going on.

Of course, Chrome and Firefox are the browsers I don’t need ogv.js for in Wikipedia’s current usage, since they play WebM and Ogg natively already. But if Safari and Edge adopt the necessary interfaces and WebRTC-related infrastructure for MediaStreams, it might become possible to use Safari’s full screen view, AirPlay mirroring, and picture-in-picture with ogv.js-driven playback of Ogg, WebM, and potentially other custom or legacy or niche formats.

Unfortunately I can’t test whether casting to a ChromeCast works in Chrome as I’m traveling and don’t have one handy just now. Hoping to find out soon! :D

JavaScript async/await fiddling

I’ve been fiddling with using ECMAScript 2015 (“ES6”) in rewriting some internals for ogv.js, both in order to make use of the Promise pattern for asynchronous code (to reduce “callback hell”) and to get cleaner-looking code with the newer class definitions, arrow functions, etc.

To do that, I’ll need to use babel to convert the code to the older ES5 version to run in older browsers like Internet Explorer and old Safari releases… so why not go one step farther and use new language features like asynchronous functions that are pretty solidly specced but still being implemented natively?

Not yet 100% sure; I like the slightly cleaner code I can get, but we’ll see how it functions once translated…

Here’s an example of an in-progress function from my buffering HTTP streaming abstraction, currently being rewritten to use Promises and support a more flexible internal API that’ll be friendlier to the demuxers and seek operations.

I have three versions of the function: one using provisional ES2017 async/await, one using ES2015 Promises directly, and one written in ES5 assuming a polyfill of ES2015’s Promise class. See the full files or the highlights of ES2017 vs ES2015:

The first big difference is that we don’t have to start with the “new Promise((resolve,reject) => {…})” wrapper. Declaring the function as async is enough.

Then we do some synchronous setup which is the same:

Now things get different, as we perform one or two asynchronous sub-operations:

In my opinion the async/await code is cleaner:

First it doesn’t have as much extra “line noise” from parentheses and arrows.

Second, I can use a try/finally block to do the final state change only once instead of on both .then() and .catch(). Many promise libraries will provide an .always() or something but it’s not standard.

Third, I don’t have to mentally think about what the “intermittent return” means in the .then() handler after the triggerDownload call:

Here, returning a promise means that that function gets executed before moving on to the next .then() handler and resolving the outer promise, whereas not returning anything means immediate resolution of the outer promise. It ain’t clear to me without thinking about it every time I see it…

Whereas the async/await version:

makes it clear with the “await” keyword what’s going on.
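
Since the embedded code doesn’t survive here, a hedged illustration of the shape of the difference (triggerDownload is from the real code; the rest of the names are placeholders):

// ES2015 Promises:
function bufferTo(offset) {
    return new Promise((resolve, reject) => {
        this.seekTarget = offset;
        this.buffering = true;
        triggerDownload().then(() => {
            if (this.bytesBuffered < offset) {
                return triggerDownload(); // the "intermittent return"
            }
        }).then(() => {
            this.buffering = false;
            resolve();
        }).catch((err) => {
            this.buffering = false; // cleanup duplicated on the error path
            reject(err);
        });
    });
}

// ES2017 async/await:
async function bufferTo(offset) {
    this.seekTarget = offset;
    this.buffering = true;
    try {
        await triggerDownload();
        if (this.bytesBuffered < offset) {
            await triggerDownload();
        }
    } finally {
        this.buffering = false; // cleanup happens exactly once
    }
}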

Updated: I managed to get babel up and running; here’s a github gist with expanded versions after translation to ES5. The ES5 original is unchanged; the ES2015 promise version is very slightly more verbose, and the ES2017 version becomes a state machine monstrosity. ;) Not sure if this is ideal, but it should preserve the semantics desired.