emscripten fun: porting libffi to WebAssembly part 1

I have a silly dream of seeing graphical Linux/FOSS programs running portably on any browser-based system outside of app store constraints. Two of my weekend side projects are working in this direction: an x86 emulator core to load and run ELF binaries, and better emscripten cross-compilation support for the GTK+ stack.

Emulation ideas

An x86 emulator written in WebAssembly could run pre-built Linux binaries, meaning in theory you could make an automated packager for anything in a popular software repository.

But even after all the hard work of making a process-level emulator work and hooking up the Linux-level libraries to emulated devices for i/o and rendering, there are some big performance implications, and you’re probably also bundling lots of library code you don’t need at runtime.

Instruction decoding and dispatch will be slow, much slower than native. And it looks pretty complex to do JIT-ing of traces. While I think it could be made to work in principle, I don’t think it’ll ever give a satisfactory user experience.

Cross-compilation

Since we’ve got the source of Linux/FOSS programs by definition, cross-compiling them directly to WebAssembly will give far better performance!

In theory even something from the GNOME stack would work, given an emscripten-specific gdk backend rendering to a WebGL canvas just like games use SDL2 or similar to wrap i/o.

But long before we can get to that, there are low-level library dependencies.

Let’s start with glib, which implements a bunch of runtime functions and the GObject type system, used throughout the stack.

Glib needs libffi, a library for calling functions whose signatures are only known at runtime, and for creating closure functions that wrap a state variable around a callback.

In other words, libffi needs to do things that you cannot do in standard C, because it needs system-specific information about how function arguments and return values are transferred (part of the ABI, application binary interface). And to top it off, in many cases (emscripten included) you still can’t do it in C, because asm.js and WebAssembly provide no way to make a call with an arbitrary argument list. So, like a new binary platform, libffi must be ported…

It seems to be doable by bumping up to JavaScript, where you can construct an array of mixed-type arguments and use Function.prototype.apply to call the target function. Using an EM_ASM_ block in my shiny new wasm32/ffi.c I was able to write a JavaScript implementation of the guts of ffi_call which works for int, float, and double parameters (still have to implement 64-bit ints and structs).
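
The guts of it look something like this sketch, which is a rough illustration of the idea rather than the actual ffi.c code. getValue() and the dynCall_* functions are emscripten runtime pieces; the simplified types array and the ffiCallSketch name here stand in for the real ffi_cif machinery.

// Rough sketch of the ffi_call idea, not the actual ffi.c code: pull each
// argument out of the heap based on its ffi type, build the signature string,
// then call through emscripten's dynCall helpers with Function.prototype.apply.
function ffiCallSketch(fnPtr, returnType, types, argPtrs) {
    var sig = (returnType === 'void') ? 'v' : 'i';
    var args = [fnPtr];
    for (var i = 0; i < types.length; i++) {
        if (types[i] === 'float') {
            sig += 'f';
            args.push(getValue(argPtrs[i], 'float'));
        } else if (types[i] === 'double') {
            sig += 'd';
            args.push(getValue(argPtrs[i], 'double'));
        } else {  // int and pointer types
            sig += 'i';
            args.push(getValue(argPtrs[i], 'i32'));
        }
    }
    return Module['dynCall_' + sig].apply(null, args);
}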

The second part of libffi is the closure creation API, which I think will require creating a closure function on the JavaScript side, inserting it into the module’s function tables, and then returning the appropriate index as its address. This should be doable, but I haven’t started yet.

Emscripten function pointers

There are two output targets for the emscripten compiler: asm.js JavaScript and WebAssembly. They have similar capabilities and the wrapper JS is much the same in both, but there are some differences in implementation and internals as well as the code format.

One is in function tables for indirect calls. In both cases, the low-level asm.js/WASM code can’t store the actual pointer address of a function, so they use an index into a table of functions. Any function whose address is taken at compile time is added to the table, and its index used as the address. Then, when an indirect call through a “function pointer” is made, the pointer is used as the index into the function table, and an actual function call is made on it. Voila!
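
In JavaScript terms the scheme is basically this (a toy illustration, not actual emscripten output):

// Toy illustration of the function-table scheme, not actual emscripten output.
function nullFunc() { throw new Error('invalid function pointer'); }
function handleClick() { console.log('clicked'); }
function computeSum(a, b) { return a + b; }

var FUNCTION_TABLE = [nullFunc, handleClick, computeSum];

var fnPtr = 2;                              // the "address" of computeSum is its index
console.log(FUNCTION_TABLE[fnPtr](1, 2));   // indirect call through the "pointer": 3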

In asm.js, there are lots of Weird Things done about type consistency to make the JavaScript compilers fast. One is that the JS compiler gets confused if you have an array of functions that _contain different signatures_, making indirect calls run through slower paths in final compiled code. So for each distinct function signature (“returns void” or “returns int, called with float32” etc) there was a separate array. This also means that function pointers have meaning only with their exact function signature — if you take a pointer to a function with a parameter and call it without one, it could end up calling an entirely different function at runtime because that index has a different meaning in that signature!

In WebAssembly, this is handled differently. Signatures are encoded at the call site in the call_indirect opcode, so no type inference needs to be done.

But.

At least currently, the asm.js-style table separation is still being used, with the multiple tables encoded into the single WebAssembly table with a compile-time-known constant offset.

In both cases, the JS side can do indirect calls by calculating the signature string (“vf” for “void return, float32 arg” etc) and calling the appropriate “dynCall_vf” etc method, passing the pointer first and then the rest of the argument list. On asm.js this looks up the function in the per-signature table directly; on WASM it adds the signature’s constant offset and indexes into the single table.
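
For instance, a JS-side call through a void(float32) “pointer” looks roughly like this; the fnPtr value here is made up for illustration, while the dynCall helpers are the real emscripten runtime pattern:

// Roughly what the JS side of an indirect call looks like; fnPtr would be a
// "function pointer" value handed over from the compiled C side (made up here).
var fnPtr = 42;

// Per-signature helper exported by the emscripten runtime:
Module['dynCall_vf'](fnPtr, 3.14);

// Or the generic helper, building the signature string at runtime:
dynCall('vf', fnPtr, [3.14]);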

(etc)

It’s possible that emscripten will change the WASM mode to use a single array without the constant offset indirection. This will simplify lookups, and I think make it easier to add more functions at runtime.

Because you see, if you want to add a callback at runtime, like libffi’s closure API wants to, then you need to add another entry to that table. In asm.js the table sizes are fixed by the asm.js validation rules, and in the current WASM mode the sub-tables are definitely fixed at compile time, since those constant offsets are used throughout.

So currently there’s a build-time option to reserve room for runtime function pointers. I think I’ll have to use it, but it only reserves *fixed space* for a given number of pointers.
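
If I’m remembering the knobs right, that’s the RESERVED_FUNCTION_POINTERS setting at build time plus the addFunction/removeFunction runtime helpers; something like this sketch, where the ‘register_callback’ C entry point is made up for illustration:

// Sketch of runtime callback registration, assuming -s RESERVED_FUNCTION_POINTERS=...
// was set at build time. addFunction/removeFunction are emscripten runtime helpers;
// 'register_callback' is a made-up C entry point for illustration.
var callback = function (x) {
    console.log('called back from compiled code with', x);
};

var fnPtr = addFunction(callback);           // takes one of the reserved table slots
ccall('register_callback', null, ['number'], [fnPtr]);

// ...later, when the callback is no longer needed, free the slot again:
removeFunction(fnPtr);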

Next

Coming up next time: int64 and struct in the emscripten ABI, and does the closure API work as expected?

String concatenation garbage collection madness!

We got a report of a bug with the new 3D model viewing extension on Wikimedia Commons, where a particular file wasn’t rendering a thumbnail due to an out-of-memory condition. The file was kind-of big (73 MiB) but not super huge, and should have been well within the memory limits of the renderer process.

On investigation, it turned out to be a problem with how three.js’s STLLoader class was parsing the ASCII variant of the file format:

  • First, the file is loaded as a binary ArrayBuffer
  • Then, the buffer is checked to see whether it contains binary or text-format data
  • If it’s text, the entire buffer is converted to a string for further processing

That conversion step had code that looked roughly like this:

var str = '';
for (var i = 0; i < arr.length; i++) {
    str += String.fromCharCode(arr[i]);
}
return str;

Pretty straightforward code, right? It appends one character to the string until the input binary array is exhausted, then returns the result.

Well, JavaScript strings are actually immutable — the “+=” operator is just shorthand for “str = str + …”. This means that on every step through the loop, we create two new strings: one for the character, and a second for the concatenation of the previous string with the new character.

The JavaScript virtual machine’s automatic garbage collection is supposed to magically de-allocate the intermediate strings once they’re no longer referenced (at some point after the next run through the loop) but for some reason this isn’t happening in Node.js. So when we run through this loop 70-some million times, we get a LOT of intermediate strings still in memory and eventually the VM just dies.

Remember this is before any of the 3d processing — we’re just copying bytes from a binary array to a string, and killing the VM with that. What!?

Newer versions of the STLLoader use a more efficient path through the browser’s TextDecoder API, which we can polyfill in node using Buffer, making it blazing fast and memory-efficient… this seems to fix the thumbnailing for this file in my local testing.
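
The shape of it is something like this; the Buffer branch is my rough sketch of the polyfill idea, not the exact code that shipped:

// Decode the whole buffer in one shot instead of growing a string byte by byte.
// The Buffer branch is a rough sketch of the Node polyfill idea, not the shipped code.
function bufferToString(arrayBuffer) {
    var bytes = new Uint8Array(arrayBuffer);
    if (typeof TextDecoder !== 'undefined') {
        return new TextDecoder('utf-8').decode(bytes);   // browsers, newer Node
    }
    return Buffer.from(bytes).toString('utf8');          // Node fallback
}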

Just for fun though I thought, what would it take to get it working in Node or Chrome without the fancy native API helpers? Turns out you can significantly reduce the memory usage of this conversion just by switching the order of operations….

The original append code results in operations like: (((a + b) + c) + d) which increases the size of the left operand linearly as we go along.

If we instead do it like ((a + b) + (c + d)) we’ll increase _both_ sides more slowly, leading to much smaller intermediate strings clogging up the heap.

Something like this, with a sort of binary bisection thingy:

function do_clever(arr, start, end) {
    if (start === end) {
        return '';
    } else if (start + 1 === end) {
        return String.fromCharCode(arr[start]);
    } else {
        var mid = start + Math.floor((end - start) / 2);
        return do_clever(arr, start, mid) +
               do_clever(arr, mid, end);
    }
}

return do_clever(arr, 0, arr.length);

Unlike the naive linear append, this version actually makes it through the 73 MiB file in Node, and it’s a bit faster too.

But it turns out there’s not much reason to use that code — most browsers have native TextDecoder (even faster) and Node can fake it with another native API, and those that don’t are Edge and IE, which have a special optimization for appending to strings.

Yes that’s right, Edge 16 and IE 11 actually handle the linear append case significantly faster than the clever version! It’s still not _fast_, with a noticeable delay of a couple seconds on IE especially, but it works.

So once the thumbnail fix goes live, that file should work both in the Node thumbnailer service *and* in browsers with native TextDecoder *and* in Edge and IE 11. Yay!

ogv.js 1.5.7 released

I’ve released ogv.js 1.5.7 with performance boosts for VP8/VP9 video (especially on IE 11) and Opus audio decoding, and a fix for audio/webm seeking.

npm: https://www.npmjs.com/package/ogv
zip: https://github.com/brion/ogv.js/releases/tag/1.5.7

emscripten versus IE 11: arithmetic optimization for ogv.js

ogv.js is a web video & audio playback engine for supporting the free & open Ogg and WebM formats in browsers that don’t support them natively, such as Safari, Edge, and Internet Explorer. We use it at Wikipedia and Wikimedia Commons, where we don’t currently allow the more common MP4 family of file formats due to patent concerns.

IE 11, that old nemesis, still isn’t quite gone, and it’s definitely the hardest to support. It’s several years old now, with all new improvements going only into Edge on Windows 10… so no WebAssembly, no asm.js optimizations, and in general it’s just kind of ….. vveerryy ssllooww compared to any more current browser.

But for ogv.js I still want to support it as much as possible. I found that for WebM videos using the VP8 or VP9 codecs, there was a *huge* slowdown in IE compared to other browsers, and wanted to see if I could pick off some low-hanging fruit to at least reduce the gap a bit and improve playback for devices right on the edge of running smoothly at low resolutions…

Profiling in IE is a bit tough since the dev tools often skew JS performance in weird directions… but they consistently showed that a large bottleneck was the Math.imul polyfill.

Math.imul, on supporting browsers, is a native function that implements 32-bit integer multiplication correctly and very very quickly, including all the weird overflow conditions that can result from multiplying large numbers — this is used in the asm.js code produced by the emscripten compiler to make sure that multiplication is both fast and correct.

But on IE 11 it’s not present, so a replacement function (“polyfill”) is used by emscripten instead. This does several bit shifts, a couple of multiplications, blah blah details; even when the JIT compiler inlines the function, it’s slower than necessary.
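
To make the gap concrete, here’s the gist of both sides (my own reduction of the standard polyfill shape, not emscripten’s exact code):

// The standard Math.imul polyfill shape: split into 16-bit halves so the partial
// products stay exact, then recombine with 32-bit overflow semantics.
function imulPolyfill(a, b) {
    var ah = (a >>> 16) & 0xffff, al = a & 0xffff;
    var bh = (b >>> 16) & 0xffff, bl = b & 0xffff;
    return (al * bl + (((ah * bl + al * bh) << 16) >>> 0)) | 0;
}

// The direct replacement is only safe when the products stay well under 2^53,
// so the plain double multiply is still exact before truncating to 32 bits:
//     Math.imul(a, b)  ->  ((a * b) | 0)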

I hacked together a quick test to search the generated asm.js code for calls to the minified reference to Math.imul, and replace them with direct multiplication… and found significant performance improvements!

I also found that the regex version broke some of the multiplications by getting the order of operations wrong, so I replaced it with a corrected transformation: instead of running a regex over the code, it uses a proper JS parser, walks the tree for call sites, and replaces them with direct multiplication. After some more confusion with my benchmarking, I confirmed that the updated code was still faster.
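
The order-of-operations bug is easy to see with a call whose arguments are expressions; a blind textual substitution drops the grouping that the function-call syntax was providing:

// Why a blind textual substitution breaks: the call syntax was providing grouping.
var x = 3, y = 5;

console.log(Math.imul(x + 1, y));    // 20
console.log(x + 1 * y | 0);          // 8  -- careless replacement, precedence changes the result
console.log(((x + 1) * y) | 0);      // 20 -- roughly what the parser-based rewrite emits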

The parser-based version gives about a 15-20% improvement, plus or minus, which seems a pretty significant bump!

Of course more modern browsers like current versions of Safari and Edge will use the WebAssembly version of ogv.js anyway, and are several times faster…

 

Daydream View phone VR headset

I somehow ended up with a $100 credit at the Google Store, and decided to splurge on something I didn’t need but wanted — the Daydream View phone VR headset.

This is basically one step up from the old Google Cardboard viewer where you put your phone in a lens/goggles getup, but it actually straps to your head so you don’t have to hold it to your face manually. It’s also got a wireless controller instead of just a single clicky button, so you get a basic pointer within the VR world with selection & home buttons and even a little touchpad, allowing a fair range of controls within VR apps.

Using the controller, the Daydream launcher interface also provides an in-VR UI for browsing the Google Play Store and purchasing new games/apps. You can even broadcast a 2d view to a Chromecast device to share your fun with the rest of the room, though there’s no apparent way to save a recording.

The hardware is still pretty limiting compared to what the PC VR headsets like the Rift and Vive can do — there’s no positional tracking for either head tracking or the controller, so you’re limited to rotation. This means things in the VR world don’t move properly as you move your head around, and you can only use the controller to point, not to pick up VR objects in 3d space. And it’s still running on a phone (Pixel 2 in my case) so you’re not going to get the richest graphics — most apps I’ve tried so far have limited, cartoony graphics.

Still it’s pretty fun to play with; I’ll probably play through a few of these games and then put it away mostly forever, unless I figure out some kind of Wikipedia-related VR project to work on on the side. :)

Old Xeon E5520 versus PS3 emulator RPCS3

I’ve been fixing up my old Dell Precision T5500 workstation, which had been repurposed to run (slightly older) games, to run both work stuff & more current games if possible. Although the per-core performance of a 2009-era Xeon E5520 processor is not great, with 8 cores / 16 threads total (dual CPUs!) it still packs a punch on multithreaded workloads compared to a laptop.

When I stumbled on the RPCS3 Playstation 3 emulator, I just had to try it too and see what happened… especially since I’ve been jonesing for a Katamari fix since getting rid of my old PS3 a couple years ago!

The result is surprisingly decent graphics, but badly garbled audio.

My current theory, based on reading a bunch in their support forums, is that per-thread performance is too low to keep up with the thread doing audio processing, so the audio garbles/stutters at short intervals. Between the low clock speed (2.26 GHz / 2.5 GHz boost), the older processor tech, AND the emulation overhead, one Xeon core is probably not going to be as fast as one PS3 Cell SPE unit, so if that thread was already running close to full load on the PS3, it’ll be too slow on my PC…

Windows’ Task Manager shows the work spread over 8-9 logical processors, but none fully utilized. Threads that are woken and put to sleep frequently, like audio or video processing, tend to get broken up on this kind of graph (switching processors on each wake), so you can’t easily tell from the graph whether a single OS thread is maxed out.

This all leads me to believe the emulator’s inherently CPU-bound here, and really would do better with a more modern 4-core or 6-core CPU in the 3ish-to-4ish GHz range. I’ll try some of the other games I have discs still lying around for just for fun, but I expect similar results.

This is probably something to shelve until I’ve got a more modern PC, which probably means a big investment (either new CPU+RAM+motherboard or just a whole new PC) so no rush. :)

2009 Workstation recovery – libvpx benchmark

I’m slowly upgrading my old 2009-era workstation into the modern world. It’s kind of a fun project! Although the CPUs are a bit long in the tooth, with 8 cores total it still can run some tasks faster than my more modern MacBook Pro with just 2 cores.

Enjoy some notes from benchmarking VP9 video encoding with libvpx and ffmpeg on the workstation versus the laptop…

(end)

Light field rendering: a quick diagram

Been reading about light field rendering for free movement around 3d scenes captured as images. Made some diagrams to help it make sense to me, as one does…
 
There’s some work going on to add viewing of 3d models to Wikipedia & Wikimedia Commons — which is pretty rad! — but geometric meshes are hard to home-scan, and don’t represent texture and lighting well, whereas light fields capture this stuff fantastically. Might be interesting some day to play with a light field scene viewer that provides parallax as you rotate your phone, or provides a 3d see-through window in VR views. The real question is whether the _scanning_ of scenes can be ‘democratized’ using a phone or tablet as a moving camera, combined with the spatial tracking that’s used for AR games to position the captures in 3d space…
 
Someday! No time for more than idle research right now. ;)
Here, enjoy the diagrams.

If you get grounded in virtual reality, you get grounded in real life

I keep researching the new VR stuff and thinking “this could be fun, but it’s still too expensive and impractical”. Finally took a quick jaunt to the Portland Microsoft store this morning hoping for a demo of the HTC Vive VR headset to see how it feels in practice.

Turns out they don’t have the Vive set up for demos right now because it requires a lot of setup and space for the room-scale tracking, so instead I did a quick demo with the Oculus Rift, which doesn’t require as much space set aside.

First impressions:
* They make you sign a waiver in case of motion sickness…
* Might have been able to fit over my glasses, but I opted to just go without (my uncorrected blurry vision is still slightly better than the resolution of the Rift anyway)
* Turned it on – hey that’s pretty cool!
* Screen door effect from the pixel grid is very visible at first but kinda goes away.
* Limited resolution is also noticeable but not that big a deal for the demo content. I imagine this is a bigger problem for user interfaces with text though.
* Head tracking is totally smooth and feels natural – just look around!
* Demos where I sit still and either watch something or interact with the controllers were great.
* Complete inability to see the real world gave a feeling of helplessness when I had to, say, put on the controllers…
* Once controllers in hand, visualization of hand/controller helped a lot.
* Shooting gallery demo was very natural with the rift controllers.
* Mine car roller coaster demo instantly made me nauseated; I couldn’t have taken more than a couple minutes of that.

For FPS-style games and similar immersion, motion without causing motion sickness is going to be the biggest problem — compared to a fixed screen, the brain in VR is much more sensitive to mismatches between visual cues and your inner ear’s accelerometer…

I think I’m going to wait on the PC VR end for now; it’s a young space, the early sets are expensive, and I need to invest in a new more powerful PC anyway. Microsoft is working on some awesome “mixed reality” integration for Windows 10, which could be interesting to watch but the hardware and software are still in flux. Apple is just starting to get into it, mostly concentrating (so far as we know) on AR views on iPhones and iPads, but that could become something else some day.

Google’s Daydream VR platform for Android is interesting, but I need a newer phone for that — and again I probably should wait for the Pixel 2 later this year rather than buy last year’s model just to play with VR.

So for the meantime, I ordered a $15 Google Cardboard viewer that’ll run some VR apps on my current phone, as long as I physically hold it up to my face. That should tide me over with some games and demos, and gives me the chance to experiment with making my own 3D scenes via either Unity (building to a native Android app) or something like BabylonJS (running in Chrome with WebGL/WebVR support).