Overflowing stacks in WebAssembly, whoops!

Native C-like programming environments tend to divide memory into several regions:

  • static data contains predefined constants loaded from the binary, and space for global variables (“data”)
  • your actual code has to go in memory too! (“text”)
  • space for dynamic memory allocation (“heap”), which may grow “up” as more memory is needed
  • space for temporary data and return addresses for function calls (“stack”), which may grow “down” as more memory is needed

Usually this is laid out with the stack high in the address space and the heap lower down, if I recall correctly? More heap is allocated on demand via malloc, and the stack can either grow, or warn of an overflow, by using the CPU’s memory manager to detect accesses to pages just beyond the current edge of the stack.
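
As a quick refresher, here’s which region each kind of data lands in for a plain C program (nothing WebAssembly-specific yet):

#include <stdlib.h>

int counter = 0;                  /* static data: a global variable */
const char greeting[] = "hello";  /* static data: a predefined constant */

void example(void) {              /* the compiled code itself lives in "text" */
    char scratch[64];             /* stack: lives only for the duration of the call */
    char *buffer = malloc(1024);  /* heap: allocated on demand, released explicitly */
    scratch[0] = greeting[0];
    buffer[0] = scratch[0];
    counter++;
    free(buffer);
}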

In emscripten’s WebAssembly porting environment, things are similar but a little different:

  • code doesn’t live in linear memory, so functions don’t have memory addresses
  • return addresses and small local variables also live outside linear memory, so only arrays/structs and variables that have their address taken need to go on the stack
  • usable memory is contiguous; you can’t have a sparse address space where the stack and heap can both grow

As a result, the stack is a fixed size and there’s some fragility, but the stack usually needs less space than it would natively.

Currently the memory layout starts with static data, follows with the stack, and then the heap. The stack grows “down”, meaning that when you run off the end of the stack you end up in static data territory and can overwrite global variables. This leads to weird, hard-to-detect errors.
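
Here’s a contrived sketch of the failure mode (the sizes are made up for illustration; the real default stack is much larger):

#include <string.h>

int important_global = 42;   /* sits near the top of static data, just below the stack */

int recurse(int depth) {
    char buffer[4096];       /* a sizeable stack allocation in every frame */
    memset(buffer, depth, sizeof buffer);
    if (depth > 0) {
        return recurse(depth - 1) + buffer[0];   /* not a tail call, so frames pile up */
    }
    return buffer[0];
}

/* With, say, a 64 KiB stack, calling recurse(32) walks right off the end of the
   stack and scribbles over important_global and its neighbors instead of trapping. */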

When making debug builds with assertions there are “cookie” checks at various times to ensure that specific locations at the edge of the stack haven’t been overwritten, but this doesn’t always catch things if you only overwrote the beginning of a buffer in static data and not the part that had the cookie. :) It also doesn’t seem to trigger in library workflows where you’re not running through the emscripten HTML5/SDL/WebGL runtime.

There’s currently a PR open to reduce the default stack size from 5 MiB to 0.5 MiB, which significantly reduces the amount of memory needed for small modules, and we’re chatting a bit about how to detect errors in case the change causes regressions in existing codebases…
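
If a particular codebase does need a bigger stack than whatever the default ends up being, it can still ask for one explicitly at link time — if I’m remembering the setting name right, something like:

emcc -O2 -s TOTAL_STACK=1048576 -o module.html module.c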

One thing that’s come up is the possibility of moving the stack to before static data, so you’d have: stack, static data, heap.

This has two consequences:

  • any memory access beyond the stack end will wrap around 0 into unallocated memory at the top of the address space, causing an immediate trap — this is done by the memory manager for free, with no false positives or negatives
  • literal references to addresses in static data will be larger numbers, thus may take more bytes to encode in the binary (variable-length encoding is used in WebAssembly for constants)

Probably the safety and debugging win would be a bigger benefit than the size savings of the current layout, though the smaller constants could still be a win for size-conscious optimized builds when not debugging.

Rust error handling with Result and Option (WebAssembly ABI)

In our last adventure we looked at C++ exceptions in WebAssembly with the emscripten compiler. Now we’re taking a look at the main error handling system for another language targeting WebAssembly, Rust.

Rust has a “panic”/”unwind” system similar to C++ exceptions, but catching panics is generally recommended against. It also currently doesn’t work in WebAssembly output, where Rust doesn’t use emscripten’s tricks of calling out to JavaScript and is instead waiting for proper exception handling integration to arrive in Wasm.

Instead, most user-level Rust code expresses fallible operations using the Result or Option enum types. An Option<T> can hold either some value of type T — Some(T) — or no value — None. A Result<T, E> is fancier and can hold either a data payload Ok(T) or an Err(E) holding some information about the error condition. The E type can be a string or an enum carrying a description of the error, or it can just be the empty () unit type if you’re not picky.

Rust’s “?” operator is used as a shortcut for checking and propagating these error values.

For instance, for a function that stores to memory (which might fail if the address is invalid or non-writable), you could write:

    // Looking up a memory page for writing could fail,
    // if the page is unmapped or read-only.
    pub fn page_for_write(&mut self, addr: GuestPtr)
    -> Result<&mut PageData, Exception> {
       // ... returns either Ok(something)
       // or Err(Exception::PageFault)
    }

    pub fn store_8(&mut self, addr: GuestPtr, val: u8)
    -> Result<(), Exception> {

        // First some address calculations which cannot fail.
        let index = addr_to_index(addr);
        let page = addr_to_page(addr);

        let mem = self.page_for_write(page)?;

        // The "?" operator checked for an Err(_) condition
        // and if so, propagated it to the caller. If it was
        // Ok(_) then we continue on with that value.
        mem.store_8(index, val);

        Ok(())
    }

This looks reasonably clean in the source — you have some ?s sprinkled about, but they indicate that a branch may occur, which helps when reasoning about performance. This makes things more predictable: the addition of a destructor in C++ might make a fast path suddenly slow with emscripten’s C++/JavaScript hybrid exceptions, while the worst that happens here is a visible check of the return value, which is not so bad, right?

Well, mostly. The beautiful yet horrible thing about Rust is that automatic optimization lets you create these wonderful, clean conceptual APIs that compile down to fast code. The horrible part is that you sometimes don’t know how or why it’s doing what it does, and there can still be surprises!

The good news is that there are never any JavaScript call-outs in your hot paths, unlike emscripten’s C++ exception catching. Yay! But the bad news is that in WebAssembly, returning a structure like a Result<T,E> or an Option<T> that doesn’t resolve into a single word can be complicated. And knowing what does resolve into a single word is complicated. And sometimes things get packed into integers and I don’t even know what’s doing that.

Crash course on enum layout

Rust “enums” can be much more fancy than C/C++ “enums”, with the ability not only to carry a discriminator of which enumerated value they carry, but to carry structure-like data payloads within them.

For Option<T>, this means you have a 1-bit discriminant between Some(_) and None, plus the storage payload of the T type. If T is itself a non-zero or non-nullable type, or an enum with free space (a “niche”), then it can be optimized into a single word — for instance an Option<&u8> takes only a single pointer word of space because null references are not possible, while Option<*const u8> would take two words. When we use Result<(), E> as the return value for fallible operations that don’t need to return data, like the store_8 function above, we get that same benefit.
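
If it helps to picture what the compiler gets to do here, the same idea can be sketched in C (this is only an analogy, not Rust’s actual layout algorithm):

    #include <stdint.h>

    /* Without a niche: an explicit discriminant plus the payload, two words. */
    struct OptionRawPtr {
        int is_some;
        const uint8_t *ptr;     /* raw pointers may legitimately be NULL */
    };

    /* With a niche: a &u8 reference can never be null, so the compiler can use
       NULL itself to encode None, and the whole Option fits in one pointer word. */
    typedef const uint8_t *OptionRef;   /* NULL means None, anything else means Some */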

When you can return a single word, life is simple. At the low-level JIT implementation it’s probably passed in a register, and everything is awesome and fast.

However, a data-bearing Result<T,E> takes two words, and WebAssembly can’t return two words at once from a function. Instead, most of the time, the function is transformed so it doesn’t use the return value at all; the caller passes in the address of a stack location and the structure gets written there, which means two memory stores plus at least one memory load to check the discriminant value. Plus you have to manipulate the stack pointer, which some functions might not have needed to do otherwise.
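
Conceptually, the lowering looks something like this C sketch (hand-written for illustration; the names and exact layout are made up, not the compiler’s real ABI):

    #include <stdint.h>

    /* A two-word result: discriminant plus payload. */
    struct Result_u32 {
        uint32_t is_err;    /* 0 = Ok, 1 = Err */
        uint32_t value;     /* the payload, or an error code */
    };

    /* Something like `fn load_32(addr) -> Result<u32, Exception>` gets rewritten
       to take a pointer to caller-reserved stack space instead of returning a value: */
    void load_32_lowered(struct Result_u32 *out, uint32_t addr) {
        /* ... do the actual work ... */
        out->is_err = 0;        /* store #1: the discriminant */
        out->value = 0x1234;    /* store #2: the payload */
    }

    /* The caller bumps the linear-memory stack pointer to make room for `out`,
       calls the function, then loads out->is_err to decide whether to propagate. */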

This is not great, but is probably better than calling out to JavaScript and back. I haven’t yet benchmarked tight loops with this.

Packed enums in single words

I’ve also sometimes seen Result enums get packed into a single word, and I don’t yet understand the circumstances that this happens.

On a WebAssembly build, I’ve had it happen with Result<u16, E> where E is a small fieldless enum: it’s packed into a single 32-bit integer. But Result<u8, E> is sent through the stack, and Result<u32, E> is too, though it could have been packed into a single 64-bit integer.

On native x86-64 builds on macOS, I see the word packing for both Result<u16, E> and Result<u32, E>… meanwhile Result<u64, E> uses the stack, but Result<u8, E> uses two words returned in different registers.

I know the internals of Rust layout and calling conventions are officially undocumented and can change, but it’d be nice to know roughly what some of them are. ;)

Exception handling in emscripten: how it works and why it’s disabled by default

One of my “fun” projects I’m poking at on the side is an x86-64 emulator for the web using WebAssembly (not yet online, nowhere near working). I’ve done partial spike implementations in C++ and Rust with a handful of interpreted opcodes and emulation of a couple Linux syscalls; this is just enough for a hand-written “Hello, world!” executable. ;)

One of the key questions is how to handle exceptional conditions within the emulator, such as an illegal instruction, page fault, etc… There are multiple potentially fallible operations within the execution of a single opcode, so the method chosen to communicate those exceptions could make a big impact on performance.

For C++, one method which looks nice in the source code and can be good on native platforms is to use C++’s normal exception facility. This turns out to be somewhat problematic in emscripten builds for WebAssembly, which is explained by looking at its implementation.

In fact it’s considered enough of a performance problem that catching exceptions is disabled by default in emscripten — throwing will abort the program with no way to catch it except at the surrounding JavaScript host page level. To get exceptions to actually work, you must pass -s DISABLE_EXCEPTION_CATCHING=0 to the compiler.
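
So for the examples below, the build needs that flag along with the usual options, something like:

em++ -O1 -s DISABLE_EXCEPTION_CATCHING=0 -o kaboom.html kaboom.cpp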

Throwing is easy

Let’s write ourselves a potentially-throwing function:

static volatile int should_explode = 0;

void kaboom() {
    if (should_explode) {
        throw "kaboom";
    }
}

Pretty exciting! The volatile int is just to make sure the optimizer doesn’t do anything clever for us, as real-world functions don’t _always_ or _never_ throw; they _sometimes_ throw. At -O1 this compiles to WebAssembly something like this:

 (func $kaboom\28\29 (; 26 ;) (type $11)
  (local $0 i32)
  (if
   (i32.load
    (i32.const 13152)
   )
   (block
    (i32.store
     (local.tee $0
      (call $__cxa_allocate_exception
       (i32.const 4)
      )
     )
     (i32.const 1024)
    )
    (call $__cxa_throw
     (local.get $0)
     (i32.const 12432)
     (i32.const 0)
    )
    (unreachable)
   )
  )
 )

The function $__cxa_throw calls into emscripten’s JavaScript-side runtime to throw a native JS exception.

That’s right, WebAssembly doesn’t have a way to throw — or catch — an exception itself. Yet. They’re working on it, but it’s not done yet.

Calling best case

So now we want to call that function. In the best case scenario, there’s a function that doesn’t have any cleanup work to do after a potentially throwing call: no try/catch clauses, no destructors, no stack allocations to clean up.

void dostuff_nocatch() {
    printf("Doing nocatch stuff\n");
    kaboom();
    printf("Possibly unreachable stuff\n");
}

This compiles to nice, simple, fast code with a direct call to kaboom:

 (func $dostuff_nocatch\28\29 (; 32 ;) (type $11)
  (drop
   (call $puts
    (i32.const 1106)
   )
  )
  (call $kaboom\28\29)
  (drop
   (call $puts
    (i32.const 1126)
   )
  )
 )

If kaboom throws, the JavaScript + WebAssembly runtime unwinds the native call stack up through dostuff_nocatch until whatever surrounding code (if any) catches it. The second printf/puts call never gets made, as expected.

Catching gets clever

But if you have variables with destructors, as with the RAII pattern, or a try/catch block, things get weird fast.

void dostuff_catch() {
    try {
        printf("Doing ex stuff\n");
        kaboom();
    } catch (const char *e) {
        printf("Caught stuff\n");
    }
}

compiles to the much longer:

 (func $dostuff_catch\28\29 (; 27 ;) (type $11)
  (local $0 i32)
  (drop
   (call $puts
    (i32.const 1078)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (call $invoke_v
   (i32.const 1)
  )
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    (block
     (local.set $0
      (call $__cxa_find_matching_catch_3
       (i32.const 12432)
      )
     )
     (br_if $label$1
      (i32.ne
       (call $getTempRet0)
       (call $llvm_eh_typeid_for
        (i32.const 12432)
       )
      )
     )
     (drop
      (call $__cxa_begin_catch
       (local.get $0)
      )
     )
     (drop
      (call $puts
       (i32.const 1093)
      )
     )
     (call $__cxa_end_catch)
    )
   )
   (return)
  )
  (call $__resumeException
   (local.get $0)
  )
  (unreachable)
 )

A few things to note.

Here the direct call to kaboom is replaced with a call to the indirected invoke_v function:

  (call $invoke_v
   (i32.const 1)
  )

which is implemented in the JS runtime to wrap a try/catch around the call:

function invoke_v(index) {
  var sp = stackSave();
  try {
    dynCall_v(index);
  } catch(e) {
    stackRestore(sp);
    if (e !== e+0 && e !== 'longjmp') throw e;
    _setThrew(1, 0);
  }
}

Note this saves the stack position (in Wasm linear memory, for stack-allocated data — separate from the native/JS/Wasm call stack) and restores it, which allows functions that allocate stack data to participate in the faster call mode by not having to clean up their own stack pointer modifications.

If an exception is caught, a global flag is set which is checked and re-cleared after the call:

  ;; Load the threw flag into a local variable
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  ;; Clear the threw flag for the next call
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  ;; Branch based on the stored state
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    ....
   )
  )

If the flag wasn’t set after all, then the function continues on its merry way. If it was set, then it does its cleanup and either eats the exception (for catch) or re-throws it (for automatic cleanup like destructors).

Summary and surprises

So there are several things that’ll slow down your code when using C++ exception catching in emscripten, even if you never throw in the hot paths:

  • when catching, potentially throwing calls are indirected through JavaScript
  • destructors create implicit catch blocks, so you may be catching more than you think

There’s also some surprises:

  • in C++, every extern “C” function is potentially throwing!
  • “noexcept” functions that call potentially throwing functions add implicit catch blocks in order to call abort(), so they can guarantee that they do not throw exceptions themselves

So don’t just add “noexcept” to code willy-nilly, or performance may actually get worse.

It would be faster to use my own state flag and check it after every call (which could be hidden behind a template or a macro or something, probably, at least in part), since this would avoid calling out through JavaScript and indirecting all the calls.
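
For instance, a rough sketch of what that could look like (hypothetical names, shown for a void function; real code would also want to carry error details and handle return values):

#include <cstdio>

static volatile int should_explode = 0;
static int g_threw = 0;    // our own "something went wrong" flag

#define CHECK_THREW() do { if (g_threw) return; } while (0)

void kaboom_flagged() {
    if (should_explode) {
        g_threw = 1;       // instead of: throw "kaboom";
        return;
    }
}

void dostuff_flagged() {
    printf("Doing stuff\n");
    kaboom_flagged();
    CHECK_THREW();         // a plain in-module branch, no JavaScript indirection
    printf("Possibly unreachable stuff\n");
}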

Next I’ll look at another way in the Rust version, using Result<T, E> return types and the error propagation “?” shortcut in Rust…

Building Chromium for Windows ARM64

I heard a couple of months back that someone had confirmed that you can build Chromium (the open-source version of the Chrome browser) for Windows ARM64… Since there’s still no general release, I figured I’d give it a shot so I’d have something to test things in.

You have to build on Windows x64 (it won’t build on-device, it’d be too slow and may require running some x64 binaries from the tooling which wouldn’t work) and then copy the output over to an ARM64 device.

Tricky bits:

  • I built with Visual Studio 2019, but something in the build setup seems to want Visual Studio 2017’s C runtime library? Or something? Anyway I had to change a couple references in build/vs_toolchain.py from ‘Microsoft.VC141.CRT’ to ‘Microsoft.VC142.CRT’ and *.DebugCRT because I could not get it to look in the directory where the VC141 files were.
  • If you don’t turn on use_jumbo_build it will take a vveerryy lloonngg time. If you do, though, it will use more memory. Beware.
  • I had a lot of problems with the provided python instance not picking up imports from the depot_tools package, even after removing my other Python instances. Had to set PYTHONPATH=C:\src\depot_tools

My gn args for the build were:

# Required or else it defaults to x64
target_cpu = "arm64"

# To get a release-quality build for benchmarking
is_debug = false
is_component_build = false

# Coalesce multiple files and skip the Native Client stuff which won't be used in testing
use_jumbo_build = true
enable_nacl = false

I copied the resulting output directory over to the ARM64 machine in the morning; cd into the out dir, just run “chrome”, and bam, it works!

After enabling SIMD in chrome://flags, I tested ogv.js’s WebAssembly version of the dav1d AV1 decoder, with excellent results. Matching my earlier experiments with the Linux version under WSL (where I couldn’t get playback working due to missing audio and slow X11 drawing, but decoding benchmarks worked) I see decode times about 2x as fast as Firefox (which only uses a baseline compiler on ARM64, so isn’t as well optimized).

Enabling SIMD roughly doubles the throughput… and enabling threading roughly doubles it again.

As a result, I can get 720p24 AV1 video decoding and playing in WebAssembly fairly stably, with a few dropped frames here and there:

That’s not bad!

Very very much hoping that WebAssembly threading and SIMD will come to Safari, where we use ogv.js to play Ogg/WebM/VP9 (and in future AV1) video clips for Wikipedia. The ARM64 chips in recent iPad and iPhone models are already amazing, and both desktops and mobiles will benefit from the performance improvements.

Also very much looking forward to Chrome and Edge coming to native ARM64 Windows. And looking forward to Firefox’s next-gen code generation engine, Cranelift, improving their situation on these chips… :)

Windows 10 on ARM64 testing

I’ve been curious for a while about ARM64 (aarch64) and whether it’s up to the task of a modern, modest laptop… I finally picked up one of the second-generation Snapdragon 850-based Windows 10 “Always Connected PCs”, a Lenovo Yoga C630. It’s available in an 8 GB RAM configuration, which was enough to do some light development on, so I couldn’t help myself…

First thoughts on the machine: it’s not sure whether it’s a low-end or high-end product. Some aspects of it feel cheap, but nothing feels or works badly. The touchpad is decent enough, the keyboard is ok, and the 13″ 1080p screen is nicely colorful but feels a bit off. Some solid color areas look like they’re dithering visibly, which I’ve seen on cheaper LCDs. Speakers are definitely tinny. Fingerprint reader works ok for sign-in if you like that sort of thing.

First thoughts on the OS: it’s “just Windows 10”. :D Setup experience is like any other Win10 machine. It does start in “S mode” which limits you to store apps and built-ins… but you can switch that off in Settings at no cost.

This, I must point out, is where things fundamentally diverge from Microsoft’s previous Windows on ARM attempt, the Windows 8-era Windows RT. RT could not turn off the store restriction or the MS-signed restriction for Win32 apps, so you could only run Office (Win32) and whatever was in the Store (not much in those days).

In addition to now being able to run native ARM or ARM64 binaries — like Firefox! — from outside the store, you can also run 32-bit x86 binaries. Since most Windows software still ships 32-bit x86 builds, this gives you a wide compatibility range. I installed Git for Windows, the Rust compiler, Visual Studio, all kinds of developer crud!

And, for extra fun the Windows Subsystem for Linux (version 1) is available, able to run aarch64 Linux binaries. I was able to build some of my test projects as well as real-world things like emscripten with Clang/LLVM under Ubuntu-in-WSL, running natively.

I was also able to enable the Windows Insider program to get the latest beta builds; ARM64 almost feels like a native part of the Windows ecosystem.

Well, sorta. :D

There are pain points. Emulated programs run a bit slower. Native binaries are rare. Building with Visual Studio is awkward because there are no native ARM build tools, just cross tools you must run under emulation. If you wanted to do virtual machines in development, you’re stuck, because there’s no Hyper-V on ARM64 (yet?).

But there are big pluses: battery life seems long, there’s built-in LTE which “just works” once set up, and the Linux environment should be adequate for a manual MediaWiki dev setup.

Performance of the CPU is surprisingly decent, though single-threaded throughput is slower than the A11 in my iPhone X. It also throttles pretty aggressively, shutting down some of the “big” powerful cores after a few seconds of sustained usage and diverting more threads to the “little” low-power cores. This makes things like an LLVM compile a lot slower than they’d be running full-tilt in a server or workstation environment. For a fanless laptop that’s good thermal management, but beware.

All in all I think I’m going to enjoy fiddling with this machine, and will find it useful for travel thanks to its light weight, LTE, and USB-C charging. But I can’t help think it’d be twice as cool if it ran stock Fedora or Ubuntu as the main OS. ;) (I have no idea if that’s theoretically possible if you disable secure boot. I’ll leave it as an exercise to someone!)

Wasm RPC thoughts

I’ve spent far too much time lately going down a rabbit hole, thinking about how to do a safe, reasonably efficient, reasonably sane API for cross-WebAssembly-module function calls and data transfer… this would allow a Wasm or JS “app” to interact with Wasm “plugins” with certain guarantees about the plugin’s ability to access host memory and functions.

Ended up putting together a fake “slide deck” to explain it to myself, which y’all are welcome to enjoy below. :)

I’ll keep this going in the background while I’m catching up on other stuff, then get it back up and running with my plugin research later.

SIMD in WebAssembly – tales from the bleeding edge

While benchmarking the AV1 and VP9 video decoders in ogv.js, I bemoaned the lack of SIMD vector operations in WebAssembly. Native builds of these decoders lean heavily on SIMD (AVX and SSE for x86, Neon for ARM, etc) to perform operations on 8 or 16 pixels at once… Turns out there has been movement on the WebAssembly SIMD proposal after all!

Chrome’s V8 engine has implemented it (warning: somewhat buggy still), and the upstream LLVM Wasm code generator will generate code for it using clang’s vector operations and some intrinsic functions.

emscripten setup

The first step in your SIMD journey is to set up your emscripten development environment for the upstream compiler backend. Install emsdk via git — or update it if you’ve got an old copy.

Be sure to update the tags as well:

./emsdk update-tags

If you’re on Linux you can download a binary installation, but there’s a bug in emsdk that will cause it not to update. (Update: this was fixed a few days ago, so make sure to update your emsdk!)

./emsdk install latest-upstream
./emsdk activate latest-upstream

On Mac or Windows, or to install the latest upstream source on purpose, you can have it build the various tools from source. There’s not a convenient “sdk” catch-all tag for this that I can see, so you may need to call out all the bits:

./emsdk install emscripten-incoming-64bit
./emsdk activate emscripten-incoming-64bit
./emsdk install upstream-clang-master-64bit
./emsdk activate upstream-clang-master-64bit
./emsdk install binaryen-master-64bit
./emsdk activate binaryen-master-64bit

First build may take a couple hours or so, depending on your machine.

Re-running the install steps will update the git checkouts and re-build, which doesn’t take as long as a fresh build usually but can still take some time.

Upstream backend differences

Be warned that with the upstream backend, emscripten cannot build asm.js, only WebAssembly. If you’re making mixed JS & WebAssembly builds this may complicate your build process, because you have to switch backends back and forth.

You can switch back to the current fastcomp backend at any time by swapping your sdk state back:

./emsdk install latest
./emsdk activate latest

Note that every time you switch, the cached libc will get rebuilt on your next emcc invocation.

Currently there are some code-gen issues where the upstream backend produces more local variables and such than the older fastcomp backend, which can cause a slight slowdown in code of a few % (and for me, a bigger slowdown in Safari which does particularly well with the old compiler’s output). This is being actively worked on, and is expected to improve significantly soon.

Starting Chrome

Now you’ve got a compiler; you’ll also need a browser to run your code in. Chrome’s V8 includes support behind an experimental runtime flag; currently it’s not exposed to the user interface so you must pass it on the command line.

I recommend using Chrome Canary on Mac or Windows, or a nightly Chromium build on Linux, to make sure you’ve got any fixes that may have come in recently.

On Mac, one can start it like so:

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --js-flags="--experimental-wasm-simd"

If you forget the command-line flag or get it wrong, the module compilation won’t validate the SIMD instructions and will throw an exception, so you can’t run by mistake. (This is also apparently how you’re meant to test for SIMD support presence, AFAIK… compile a module and see if it works?)

Beware there are a few serious bugs in the V8 implementation, which may trip you up. In particular watch out for broken splat which can produce non-deterministic errors. For this reason I recommend disabling autovectorization for now, since you have no control of workarounds on the clang end. Non-constant bit shifts also fail to validate, requiring a scalar workaround.

Vector ops in clang

If you’re not working in C/C++ but are implementing your own Wasm compiler or hand-writing Wasm source, skip this section! ;) You’ll want to check out the SIMD proposal documentation for a list of available instructions.

First, forget anything you may have heard about emscripten including MMX compatibility headers (xmmintrin.h etc). They’ve been recently removed as they’re broken and misleading.

There’s also emscripten/vector.h, but it seems obsolete as well, with references to functions from the old SIMD.js implementation that no longer exist, so I recommend avoiding it for now.

The good news is, a bunch of vector stuff “just works” using standard clang syntax, and there’s a few intrinsic functions for particular operations like the bitselect instruction and vector shuffling.

First, take some plain ol’ code and compile it with SIMD enabled:

emcc -o foo.html -O3 -s SIMD=1 -fno-vectorize foo.c

It’s important for now to disable autovectorization since it tends to break on the V8 splat bug for me. In the future, you’ll want to leave it on to squeeze out the occasional performance increase without manual intervention.

If you’re not using headers which predefine standard vector types for you, you can create some convenient aliases like so (these are just the types I’ve used so far):

#include <stdint.h>

typedef int16_t int16x8 __attribute__((vector_size(16)));

typedef uint8_t uint8x16 __attribute__((vector_size(16)));
typedef uint16_t uint16x8 __attribute__((vector_size(16)));
typedef uint32_t uint32x4 __attribute__((vector_size(16)));
typedef uint64_t uint64x2 __attribute__((vector_size(16)));

The expected float and signed/unsigned integer interpretations for 128 bits are available, and you can freely cast between them to reinterpret bit sizes.
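
For example, reinterpreting the same 128 bits with different lane widths is just a cast (using the aliases above):

static inline uint8x16 as_bytes(const uint16x8 v) {
    return (uint8x16)v;    // same bits, now viewed as sixteen 8-bit lanes
}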

To work around bugs in the “splat” operation that expands a scalar to a vector, I made inline helper functions for myself:

static volatile int junk = 0;

static inline int16x8 splat_const(const int16_t val) {
    // Use this only on constants due to the v8 splat bug!
    return (int16x8){
        val, val, val, val,
        val, val, val, val
    };
}

static inline int16x8 splat_vec(const int16_t val) {
    // Try to work around issues with broken splat in V8
    // by forcing the input to be something that won't be reused.
    const int guarded = val + junk;
    return (int16x8){
        guarded, guarded, guarded, guarded,
        guarded, guarded, guarded, guarded
    };
}

Once the bug is fixed, it’ll be safe to remove the ‘junk’ and ‘guarded’ bits and use a single splat helper function. Though I’m sure there’s got to be a visibly clearer way to do a splat than manually writing out all the lanes and having the compiler coalesce them into a single splat operation? o_O

The bitselect operation is also frequently necessary, especially since convenient operations like min, max, and abs aren’t available on integer vectors. You might or might not be able to do this some cleaner way without the builtin, but this seems to work:

static inline int16x8 select_vec(const int16x8 cond,
                                 const int16x8 a,
                                 const int16x8 b) {
    return (int16x8)__builtin_wasm_bitselect(a, b, cond);
}

Note the parameter order: on the bitselect instruction and the builtin, the condition comes last. I found this maddening, so my helper function takes the condition first, where my code likes it.

You can now make your own vector abs function:

static inline int16x8 abs_vec(const int16x8 v) {
    return select_vec(v < splat_const(0), -v, v);
}

Note that the < and - operators “just work” on the vectors; we only needed helper functions for the bitselect and the splat. And I’m still not confident I need those two?
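
The same pattern gives you the missing integer min and max, building on the helpers above:

static inline int16x8 max_vec(const int16x8 a, const int16x8 b) {
    return select_vec(a > b, a, b);
}

static inline int16x8 min_vec(const int16x8 a, const int16x8 b) {
    return select_vec(a < b, a, b);
}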

You’ll also likely need the vector shuffle operation, which there’s a standard builtin for. For instance here I’m deinterleaving 16-bit pixels into 8-bit pixels and the extra high bytes:

static inline uint8x16 merge_pixels(const int16x8 work) {
    return (uint8x16)__builtin_shufflevector((uint8x16)work, (uint8x16)work,
        0, 2, 4, 6, 8, 10, 12, 14, // the 8 pixels we worked on
        1, 3, 5, 7, 9, 11, 13, 15  // zeroes we don't need
    );
}

Checking compiler output

To figure out what’s going on it helps a lot to disassemble the WebAssembly output to confirm what instructions are actually emitted — especially if you’re hoping to report a bug. Refer to the SIMD proposal for details of instructions and types used.

If you compile with -g your .wasm output will include function names, which make it much easier to read the disassembly!

Use wasm-dis like so:

wasm-dis foo.wasm > foo.wat

Load up the .wat in your code editor of choice (there are syntax highlighting plugins available for VS Code and I think Atom etc) and search for your function in the mountain of stuff.

Note in particular that bit-shift operations currently can produce big sequences of lane-shuffling and scalar bit-shifts. This is due to the LLVM compiler working around a V8 bug with bit-shifts, and will be fixed soon I hope.

If you wish, you can modify the Wasm source in the .wat file and re-assemble it to test subtle changes — use wasm-as for this.

Reporting bugs

You probably will encounter bugs — this is very bleeding-edge stuff! The folks working on it want your feedback if you’re working in this area, so please make the most of it by providing reproducible test cases for any bugs you encounter that aren’t chalked up to the existing splat argument corruption and non-constant shift bugs.

And beware that until the splat bug is fixed, non-deterministic problems pop up really easily.

ogv.js 1.6.0 released with experimental AV1 decoding

After some additional fixes and experiments I’ve tagged ogv.js 1.6.0 and released it. As usual you can use the ‘ogv’ package on npm or fetch the zip manually. This includes various fixes (including for some weird bugs!) and performance improvements on lower-end machines. Internals have been partially refactored to aid future maintenance, and experimental AV1 decoding has been added using VideoLAN’s dav1d decoder.

dav1d and SIMD

The dav1d AV1 decoder is now working pretty solidly, but slowly. I found that my test files were encoded at too high a quality; dialing them back to my actual target bitrate improved performance as a consequence, so hey! Not bad. ;)

I’ve worked around a minor compiler issue in emscripten’s old “fastcomp” asm.js->wasm backend where an inner loop didn’t get unrolled, which improves decode performance by a couple percent. Upstream prefers to let the unroll be implicit, so I’m keeping this patch in a local fork for now.

I’ve also been reached out to by some folks working on the WebAssembly SIMD proposal, which should allow speeding up some of the slow filtering operations with optimized vector code! The only browser implementation of the proposal (which remains a bit controversial) is currently Chrome, with an experimental command-line flag, and the updated vectorization code is in the new WebAssembly compiler backend that’s integrated with upstream LLVM.

So I spent some time getting up and running on the new LLVM backend for emscripten, found a few issues:

  • emsdk doesn’t update the LLVM download properly so you can get stuck on an old version and be very confused — this is being fixed shortly!
  • currently it’s hard to use a single sdk installation for both modes at once, and asm.js compilation requires the old backend. So I’ve temporarily disabled the asm.js builds on my simd2 work branch.
  • multithreaded builds are broken at the moment (at least modularized ones, which only just got fixed in the main compiler, so the LLVM backend might need similar fixes)
  • use of atomics intrinsics in a non-multithreaded build results in a validation error, whereas it had been silently turned into something safe in the old backend. I had to patch dav1d with a “fake atomics” option to #define them away.
  • Non-SIMD builds were coming out with data corruption, which I tracked down to an optimizer bug which had just been fixed upstream the day before I reported it. ;)
  • I haven’t gotten as far as working with any of the SIMD intrinsics, because I’m getting a memory access out of bounds issue when engaging the autovectorizer. I narrowed down a test case with the first error and have reported it; not sure yet whether the problem is in the compiler or in Chrome/V8.

In theory autovectorization is likely to not do much, but significant gains could be made using intrinsics… but only so much, as the operations available are limited and it’s hard to tell what will be efficient or not.

Intermittent file breakages

Very rarely, some files would just break at a certain position in the file for reasons I couldn’t explain. I had one turn up during AV1 testing where a particular video packet that contained two frame OBUs had one OBU header appear ok and the second obviously corrupted. I tracked the corruption back from the codec to the demuxer to the demuxer’s input buffer to my StreamFile abstraction used for loading data from a seekable URL.

Turned out that the offending packet straddled a boundary between HTTP requests — between the second and third megabytes of the file, each requested as a separate Range-based XMLHttpRequest, downloaded as binary strings so the data can be accessed during progress events. But according to the network panel, the second and third megabytes looked fine… but the *following* request turned up as 512 KiB. …What?

Dumping the binary strings of the second and third megabytes, I immediately realized what was wrong:

Enjoy some tasty binary strings!

The first requests were, as expected, showing 8-bit characters (ASCII and control chars etc). The request with the broken packet was showing CJK characters, indicating the string had probably been misinterpreted as UTF-16.

It didn’t take much longer to confirm that the first two bytes of the broken request were 0xFE 0xFF, a UTF-16 Byte Order Mark. This apparently overrides the “overrideMimeType” method’s x-user-defined charset, and there’s no way to override it back. Hypothetically you could probably detect the case and swap bytes back but I think it’s not actually worth it to do full streaming downloads within chunks for the player — it’s better to buffer ahead so you can play reliably.

For now I’ve switched it to use ArrayBuffer XHRs instead of binary strings, which avoids the encoding problem but means data can’t be accessed until each chunk has finished downloading.

ogv.js experimental AV1 decoding

The upcoming ogv.js 1.6.0 release will be the first to include experimental AV1 support, using the dav1d decoder. Thanks to ePirat for the initial work on cross-compiling the dav1d codebase with emscripten!

Performance is not great yet, but it may improve a bit from further optimizations, and potentially a lot from new platform features that may come to WebAssembly in the future.

In particular, on Internet Explorer, which lacks WebAssembly, performance is very poor, but it does work at very low resolutions on reasonably fast machines.

On my 2015 MacBook Pro (3.1 GHz 5th-gen Core i7), I can get somewhere between 360p and 480p on the “Caminandes – Llamigos” demo in Safari, while the current VP9 codec gives me 720p.

Safari has a great WebAssembly engine, giving 720p for VP9 or a solid 360p for AV1. 480p AV1 would be achievable with threading.

In IE 11, high-motion scenes in AV1 max out the CPU at only 120p, while VP9 gets away with 240p or so.

IE 11 runs several resolution steps lower, limited by its slow JavaScript engine. It will never get faster; we can only hope it will be gradually replaced.

Multithreaded WebAssembly builds are also included, thanks to emscripten fixing support for modularized threaded programs in 1.38.27. These, however, do not work in Safari, because it has not yet added back SharedArrayBuffer support after it was removed as part of the Spectre mitigations.

You can test the threaded builds in Chrome and Firefox with suitable flags enabled (“Wasm threading” for Chrome and “shared memory” for Firefox). VP9 scales well to 2 or 4 threads depending on the resolution, and AV1 scales to 2-ish threads. Will continue to tune and work on this for the future day when Safari supports threading.

Another area where WebAssembly doesn’t perform well is the lack of SIMD instructions — in many places there are tight loops of arithmetic that can be parallelized with vector computation, and native builds of the decoders make extensive use of SIMD. There is some experimental support in some browsers and emscripten but I’m not sure how well they talk to each other or how finalized the standard is so I haven’t tried it.

(It’s conceivable that browser engines could auto-vectorize tight loops in WebAssembly but they would probably be limited to 32-bit arithmetic at best, which wouldn’t parallelize as much as things that can work with 16-bit or 8-bit wide lanes.)

A peek at ogv.js updates

Between other projects, I’ve done a little maintenance on our ogv.js WebM player shim for Safari and IE. This should go out next week as 1.6.0 and should remain compatible with 1.5.x but has a lot of internals refactoring.

Experimental features

This release includes an experimental AV1 video decoder using the dav1d library, based on ePirat’s initial work getting it to build with emscripten. This is an initial test, and needs to be brought back up to date with upstream, optimized, etc.

Also included are multithreaded VP8 and VP9 decoders, brought back from the past thanks to emscripten landing updated support and browsers getting closer to re-enabling SharedArrayBuffer. Performance on 2-core and 4+-core machines is encouraging in Firefox and Chrome with flags enabled, but cannot yet be tested in Safari.

SD videos scale over 2 cores reasonably well, with a solid 50% or more decoding speed boost visible. HD can scale over 4 cores, with about 200% speed boost.

Threading must be manually opted in. The AV1 decoder is not yet threaded.

Low-end performance work

In Internet Explorer 11, performance is weaker across the board versus other browsers because its JS engine is an old, frozen version that’s not been optimized for asm.js or WebAssembly. In addition, machines where people are running IE 11 are probably more likely to be older, slower machines.

I did some optimization work which greatly improves perceived playback performance of a 120p WebM VP9/Opus file on my slowest test machine, a 1.67 GHz Atom netbook tablet from 2012 or so.

  • more aggressively selecting software YCbCr-RGB conversion when WebGL is going to be worse
  • Microoptimizations to conversion and extraction of YCbCr data from heap
  • Replaced timer-based audio packet consolidation with buffer-size-based one
  • Tuned audio communication with Flash to reduce calls, per-call overhead
  • Increased number of buffered audio and video frames to survive short pauses better
  • Fix for dropping of individual frames when an adjacent decoded frame is available

I don’t think there’s a lot left to be done in optimizing VP9 decoding for IE; the largest components in profiling at low resolutions are now things like vpx_convolve8_c which really would benefit from integer multiplication and just not being so darn slow… I tried a replacement function micro-optimizing to factor out common additions and bit-shifts in the generated emscripten code but it didn’t seem to make any difference; either that’s the one bit the IE optimizer catches anyway or the bit-shifts are so cheap compared to the memory accesses and multiplications that it just makes no measurable difference. :P

Ah well! This is why I started producing files down to 120p, to handle the super-slow cases.

What is important is making sure that it plays audio smoothly, which it’s now better at than before. And in production we’ll probably add a notice for slow decoding recommending picking another browser…

Refactoring

I’ve also started refactoring work in preparation for future changes to support MSE-style buffering, needed for adaptive bitrate streaming.

Instead of a hodge-podge of closure functions and local variables, I’ve transitioned to using ES6 modules and classes, with a babel transform step for ES5 compatibility.

This helps distinguish retained state from local variables, makes it easier to find state when debugging, and generally makes things easier to work with in the source. More cleanup still needs to be done in the larger processing functions and the various state vars, but it’s an improvement so far.