Windows ARM64 Visual Studio Code Insiders now available

Just want to give a shout-out to the wonderful folks at Microsoft and elsewhere who have gotten a Visual Studio Code Insiders build created for Windows on ARM64, which runs natively on the Surface Pro X and other ARM64 machines.

It’s still not listed in the regular downloads but it works for me when installed directly, and should auto-update with further Insiders releases. :)

The x86 build ran acceptably for some light development on the Surface Pro X in emulation, but the native build feels a *lot* faster. Starts up instantly, no longer so sluggish to scroll or wait for linter updates.

Now all I need is Docker for Win10/ARM64 and for WSL2 to fix the ARM64 performance problems with Hyper-V. :)

Surface Pro X thoughts after a few weeks

After a few weeks using the fancy new Windows 10 ARM64 tablet, the Surface Pro X, I’ve got a few thoughts. Mostly good so far, but it remains an early adopter device with a few rough edges (virtually — the physical edges are smooth and beautiful!) Note that my use cases are not everyone’s use cases, so some people will have even more luck, or even less luck, getting things working. :) Your mileage can and will vary.

Hardware

It’s just gorgeous. Too gorgeous. It’s all black-on-black labeled with black type. Mostly this is fine, but I find it hard to find the USB-C ports on the left side when I’ve got it propped up on its stand. :)

Seriously though, my biggest hardware complaint is that the bezels are too small for its size when holding as a tablet in the hands — I keep hitting the corners with my fat hands and opening the start menu or closing an app. I’m still not 100% sold on the idea of tablets-with-keyboards at this size (13″ diagonal or so).

But for watching stuff, the screen is *fantastic*. The 3:2 aspect ratio is also much better for anything that’s not video, while still not feeling like I’ve wasted much space on a 16:9 letterbox.

The keyboard attachment is pretty good. Get it. GET IT. I got the one that also has the cradle for the pen, which I never use but felt like I had to try out. If I did more art I would probably use it.

Performance and emulation

The CPU is really good. It’s got a huge speed boost over the Snapdragon 835 and 850 in older ARM64 Windows machines, and feels very snappy in native apps like Firefox or the new Edge. With 4 high-power CPU cores and 4 low-power cores, it handles multithreaded workloads fairly well unless they get confused by the scheduler… I’ve sometimes seen things have background threads get pushed to the low-power cores where they take a long time to run.

(In Task Manager, you can see the first 4 cores are the low-power cores, the next 4 are high-power.)

x86 Windows software is supported via emulation, both for store apps and regular win32 apps you find anywhere. But not everything works. I’ve generally had good luck with tools and applications – Visual Studio, VS Code, Chrome, Git for Windows, Krita, Inkscape all run. But about 1/2 of the Steam games I tried failed to run, maybe more. And software that’s x64-only won’t run at all, as there’s no emulator support for 64-bit code.

Emulated code in my unscientific testing runs 2-3 times slower than native code on sustained loops, but you can expect loading-time stuff to be slower because things have to get traced/compiled on the first run through or when code is modified in memory.

Nonetheless, 2-3 times slower than really-fast is still not-bad, and for UI-heavy or i/o-heavy applications it’s not too significant. I’ve had no real complaints using the x86 VS Code front-end, but more complaints with, say, compiling things in Visual Studio. :)

Web use case

Most of what I use a computer for these days is in a web browser environment, so “using the web” is big. Firefox has an optmized, native ARM64 build. Works great. ’nuff said.

Oh also Edge preview builds in the Dev and Canary channel are ARM64 native and run great, if you like that sort of thing.

Chrome, however, has not released a native build and will run in x86 emulation. If you need Chrome specifically it *will install and run* but it will be slow. Do not grab custom Chromium builds unless you’re using them only for testing, as they will not be secure or get updated!

Developer use case

I’m a software developer, so in addition to “everything that goes in a web browser” I need to use tools to work on a combination of stuff, mostly:

  • PHP and client-side JavaScript code (MediaWiki, a few other bits)
  • weird science C / JavaScript / emscripten / WebAssembly stuff (ogv.js, which plugs into MediaWiki’s video player extension)
  • research work in Rust (mtpng threaded PNG compressor)

LAMP stuff

I’m used to working in either a macOS or Linux environment, with Unix-like command line tools and usually a separate GUI text editor like Visual Studio Code, and never had good experiences trying to run the MediaWiki LAMP-stack tools on a Windows environment in years past. Even with Vagrant managing a VM, it had proved more fragile on Windows for me than on Mac or Linux.

WSL (Windows Subsystem for Linux) has changed that. I can run a Debian or Ubuntu system with less overhead and better integration to the host system than running in a traditional VM like VirtualBox or Hyper-V. On the Surface Pro X, you get the aarch64 distribution of Ubuntu or Debian (or whatever other supporting distro you choose to install) so it runs full speed, with no emulation overhead.

I’ve been using a MediaWiki git checkout in an Ubuntu setup, using the standard PHP/Apache/MySQL/whatevers and manually running git & composer updates. The main downside to using WSL here is that services don’t get started automatically because it doesn’t run the traditional init process, but “service mysql start” etc works as expected and gets you working.

For editing, I use Visual Studio Code. This is not yet available as an ARM64 optimized build (the x86 frontend runs in emulation), but does in 1.41 now include ARM64 support for WSL integration — which means you can run the PHP linter on your code running inside the Linux environment while your editor frontend is a native Windows GUI app. No wacky X11 hacks required.

emscripten stuff

The emscripten compiler for WebAssembly stuff works great, but doesn’t ship ARM or ARM64 builds for any platform yet in the emsdk tool.

You can build manually from source for now, and hopefully I can get builds working from the emsdk installer too (though you still would have to run the build yourself).

The main annoyance I had was that Ubuntu LTS currently ships an old node.js, which I had to supplement with a newer build to get my environment the way I wanted it for my scripts. :) This was pretty straightforward.

Rust stuff

Rust includes support for building code for Windows ARM64 — it has to to support things like Firefox! — but the compiler & tools distribution comes as x86. I’m sure this will eventually get worked out, but for now if you install Rust on Windows you’ll get the x86 build and may have to manually add the aarch64 target. But it does work — I can compile and run my mtpng project for Windows 10 ARM64 on the device.

Within a WSL environment, you can install Rust for Linux aarch64 and it “just works” as you’d expect, as well.

Final notes

All in all, pretty happy with it. I might have preferred a Surface Laptop X with similar specs but a built-in keyboard, but at a desk or other …. “surface” … it works fine for typey things like programming.

Certainly I prefer the keyboard to the keyboard on my 2018 MacBook Pro. ;)

Overflowing stacks in WebAssembly, whoops!

Native C-like programming environments tend to divide memory into several regions:

  • static data contains predefined constants loaded from the binary, and space for global variables (“data”)
  • your actual code has to go in memory too! (“text”)
  • space for dynamic memory allocation (“heap”), which may grow “up” as more memory is needed
  • space for temporary data and return addresses for function calls (“stack”), which may grow “down” as more memory is needed

Usually this is laid out with the stack high in the address space and the heap lower in the address space, if I recall correctly? Allocating more heap is done when you need it via malloc, and the stack can either grow or warn of an overflow by using the CPU’s memory manager to detect use of data pages incremented beyond the edge of the stack.

In emscripten’s WebAssembly porting environment, things are similar but a little different:

  • code doesn’t live in linear memory, so functions don’t have memory addresses
  • because code return addresses and small local variables also live separately, only arrays/structs and vars with address taken must be on stack.
  • usable memory is continguous; you can’t have a sparse address space where stack and heap can both grow.

As a result, the stack is fixed size and there’s some fragility, but the stack uses less space usually.

Currently the memory layout is to start with static data, follow with the stack, and then the heap. The stack grows “down”, meaning when you reach the end of the stack you end up in static data territory and can overwrite global variables. This leads to weird, hard to detect errors.

When making debug builds with assertions there are some “cookie” checks to ensure that some specific locations at the edge of the stack have not been overwritten at various times, but this doesn’t always catch things if you only overwrote the beginning of a buffer in static data and not the part that had the cookie. :) It also doesn’t seem to trigger on library workflows where you’re not running through the emscripten HTML5/SDL/WebGL runtime.

There’s currently a PR open to reduce the default stack size from 5 MiB to 0.5 MiB, which reduces the amount of memory needed for small modules significantly, and we’re chatting a bit about detecting errors in the case that codebases have regressions…

One thing that’s come up is the possibility of moving the stack to before static data, so you’d have: stack, static data, heap.

This has two consequences:

  • any memory access beyond the stack end will wrap around 0 into unallocated memory at the top of the address space, causing an immediate trap — this is done by the memory manager for free, with no false positives or negatives
  • literal references to addresses in static data will be larger numbers, thus may take more bytes to encode in the binary (variable-length encoding is used in WebAssembly for constants)

Probably the safety and debugging win would be a bigger benefit than the size savings, though potentially that could be a win for size-conscience optimizations when not debugging.

Rust error handling with Result and Option (WebAssembly ABI)

In our last adventure we looked at C++ exceptions in WebAssembly with the emscripten compiler. Now we’re taking a look at the main error handling system for another language targeting WebAssembly, Rust.

Rust has a “panic”/”unwind” system similar to C++ exceptions, but it’s generally recommended against catching panics. It also currently doesn’t work in WebAssembly output, where Rust doesn’t use emscripten’s tricks of calling out to JavaScript and is waiting for proper exception integration to arrive in Wasm.

Instead, most user-level Rust code expresses fallible operations using the Result or Option enum types. An Option<T> can hold either some value of type T — Some(T) — or no value — None. A Result<T, E> is fancier and can hold either a data payload Ok(T) or a Err(E) holding some information about the error condition. The E type can be a string or an enum carrying a description of the error, or it can just be an empty () unit type if you’re not picky.

Rust’s “?” operator is used as a shortcut for checking and propagating these error values.

For instance a function to store to memory (which might fail if the address is invalid or non-writable) you could write:

    // Looking up a memory page for writing could fail,
    // if the page is unmapped or read-only.
    pub fn page_for_write(&mut self, addr: GuestPtr)
    -> Result<&mut PageData, Exception> {
       // ... returns either Ok(something)
       // or Err(Exception::PageFault)
    }

    pub fn store_8(&mut self, addr: GuestPtr, val: u8)
    -> Result<(), Exception> {

        // First some address calculations which cannot fail.
        let index = addr_to_index(addr);
        let page = addr_to_page(addr);

        let mem = self.page_for_write(page)?;

        // The "?" operator checked for an Err(_) condition
        // and if so, propagated it to the caller. If it was
        // Ok(_) then we continue on with that value.
        mem.store_8(index, val);

        Ok(())
    }

This looks reasonably clean in the source — you have some ?s sprinkled about but they indicate that a branch may occur, which is good to help reason about the performance. This makes things more predictable when you look at it — addition of a destructor in C++ might make a fast path suddenly slow with emscripten’s C++/JavaScript hybrid exceptions, while the worst that happens here is a visible flag check which is not so bad, right?

Well, mostly. The beautiful, yet horrible thing about Rust is that automatic optimization lets you create these wonderful clean conceptual APIs that compile down to fast code. The horrible part is that you sometimes don’t know how or why it’s doing what, and there can still be surprises!

The good news is that there’s never any JavaScript call-outs in your hot paths, unlike emscripten’s C++ exception catching. Yay! But the bad news is that in WebAssembly, returning a structure like a Result<T,E> or an Option<T> that doesn’t resolve into a single word can be complicated. And knowing what resolves is complicated. And sometimes things get packed into integers and I don’t even know what’s doing that.

Crash course on enum layout

Rust “enums” can be much more fancy than C/C++ “enums”, with the ability not only to carry a discriminator of which enumerated value they carry, but to carry structure-like data payloads within them.

For Option<T>, this means you have a 1-bit discriminant between Some(_) or None and then the storage payload of the T type. If T is itself a non-zero or non-nullable type, or an enum with free space (a “niche”) then it can be optimized into a single word — for instance an Option<&u8> takes only a single pointer word of space because null references are not possible, while Option<*const u8> would take two words. When we use Result<(), E> for a return value on fallible operations that don’t need to return data, like the store_8 function above, we get that same benefit.

When you can return a single word, life is simple. At the low-level JIT implementation it’s probably passed in a register, and everything is awesome and fast.

However we have two words on data-bearing Result<T,E>s, and WebAssembly can’t return two words at once from a function. Instead, most of the time, the function is transformed to not use the return value, and instead send the address of a stack location to write the structure to memory, which means two memory stores and at least one memory read to check the discriminant value. Plus you have to manipulate the stack pointer, which you might not have needed otherwise on some functions.

This is not great, but is probably better than calling out to JavaScript and back. I haven’t yet benchmarked tight loops with this.

Packed enums in single words

I’ve also sometimes seen Result enums get packed into a single word, and I don’t yet understand the circumstances that this happens.

On a WebAssembly build, I’ve had it happen with Result<u16, E> where E is a small fieldless enum: it’s packed into a single 32-bit integer. But Result<u8, E> is sent through the stack, and Result<u32, E> is too, though it could have been packed into a single 64-bit integer.

On native x86-64 builds on macOS, I see the word packing for both Result<u16, E> and Result<u32, E>… meanwhile Result<u64, E> uses the stack, but Result<u8, E> uses two words returned in different registers.

I know the internals of Rust layout and calling conventions are officially undocumented and can change, but it’d be nice to know roughly what some of them are. ;)

Further information on Rust layout and ABI

Exception handling in emscripten: how it works and why it’s disabled by default

One of my “fun” projects I’m poking at on the side is an x86-64 emulator for the web using WebAssembly (not yet online, nowhere near working). I’ve done partial spike implementations in C++ and Rust with a handful of interpreted opcodes and emulation of a couple Linux syscalls; this is just enough for a hand-written “Hello, world!” executable. ;)

One of the key questions is how to handle exceptional conditions within the emulator, such as an illegal instruction, page fault, etc… There are multiple potentially fallible operations within the execution of a single opcode, so the method chosen to communicate those exceptions could make a big impact on performance.

For C++, one method which looks nice in the source code and can be good on native platforms is to use C++’s normal exception facility. This turns out to be somewhat problematic in emscripten builds for WebAssembly, which is explained by looking at its implementation.

In fact it’s considered enough of a performance problem that catching exceptions is disabled by default in emscripten — throwing will abort the program with no way to catch it except at the surrounding JavaScript host page level. To get exceptions to actually work, you must pass -s DISABLE_EXCEPTION_CATCHING=0 to the compiler.

Throwing is easy

Let’s write ourselves a potentially-throwing function:

static volatile int should_explode = 0;

void kaboom() {
    if (should_explode) {
        throw "kaboom";
    }
}

Pretty exciting! The volatile int is just to make sure the optimizer doesn’t do anything clever for us, as real world functions don’t _always_ or _never_ throw, they _sometimes_ throw. At -O1 this compiles to WebAssembly something like this:

 (func $kaboom\28\29 (; 26 ;) (type $11)
  (local $0 i32)
  (if
   (i32.load
    (i32.const 13152)
   )
   (block
    (i32.store
     (local.tee $0
      (call $__cxa_allocate_exception
       (i32.const 4)
      )
     )
     (i32.const 1024)
    )
    (call $__cxa_throw
     (local.get $0)
     (i32.const 12432)
     (i32.const 0)
    )
    (unreachable)
   )
  )
 )

The function $__cxa_throw calls into emscripten’s JavaScript-side runtime to throw a native JS exception.

That’s right, WebAssembly doesn’t have a way to throw — or catch — an exception itself. Yet. They’re working on it, but it’s not done yet.

Calling best case

So now we want to call that function. In the best case scenario, there’s a function that doesn’t have any cleanup work to do after a potentially throwing call: no try/catch clauses, no destructors, no stack allocations to clean up.

void dostuff_nocatch() {
    printf("Doing nocatch stuff\n");
    kaboom();
    printf("Possibly unreachable stuff\n");
}

This compiles to nice, simple, fast code with a direct call to kaboom:

 (func $dostuff_nocatch\28\29 (; 32 ;) (type $11)
  (drop
   (call $puts
    (i32.const 1106)
   )
  )
  (call $kaboom\28\29)
  (drop
   (call $puts
    (i32.const 1126)
   )
  )
 )

If kaboom throws, the JavaScript + WebAssembly runtime unwinds the native call stack up through dostuff_nocatch until whatever surrounding code (if any) catches it. The second printf/puts call never gets made, as expected.

Catching gets clever

But if you have variables with destructors, as with the RAII pattern, or a try/catch block, things get weird fast.

void dostuff_catch() {
    try {
        printf("Doing ex stuff\n");
        kaboom();
    } catch (const char *e) {
        printf("Caught stuff\n");
    }
}

compiles to the much longer:

 (func $dostuff_catch\28\29 (; 27 ;) (type $11)
  (local $0 i32)
  (drop
   (call $puts
    (i32.const 1078)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (call $invoke_v
   (i32.const 1)
  )
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    (block
     (local.set $0
      (call $__cxa_find_matching_catch_3
       (i32.const 12432)
      )
     )
     (br_if $label$1
      (i32.ne
       (call $getTempRet0)
       (call $llvm_eh_typeid_for
        (i32.const 12432)
       )
      )
     )
     (drop
      (call $__cxa_begin_catch
       (local.get $0)
      )
     )
     (drop
      (call $puts
       (i32.const 1093)
      )
     )
     (call $__cxa_end_catch)
    )
   )
   (return)
  )
  (call $__resumeException
   (local.get $0)
  )
  (unreachable)
 )

A few things to note.

Here the direct call to kaboom is replaced with a call to the indirected invoke_v function:

  (call $invoke_v
   (i32.const 1)
  )

which is implemented in the JS runtime to wrap a try/catch around the call:

function invoke_v(index) {
  var sp = stackSave();
  try {
    dynCall_v(index);
  } catch(e) {
    stackRestore(sp);
    if (e !== e+0 && e !== 'longjmp') throw e;
    _setThrew(1, 0);
  }
}

Note this saves the stack position (in Wasm linear memory, for stack-allocated data — separate from the native/JS/Wasm call stack) and restores it, which allows functions that allocate stack data to participate in the faster call mode by not having to clean up their own stack pointer modifications.

If an exception is caught, a global flag is set which is checked and re-cleared after the call:

  ;; Load the threw flag into a local variable
  (local.set $0
   (i32.load
    (i32.const 14204)
   )
  )
  ;; Clear the threw flag for the next call
  (i32.store
   (i32.const 14204)
   (i32.const 0)
  )
  ;; Branch based on the stored state
  (block $label$1
   (if
    (i32.eq
     (local.get $0)
     (i32.const 1)
    )
    ....
   )
  )

If the flag wasn’t set after all, then the function continues on its merry way. If it was set, then it does its cleanup and either eats the exception (for catch) or re-throws it (for automatic cleanup like destructors).

Summary and surprises

So there’s several things that’ll slow down your code using C++ exception catching in emscripten, even if you never throw in the hot paths:

  • when catching, potentially throwing calls are indirected through JavaScript
  • destructors create implicit catch blocks, so you may be catching more than you think

There’s also some surprises:

  • in C++, every extern “C” function is potentially throwing!
  • “noexcept” functions that call potentially throwing functions add implicit catch blocks in order to call abort(), so they can guarantee that they do not throw exceptions themselves

So don’t just add “noexcept” to code willy-nilly or it may get worse.

It would be faster to use my own state flag and check it after every call (which could be hidden behind a template or a macro or something, probably, at least in part), since this would avoid calling out through JavaScript and indirecting all the calls.

Next I’ll look at another way in the Rust version, using Result<T, E> return types and the error propagation “?” shortcut in Rust…

Building Chromium for Windows ARM64

I heard a couple months back that someone had confirmed that you can build Chromium (open-source version of Chrome browser) for Windows ARM64… Since there’s still no general release, figured I’d give it a shot to test things in.

You have to build on Windows x64 (it won’t build on-device, it’d be too slow and may require running some x64 binaries from the tooling which wouldn’t work) and then copy the output over to an ARM64 device.

Some resources:

Tricky bits:

  • I built with Visual Studio 2019, but something in the build setup seems to want Visual Studio 2017’s C runtime library? Or something? Anyway I had to change a couple references in build/vs_toolchain.py from ‘Microsoft.VC141.CRT’ to ‘Microsoft.VC142.CRT’ and *.DebugCRT because I could not get it to look in the directory where the VC141 files were.
  • If you don’t turn on use_jumbo_build it will take a vveerryy lloonngg time. If you do, though, it will use more memory. Beware.
  • I had a lot of problems with the provided python instance not picking up imports from the depot_tools package, even after removing my other Python instances. Had to set PYTHONPATH=C:\src\depot_tools

My gn args for the build were:

# Required or else it defaults to x64
target_cpu = "arm64"

# To get a release-quality build for benchmarking
is_debug = false
is_component_build = false

# Coalesce multiple files and skip the Native Client stuff which won't be used in testing
use_jumbo_build = true
enable_nacl = false

I copied the resulting output directory over to the ARM64 machine in the morning; cd into the out dir and just run “chrome” and bam it works!

After enabling SIMD in chrome://flags, I tested ogv.js’s WebAssembly version of the dav1d AV1 decoder, with excellent results. Matching my earlier experiments with the Linux version under WSL (where I couldn’t get playback working due to missing audio and slow X11 drawing, but decoding benchmarks worked) I see decode times about 2x as fast as Firefox (which only uses a baseline compiler on ARM64, so isn’t as well optimized).

Enabling SIMD roughly doubles the throughput… and enabling threading roughly doubles it again.

As a result, I can get 720p24 AV1 video decoding and playing in WebAssembly fairly stably, with a few dropped frames here and there:

That’s not bad!

Very very much hoping that WebAssembly threading and SIMD will come to Safari, where we use ogv.js to play Ogg/WebM/VP9 (and in future AV1) video clips for Wikipedia. The recent model iPad and iPhone ARM64 chips are already amazing, and both desktops and mobiles will benefit from the performance improvements.

Also very much looking forward to Chrome and Edge coming to native ARM64 Windows. And looking forward to Firefox’s next-gen code generation engine, cranelift, improving their situation on these chips…. :)

Windows 10 on ARM64 testing

I’ve been curious about ARM64 (aarch64) and whether it’s up to the task of a modern, modest laptop for a while… finally picked up one of the second-generation Snapdragon 850-based Windows 10 “Always Connected PCs”, a Lenovo Yoga C630. It’s available in an 8gb RAM configuration which was enough to do some light development on, so I couldn’t help myself…

First thoughts on the machine: it’s not sure whether it’s a low-end or high-end product. Some aspects of it feel cheap, but nothing feels or works badly. The touchpad is decent enough, the keyboard is ok, and the 13″ 1080p screen is nicely colorful but feels a bit off. Some solid color areas look like it’s dithering visibly, which I’ve seen on cheaper LCDs. Speakers are definitely tinny. Fingerprint reader works ok for sign-in if you like that sort of thing.

First thoughts on the OS: it’s “just Windows 10”. :D Setup experience is like any other Win10 machine. It does start in “S mode” which limits you to store apps and built-ins… but you can switch that off in Settings at no cost.

This, I must point out, is where things fundamentally diverge from Microsoft’s previous Windows on ARM attempt, the Windows 8-era Windows RT. RT could not turn off the store restriction or the MS-signed restriction for Win32 apps, so you could only run Office (Win32) and whatever was in the Store (not much in those days).

In addition to being able to now run native ARM or ARM64 binaries — like Firefox! — from outside the store, you can now run 32-bit x86 binaries. Since most Windows software still ships 32-bit x86, this gives you a wide compatibility range. I installed Git for Windows, the Rust compiler, Visual Studio, all kinds of developer crud!

And, for extra fun the Windows Subsystem for Linux (version 1) is available, able to run aarch64 Linux binaries. I was able to build some of my test projects as well as real-world things like emscripten with Clang/LLVM under Ubuntu-in-WSL, running natively.

I was also able to enable the Windows Insider program to get the latest beta builds; ARM64 almost feels like a native part of the Windows ecosystem.

Well, sorta. :D

There are pain points. Emulated programs run a bit slower. Native binaries are rare. Building with Visual Studio is awkward because there’s no native ARM build tools, just cross tools you must run under emulation. If you wanted to do virtual machines in development, you’re stuck because there’s no Hyper-V on ARM64 (yet?).

But there are big pluses: battery life seems long, there’s built-in LTE which “just works” once set up, and the Linux environment should be adequate for a manual MediaWiki dev setup.

Performance of the CPU is surprisingly decent, though single-threaded throughput is slower than the A11 in my iPhone X. It also throttles pretty aggressively, shutting down some of the “big” powerful cores after a few seconds of sustained usage and diverting more threads to the “little” low-power cores. This makes things like an LLVM compile a lot slower than they’d be running full-tilt in a server or workstation environment. For a fanless laptop that’s good thermal management, but beware.

All in all I think I’m going to enjoy fiddling with this machine, and will find it useful for travel thanks to its light weight, LTE, and USB-C charging. But I can’t help think it’d be twice as cool if it ran stock Fedora or Ubuntu as the main OS. ;) (I have no idea if that’s theoretically possible if you disable secure boot. I’ll leave it as an exercise to someone!)

Wasm RPC thoughts

I spent far too much time lately going down a rabbit hole thinking about how to do a safe, reasonably efficient, reasonably sane API for cross-WebAssembly-module function calls and data transfer… this would allow a Wasm or JS “app” to interact with Wasm “plugins” with certain guarantees about the plugin’s ability to access host memory and functions.

Ended up putting together a fake “slide deck” to explain it to myself, which y’all are welcome to enjoy below. :)

I’ll keep this going in the background while I’m catching up on other stuff, then get it back up and running with my plugin research later.

SIMD in WebAssembly – tales from the bleeding edge

While benchmarking the AV1 and VP9 video decoders in ogv.js, I bemoaned the lack of SIMD vector operations in WebAssembly. Native builds of these decoders lean heavily on SIMD (AVX and SSE for x86, Neon for ARM, etc) to perform operations on 8 or 16 pixels at once… Turns out there has been movement on the WebAssembly SIMD proposal after all!

Chrome’s V8 engine has implemented it (warning: somewhat buggy still), and the upstream LLVM Wasm code generator will generate code for it using clang’s vector operations and some intrinsic functions.

emscripten setup

The first step in your SIMD journey is to set up your emscripten development environment for the upstream compiler backend. Install emsdk via git — or update it if you’ve got an old copy.

Be sure to update the tags as well:

./emsdk update-tags

If you’re on Linux you can download a binary installation, but there’s a bug in emsdk that will cause it not to update. (Update: this was fixed a few days ago, so make sure to update your emsdk!)

./emsdk install latest-upstream
./emsdk activate latest-upstream

On Mac or Windows, or to install the latest upstream source on purpose, you can have it build the various tools from source. There’s not a convenient “sdk” catch-all tag for this that I can see, so you may need to call out all the bits:

./emsdk install emscripten-incoming-64bit
./emsdk activate emscripten-incoming-64bit
./emsdk install upstream-clang-master-64bit
./emsdk activate upstream-clang-master-64bit
./emsdk install binaryen-master-64bit
./emsdk activate binaryen-master-64bit

First build may take a couple hours or so, depending on your machine.

Re-running the install steps will update the git checkouts and re-build, which doesn’t take as long as a fresh build usually but can still take some time.

Upstream backend differences

Be warned that with the upstream backend, emscripten cannot build asm.js, only WebAssembly. If you’re making mixed JS & WebAssembly builds this may complicate your build process, because you have to switch back.

You can switch back to the current fastcomp backend at any time by swapping your sdk state back:

./emsdk install latest
./emsdk activate latest

Note that every time you switch, the cached libc will get rebuilt on your next emcc invocation.

Currently there are some code-gen issues where the upstream backend produces more local variables and such than the older fastcomp backend, which can cause a slight slowdown in code of a few % (and for me, a bigger slowdown in Safari which does particularly well with the old compiler’s output). This is being actively worked on, and is expected to improve significantly soon.

Starting Chrome

Now you’ve got a compiler; you’ll also need a browser to run your code in. Chrome’s V8 includes support behind an experimental runtime flag; currently it’s not exposed to the user interface so you must pass it on the command line.

I recommend using Chrome Canary on Mac or Windows, or a nightly Chromium build on Linux, to make sure you’ve got any fixes that may have come in recently.

On Mac, one can start it like so:

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --js-flags="--experimental-wasm-simd"

If you forget the command-line flag or get it wrong, the module compilation won’t validate the SIMD instructions and will throw an exception, so you can’t run by mistake. (This is also apparently how you’re meant to test for SIMD support presence, AFAIK… compile a module and see if it works?)

Beware there are a few serious bugs in the V8 implementation, which may trip you up. In particular watch out for broken splat which can produce non-deterministic errors. For this reason I recommend disabling autovectorization for now, since you have no control of workarounds on the clang end. Non-constant bit shifts also fail to validate, requiring a scalar workaround.

Vector ops in clang

If you’re not working in C/C++ but are implementing your own Wasm compiler or hand-writing Wasm source, skip this section! ;) You’ll want to checkout the SIMD proposal documentation for a list of available instructions.

First, forget anything you may have heard about emscripten including MMX compatibility headers (xmmintrin.h etc). They’ve been recently removed as they’re broken and misleading.

There’s also emscripten/vector.h but it seems obsolete as well with some references to functions that don’t exist that were from the old SIMD.js implementation, and I recommend avoiding it for now.

The good news is, a bunch of vector stuff “just works” using standard clang syntax, and there’s a few intrinsic functions for particular operations like the bitselect instruction and vector shuffling.

First, take some plain ol’ code and compile it with SIMD enabled:

emcc -o foo.html -O3 -s SIMD=1 -fno-vectorize foo.c

It’s important for now to disable autovectorization since it tends to break on the V8 splat bug for me. In the future, you’ll want to leave it on to squeeze out the occasional performance increase without manual intervention.

If you’re not using headers which predefine standard vector types for you, you can create some convenient aliases like so (these are just the types I’ve used so far):

typedef int16_t int16x8 __attribute((vector_size(16)));

typedef uint8_t uint8x16 __attribute((vector_size(16)));
typedef uint16_t uint16x8 __attribute((vector_size(16)));
typedef uint32_t uint32x4 __attribute((vector_size(16)));
typedef uint64_t uint64x2 __attribute((vector_size(16)));

The expected float and signed/unsigned integer interpretations for 128 bits are available, and you can freely cast between them to reinterpret bit sizes.

To work around bugs in the “splat” operation that expands a scalar to a vector, I made inline helper functions for myself:

static volatile int junk = 0;

static inline int16x8 splat_const(const int16_t val) {
// Use this only on constants due to the v8 splat bug!
return (int16x8){
val, val, val, val,
val, val, val, val
};
}

static inline int16x8 splat_vec(const int16_t val) {
// Try to work around issues with broken splat in V8
// by forcing the input to be something that won't be reused.
const int guarded = val + junk;
return (int16x8){
guarded, guarded, guarded, guarded,
guarded, guarded, guarded, guarded
};
}

Once the bug is fixed, it’ll be safe to remove the ‘junk’ and ‘guarded’ bits and use a single splat helper function. Though I’m sure there’s got to be a visibly clearer way to do a splat than manually writing out all the lanes and having the compiler coalesce them into a single splat operation? o_O

The bitselect operation is also frequently necessary, especially since convenient operations like min, max, and abs aren’t available on integer vectors. You might or might not be able to do this some cleaner way without the builtin, but this seems to work:

static inline int16x8 select_vec(const int16x8 cond,
const int16x8 a,
const int16x8 b) {
return (int16x8)__builtin_wasm_bitselect(a, b, cond);
}

Note the order of parameters on the bitselect instruction and the builtin has the condition last — I found this maddening so my helper function has the condition first where my code likes it.

You can now make your own vector abs function:

static inline int16x8 abs_vec(const int16x8 v) {
return select_vec(v < splat_const(0), -v, v);
}

Note the < and – operators “just work” on the vectors, we only needed helper functions for the bitselect and the splat. And I’m still not confident I need those two?

You’ll also likely need the vector shuffle operation, which there’s a standard builtin for. For instance here I’m deinterleaving 16-bit pixels into 8-bit pixels and the extra high bytes:

static inline uint8x16 merge_pixels(const int16x8 work) {
return (uint8x16)__builtin_shufflevector((uint8x16)work, (uint8x16)work,
0, 2, 4, 6, 8, 10, 12, 14, // the 8 pixels we worked on
1, 3, 5, 7, 9, 11, 13, 15 // zeroes we don't need
);
}

Checking compiler output

To figure out what’s going on it helps a lot to disassemble the WebAssembly output to confirm what instructions are actually emitted — especially if you’re hoping to report a bug. Refer to the SIMD proposal for details of instructions and types used.

If you compile with -g your .wasm output will include function names, which make it much easier to read the disassembly!

Use wasm-dis like so:

wasm-dis foo.wasm > foo.wat

Load up the .wat in your code editor of choice (there are syntax highlighting plugins available for VS Code and I think Atom etc) and search for your function in the mountain of stuff.

Note in particular that bit-shift operations currently can produce big sequences of lane-shuffling and scalar bit-shifts. This is due to the LLVM compiler working around a V8 bug with bit-shifts, and will be fixed soon I hope.

If you wish, you can modify the Wasm source in the .wat file and re-assemble it to test subtle changes — use wasm-as for this.

Reporting bugs

You probably will encounter bugs — this is very bleeding-edge stuff! The folks working on it want your feedback if you’re working in this area, so please make the most of it by providing reproducible test cases for any bugs you encounter that aren’t chalked down to the existing splat argument corruption and non-constant shift bugs.

And beware that until the splat bug is fixed, non-deterministic problems are really easy to pop up.

The various trackers:

ogv.js 1.6.0 released with experimental AV1 decoding

After some additional fixes and experiments I’ve tagged ogv.js 1.6.0 and released it. As usual you can use ‘ogv’ package on npm or fetch the zip manually. This includes various fixes, including for some weird bugs!, and performance improvements on lower-end machines. Internals have been partially refactored to aid future maintenance, and experimental AV1 decoding has been added using VideoLAN’s dav1d decoder.

dav1d and SIMD

The dav1d AV1 decoder is now working pretty solidly, but slowly. I found that my test files were encoded at too high a quality and dialed them back to my actual target bitrate and find that performance improves as a consequence, so hey! Not bad. ;)

I’ve worked around a minor compiler issue in emscripten’s old “fastcomp” asm.js->wasm backend where an inner loop didn’t get unrolled, which improves decode performance by a couple percent. Upstream prefers to let the unroll be implicit, so I’m keeping this patch in a local fork for now.

I’ve also been reached out to by some folks working on the WebAssembly SIMD proposal, which should allow speeding up some of the slow filtering operations with optimized vector code! The only browser implementation of the proposal (which remains a bit controversial) is currently Chrome, with an experimental command-line flag, and the updated vectorization code is in the new WebAssembly compiler backend that’s integrated with upstream LLVM.

So I spent some time getting up and running on the new LLVM backend for emscripten, found a few issues:

  • emsdk doesn’t update the LLVM download properly so you can get stuck on an old version and be very confused — this is being fixed shortly!
  • currently it’s hard to use a single sdk installation for both modes at once, and asm.js compilation requires the old backend. So I’ve temporarily disabled the asm.js builds on my simd2 work branch.
  • multithreaded builds are broken atm (at least modularized, which only just got fixed on the main compiler so might need fixes for llvm backend)
  • use of atomics intrinsics in a non-multithreaded build results in a validation error, whereas it had been silently turned into something safe in the old backend. I had to patch dav1d with a “fake atomics” option to #define them away.
  • Non-SIMD builds were coming out with data corruption, which I tracked down to an optimizer bug which had just been fixed upstream the day before I reported it. ;)
  • I haven’t gotten as far as working with any of the SIMD intrinsics, because I’m getting a memory access out of bounds issue when engaging the autovectorizer. I narrowed down a test case with the first error and have reported it; not sure yet whether the problem is in the compiler or in Chrome/V8.

In theory autovectorization is likely to not do much, but significant gains could be made using intrinsics… but only so much, as the operations available are limited and it’s hard to tell what will be efficient or not.

Intermittent file breakages

Very rarely, some files would just break at a certain position in the file for reasons I couldn’t explain. I had one turn up during AV1 testing where a particular video packet that contained two frame OBUs had one OBU header appear ok and the second obviously corrupted. I tracked the corruption back from the codec to the demuxer to the demuxer’s input buffer to my StreamFile abstraction used for loading data from a seekable URL.

Turned out that the offending packet straddled a boundary between HTTP requests — between the second and third megabytes of the file, each requested as a separate Range-based XMLHttpRequest, downloaded as binary strings so the data can be accessed during progress events. But according to the network panel, the second and third megabytes looked fine…. but the *following* request turned up as 512 KiB. …What?

Dumping the binary strings of the second and third megabytes, I immediately realized what was wrong:

Enjoy some tasty binary strings!

The first requests were as expected showing 8-bit characters (ASCII and control chars etc). The request with the broken packet was showing CJK characters indicating the string had probably been misinterpreted as UTF-16

It didn’t take much longer to confirm that the first two bytes of the broken request were 0xFE 0xFF, a UTF-16 Byte Order Mark. This apparently overrides the “overrideMimeType” method’s x-user-defined charset, and there’s no way to override it back. Hypothetically you could probably detect the case and swap bytes back but I think it’s not actually worth it to do full streaming downloads within chunks for the player — it’s better to buffer ahead so you can play reliably.

For now I’ve switched it to use ArrayBuffer XHRs instead of binary strings, which avoids the encoding problem but means data can’t be accessed until each chunk has finished downloading.