Brain dump: x86 emulation in WebAssembly

This is a quick brain dump of my recent musings on the feasibility of a WebAssembly-based in-browser emulator for x86 and x86-64 processors… partially in the hopes of freeing up my brain for my main project work. ;)

My big side project for some time has been ogv.js, an in-browser video player framework which uses emscripten to cross-compile C libraries for the codecs into JavaScript or, experimentally, the new WebAssembly target. That got me interested in how WebAssembly works at the low level, and how C/C++ programs work, and how we can mishmash them together in ways never intended by gods or humans.

Specifically, I’m thinking it would be fun to make an x86-64 Linux process-level emulator built around a WebAssembly implementation. This would let you load a native Linux executable into a web browser and run it, say, on your iPad. Slowly. :)

System vs process emulation

System emulators model an entire computer system, with emulated software-hardware interfaces: you load up a full kernel-mode operating system image which talks to the emulated hardware. This is what you use for playing old video games, or for running an old or experimental operating system. It can require emulating a lot of detailed system behavior, which might be tricky or slow, and programs may not integrate well with the surrounding environment because they live in a tiny computer within a computer.

Process emulators work at the level of a single user-mode process, which means you only have to emulate up to the system call layer. Older Mac users may remember their shiny new Intel Macs running old PowerPC applications through the “Rosetta” emulator for instance. QEMU on Linux can be set up to handle similar cross-arch emulated execution, for testing or to make some cross-compilation scenarios easier.

A process emulator has some attraction because the model is simpler inside the process… If you don’t have to handle interrupts and task switches, you can run more instructions together in a row, elide some state changes, and do all kinds of fun things. You might not have to implement indirect page tables for memory access. You might even be able to get away with modeling some function calls as function calls, and loops as loops!

WebAssembly instances and Linux processes

There are many similarities, which is no coincidence as WebAssembly is designed to run C/C++ programs similarly to how they work in Linux/Unix or Windows while being shoehornable into a JavaScript virtual machine. :)

An instantiated WebAssembly module has a “linear memory” (a contiguous block of memory addressable via byte indexing), analogous to the address space of a Linux process. You can read and write int and float values of various sizes anywhere you like, and interpretation of bytewise data is up to you.

Like a native process, the module can request more memory from the environment, which will be placed at the end. (“grow_memory” operator somewhat analogous to Linux “brk” syscall, or some usages of “mmap”.) Unlike a native process, usable memory always starts at 0 (so you can dereference a NULL pointer!) and there’s no way to have a “sparse” address space by mapping things to arbitrary locations.
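A minimal sketch of how that looks from the JavaScript side, using the WebAssembly JS API:

// Minimal sketch of a module's linear memory as seen from JavaScript.
const memory = new WebAssembly.Memory({ initial: 1 });  // 1 page = 64KiB, addresses start at 0
const bytes = new Uint8Array(memory.buffer);
bytes[0] = 0x2a;              // address 0 is valid -- dereferencing a "NULL pointer" works fine
memory.grow(1);               // like brk: another 64KiB appears at the end of existing memory
// note: grow() detaches the old ArrayBuffer, so views have to be re-created afterwards
const moreBytes = new Uint8Array(memory.buffer);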

The module can also have “global variables” which live outside this address space — but they cannot be dynamically indexed, so you cannot have arrays or any dynamic structures there. In WebAssembly built via emscripten, globals are used only for some special linking structures because they don’t quite map to any C/C++ construct, but hand-written code can use them freely.

The biggest difference from native processes is that WebAssembly code doesn’t live in the linear memory space. Function definitions have their own linear index space (which can’t be dynamically indexed: references are fixed at compile time), plus there’s a “table” of indirect function references (which can be dynamically indexed into). Function pointers in WebAssembly thus aren’t actually pointers to the instructions in linear memory like on native — they’re indexes into the table of dynamic function references.

Likewise, the call stack and local variables live outside linear memory. (Note that C/C++ code built with emscripten will maintain its own parallel stack in linear memory in order to provide arrays, variables that have pointers taken to them, etc.)

WebAssembly’s actual opcodes are oriented as a stack machine, which is meant to be easy to verify and compile into more efficient register-based code at runtime.

Branching and control flow

In WebAssembly, control flow is limited, with one-way branches possible only to a containing block (i.e. breaking out of a loop). Subroutine calls go only to defined functions (either directly by compile-time reference, or indirectly via the function table).

Control flow is probably the hardest thing to make really match up from native code — which lets you jump to any instruction in memory from any other — to compiled WebAssembly.

It’s easy enough to handle craaaazy native branching in an interpreter loop. Pseudocode:

loop {
    instruction = decode_instruction(ip)
    instruction.execute() // update ip and any registers, etc
}

In that case, a JMP or CALL or whatever just updates the instruction pointer when you execute it, and you continue on your merry way from the new position.

But what if we wanted to eke more performance out of it by compiling multiple instructions into a single function? That lets us elide unnecessary state changes (updating instruction pointers, registers, flags, etc when they’re immediately overridden) and may even give opportunity to let the compiler re-optimize things further.

A start is to combine runs of instructions that end in a branch or system call (QEMU calls them “translation units”) into a compiled function, then call those in the loop instead of individual instructions:

loop {
    tu = cached_or_compiled_tu(ip)
    tu.execute() // update registers, ip, etc as we go
}

So instead of decoding and executing an instruction at a time, we’re decoding several instructions, compiling a new function that runs them, and then running that. Nice, if we have to run it multiple times! But…. possibly not worth as much as we want, since a lot of those instruction runs will be really short, and there’ll be function call overhead on every run. And, it seems like it would kill CPU branch prediction and such, by essentially moving all branches to a single place (the tu.execute()).

QEMU goes further in its dynamic translation emulators, modifying the TUs to branch directly to each other as they’re discovered at runtime. It’s all very funky and scary looking…

But QEMU’s technique of modifying trampolines in the live code won’t work as we can’t modify running code to insert jump instructions… and even if we could, there are no one-way jumps, and using call instructions risks exploding the call stack on what’s actually a loop (there’s no proper tail call optimization in WebAssembly).

Relooper

What can be done, though, is to compile bigger, better, badder functions.

When emscripten is generating JavaScript or WebAssembly from your C/C++ program’s LLVM intermediate language, it tries to reconstruct high-level control structures within each function from a more limited soup of local branches. These then get re-compiled back into branch soup by the JIT compiler, but efficiently. ;)

The binaryen WebAssembly code gen library provides this “relooper” algorithm too: you pass in blocks of instructions, possible branches, and the conditions around them, and it’ll spit out some nicer branch structure if possible, or an ugly one if not.
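Here’s a rough sketch of what driving that looks like through binaryen’s JavaScript bindings (the API names and signatures here are from memory and vary between binaryen.js versions, so treat it as illustrative):

const binaryen = require('binaryen');

const mod = new binaryen.Module();
const relooper = new binaryen.Relooper(mod);

// One relooper block per decoded TU; mod.nop() stands in for the real compiled bodies.
const entry = relooper.addBlock(mod.nop());
const body  = relooper.addBlock(mod.nop());
const exit  = relooper.addBlock(mod.nop());

// Describe the possible branches between blocks, with optional conditions.
relooper.addBranch(entry, body, null, null);                           // fall into the loop
relooper.addBranch(body, body, mod.local.get(1, binaryen.i32), null);  // loop while flag is set
relooper.addBranch(body, exit, null, null);                            // otherwise fall out

// Render structured control flow; local 0 is a scratch i32 "label helper"
// the relooper can use when it can't express something more cleanly.
const reloopedBody = relooper.renderAndDispose(entry, 0);
mod.addFunction('tu_combined', binaryen.none, binaryen.none,
                [binaryen.i32, binaryen.i32], reloopedBody);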

I’m pretty sure it should be possible to take a detected loop cycle of separate TUs and create a combined TU that’s been “relooped” in a way that’s more efficient.

BBBBuuuuutttttt all this sounds expensive in terms of setup. Might want to hold off on any compilation until a loop cycle is detected, for instance, and just let the interpreter roll on one-off code.

Modifying runtime code in WebAssembly

Code is not addressable or modifiable within a live module instance; unlike in native code you can’t just write instructions into memory and jump to the pointer.

In fact, you can’t actually add code to a WebAssembly module. So how are we going to add our functions at runtime? There are two tricks:

First, multiple module instances can use the same linear memory buffer.

Second, the tables for indirect function calls can list “foreign” functions, such as JavaScript functions or WebAssembly functions from a totally unrelated module. And those tables are modifiable at runtime (from the JavaScript side of the border).

These can be used to do full-on dynamic linking of libraries, but all we really need is to be able to add a new function that can be indirect-called, which will run the compiled version of some number of instructions (perhaps even looping natively!) and then return back to the main emulator runtime when it reaches a branch it doesn’t contain.
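In JavaScript terms, that might look roughly like this (assuming the freshly compiled module imports the shared memory and table under “env”, and exports a hypothetical run_tu function; both names depend on how we generate the module):

// Shared state for the whole emulator: one linear memory and one function table.
const memory = new WebAssembly.Memory({ initial: 256 });                     // emulated RAM + emulator state
const table  = new WebAssembly.Table({ initial: 4096, element: 'anyfunc' }); // slots for compiled TUs

async function addCompiledTU(wasmBytes, tableIndex) {
  // Instantiate the newly generated module against the same memory and table,
  // so its code sees the same emulated address space as everything before it.
  const { instance } = await WebAssembly.instantiate(wasmBytes, {
    env: { memory, table }
  });
  // Patch the indirect-call table from the JS side; previously compiled modules
  // can now reach the new code via call_indirect at this slot.
  table.set(tableIndex, instance.exports.run_tu);
  return instance.exports.run_tu;
}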

Function calls

Since x86 has a nice handy CALL instruction, and doesn’t just rely on convention, it could be possible to model calls to already-cached TUs as indirect function calls, which may perform better than exiting out to the loop and coming back in. But they’d probably need to be guarded for early exit, for several reasons… if we haven’t compiled the entirety of the relooped code path from start to exit of the function, then we have to exit back out. A guard check on IP and early-return should be able to do that in a fairly sane way.

function tu_1234() {
    // loop
    do {
        // calc loop condition -> set zero_flag
        ip = 1235
        if (!zero_flag) {
            break
        }
        ip = 1236
        // CALL 4567
        tu = cached_or_compiled_tu(4567)
        tu.execute()
        if (ip != 1236) {
            // only partway through. back to emulator loop,
            // possibly unwinding a long stack :)
            return
        }
        // more code, updating ip as we go
    }
}

I think this makes some kind of sense. But if we’re decoding instructions + creating output on the fly, it could take a few iterations through to produce a full compiled set, and exiting a loop early might be … ugly.

It’s possible that all this is a horrible pipe dream, or would perform too badly to be worth JIT compilation anyway.

But it could still be fun for ahead-of-time compilation. ;) Which is complicated… a lot … by the fact that you don’t have the positions of all functions known ahead of time. Plus, if there’s dynamic linking or JIT compilation inside the process, well, none of that’s even present ahead of time.

Prior art: v86

I’ve been looking a lot at v86, a JavaScript-based x86 system emulator. v86 is a straight-up interpreter, with instruction decoding and execution mixed together a bit, but it feels fairly straightforwardly written and easy to follow when I look at things in the code.

v86 uses a set of aliased typed arrays for the system memory, another set for the register file, and then some variables/properties for misc flags and things.

Some quick notes:

  • a register file in an array means accesses at different sizes are easy (al vs ax vs eax), and you can easily index into it from the operand selector bits of the instruction (as opposed to using a variable per register)
  • is there overhead from all the object property accesses etc? would it be more efficient to do everything within a big linear memory?
  • as a system emulator there’s some extra overhead to things like protected mode memory accesses (page tables! who knows what!) that could be avoided on a per-process model
  • 64-bit emulation would be hard in JavaScript due to lack of 64-bit integers (argh!)
  • as an interpreter, instruction decode overhead is repeated during loops!
  • to avoid expensive calculations of the flags register bits, most arithmetic operations that would change the flags instead save the inputs for the flag calculations, which get done on demand. This is still often redundant because flags may get immediately rewritten by the next instruction, but is cheaper than eagerly calculating them. (A sketch of this lazy-flags approach follows below.)
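A minimal sketch of that lazy-flags idea in JavaScript (names are illustrative, not v86’s actual internals):

// Save the operands and result of the last flag-setting operation;
// compute individual flag bits only when something reads them.
let lastOp1 = 0, lastOp2 = 0, lastResult = 0;

function add32(a, b) {
  lastOp1 = a;
  lastOp2 = b;
  lastResult = (a + b) | 0;   // no flag bits computed here
  return lastResult;
}

function getZeroFlag() {
  return lastResult === 0;
}

function getCarryFlag() {
  // unsigned overflow check for the saved 32-bit add
  return (lastOp1 >>> 0) + (lastOp2 >>> 0) > 0xFFFFFFFF;
}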

WebAssembly possibilities

First, since WebAssembly supports only one linear memory buffer at a time, the register file and perhaps some other data would need to live there. We’d most likely want a layout with the register file and other emulator state at the beginning of memory, with the rest of memory after a fixed point belonging to the emulated process.

Putting all the emulator’s non-RAM state in the beginning means a process emulator can request more memory on demand via Linux ‘brk’ syscall, which would be implemented via the ‘grow_memory’ operator.
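A hedged sketch of what a brk handler might look like on the JS side, assuming emulator state occupies a reserved region at the bottom of memory and the emulated process starts at a fixed offset above it (the constants and names here are made up for illustration):

const WASM_PAGE = 64 * 1024;
const HEAP_BASE = 0x10000;     // illustrative: first 64KiB reserved for registers & emulator state
let programBreak = HEAP_BASE;  // would really start at the end of the loaded executable image

function sys_brk(memory, newBreak) {
  if (newBreak === 0) {
    return programBreak;       // brk(0) just queries the current break
  }
  const needed = newBreak - memory.buffer.byteLength;
  if (needed > 0) {
    // grow_memory / memory.grow works in whole 64KiB pages
    memory.grow(Math.ceil(needed / WASM_PAGE));
  }
  programBreak = newBreak;
  return programBreak;
}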

64-bit math

WebAssembly supports 64-bit integer memory accesses and arithmetic, unlike JavaScript! The only limitation is that you can’t (yet) export a function that returns or accepts an i64 to or from JavaScript-land. That means if we keep our opcode implementations in WebAssembly functions, they can efficiently handle 64-bit ops.

However WebAssembly’s initial version allows only 32-bit memory addressing. This may not be a huge problem for emulating 64-bit processes that don’t grow that large, though, as long as the executable doesn’t need to be loaded at a specific address (which would mean a sparse address space).

Sparse address spaces could be emulated with indirection into a “real” memory that’s in a sub-4GB space, which would be needed for a system emulator anyway.

Linux details

Statically linked ELF binaries would be easiest to model. Dynamic linking would be more complex: we’d need to pass a bundle of files in and do fix-ups, etc.

Question: are executables normally PIC like libraries are, or do they want a default load address? (The latter would break the direct-memory-access model and require some indirection for a sparse address space.)

Answer: normally Linux x86_64 executables are not PIE, and want to be loaded at 0x400000 or maybe some other random place. D’oh! But… in the common case, you could simplify that as a single offset.

The syscall entry point on 32-bit is the ‘int $0x80’ instruction, or the ‘syscall’ instruction on 64-bit. Syscalls would probably mostly need to be implemented on the JS side, poking at the memory and registers of the emulated process state and then returning.
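Roughly, the JS side would decode the syscall number and arguments out of the emulated registers and dispatch. A sketch, with made-up readReg/writeReg helpers and handlers like the sys_brk one above (the register conventions are real: on x86-64 Linux the number goes in rax and arguments in rdi, rsi, rdx, r10, r8, r9):

function handleSyscall(state) {
  const nr = state.readReg('rax');
  switch (nr) {
    case 1:   // write(fd, buf, count)
      state.writeReg('rax', sys_write(state,
        state.readReg('rdi'), state.readReg('rsi'), state.readReg('rdx')));
      break;
    case 12:  // brk(addr)
      state.writeReg('rax', sys_brk(state.memory, state.readReg('rdi')));
      break;
    default:
      state.writeReg('rax', -38);  // -ENOSYS (64-bit sign handling elided)
  }
}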

Network I/O would probably need to be able to block and return to the emulator… so, like a function call bowing out early because an uncompiled branch was taken, we’d potentially need an “early exit” from the middle of a combined TU if it makes a syscall that ends up being async. On the other hand, if a syscall can be done synchronously, it might be nice not to pay that penalty.

We could also need async syscalls for multi-process stuff via Web Workers… anything that must call back to the main thread would need to be async.

For 64-bit values, JS code would have to… painfully… deal with 32-bit half-words. Awesome. ;)
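For instance, reading an emulated 64-bit register from JS means assembling it from two 32-bit halves (assuming an illustrative layout with the register file at the start of the shared linear memory from the earlier sketches, stored little-endian):

// 16 general-purpose registers, 2 x 32-bit words each, at offset 0 of the shared memory
const regs32 = new Uint32Array(memory.buffer, 0, 16 * 2);

function readReg64(index) {
  const lo = regs32[index * 2];
  const hi = regs32[index * 2 + 1];
  // Lossy above 2^53! Real code would keep the halves separate
  // (or use the 64-bit ops inside WebAssembly instead).
  return hi * 0x100000000 + lo;
}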

Multiprocessing

WebAssembly’s initial version has no facility for multiple threads accessing the same memory, which means no threads. However this is planned to come in a future version…

Processes with separate address spaces could be implemented by putting each process emulator in a Web Worker, and having them communicate via messages sent to the main thread through syscalls. This forces any syscall that might need global state to be async.
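A rough sketch of the worker side of that (the message shapes and the pending queue are made up for illustration):

// Inside a process Worker: syscalls that need global state round-trip
// through the main thread as messages, so they have to be async.
const pendingSyscalls = [];

function asyncSyscall(nr, args) {
  return new Promise((resolve) => {
    pendingSyscalls.push(resolve);
    postMessage({ type: 'syscall', nr, args });
  });
}

onmessage = (event) => {
  if (event.data.type === 'syscall-result') {
    // resolve the oldest outstanding syscall with the "kernel's" answer
    pendingSyscalls.shift()(event.data.result);
  }
};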

Prior art: Browsix

Browsix provides a POSIX-like environment based around web techs, with processes modeled in Web Workers and syscalls done via async messages. (C/C++ programs can be compiled to work in Browsix with a modified emscripten.) Pretty sweet ideas. :)

I know they’re working on WebAssembly processes as well, and were looking into synchronous syscalls via SharedArrayBuffer/Atomics too, so this might be an interesting area to watch.

Could it be possible to make a Linux binary loader for the Browsix kernel? Maybe!

Would it be possible to make graphical Linux binaries work, with some kind of JS X11 or Wayland server? …mmmmmmaaaaybe? :D

Closing thoughts

This all sounds like tons of fun, but may have no use other than learning a lot about some low-level tech that’s interesting.

ogv.js 1.4.0 released

ogv.js 1.4.0 is now released, available as a .zip build or via npm. I’ll try to push it to Wikimedia in the next week or so.

Live demo available as always.

New A/V sync

The main improvement is much smoother playback on slower machines, mainly from changing the A/V sync method to prioritize audio smoothness. This is based on recommendations I’d received from video engineers at conferences: choppy audio is noticed by users much more strongly than choppy or out-of-sync video.

Previously, when ogv.js playback detected that video was getting behind audio, it would halt audio until the video caught up. This played all audio, and showed all frames, but could be very choppy if performance wasn’t good (such as in Internet Explorer 11 on an old PC!)

The new sync method instead keeps audio rock-solid, and allows video to get behind a little… if the video catches back up within a few frames, chances are the user won’t even notice. If it stays behind, we look ahead for the next keyframe… when the audio reaches that point, any remaining late frames are dropped. Suddenly we find ourselves back in sync, usually with not a single discontinuity in the audio track.

fastSeek()

The HTMLMediaElement API supports a fastSeek() method which is supposed to seek to the nearest keyframe before the request time, thus getting back to playback faster than a precise seek via setting the currentTime property.

Previously this was stubbed out with a slow precise seek; now it is actually fast. This enables a much better “scrubbing” experience given a suitable control widget, as can be seen in the demo by grabbing the progress thumb and moving it around the bar.

VP9 playback

WebM videos using the newer, more advanced VP9 codec can use a lot less bandwidth than VP8 or Theora videos, making it attractive for streaming uses. A VP9 decoder is now included for WebM, initially supporting profile 0 only (other profiles may or may not explode) — that means 8-bit, 4:2:0 subsampling.

Other subsampling formats will be supported in the future, and I can probably eventually figure out something to do with 10-bit, but don’t expect those to perform well. :)

The VP9 decoder is moderately slower than the VP8 decoder for equivalent files.

Note that WebM is still slightly experimental; the next version of ogv.js will make further improvements and enable it by default.

WebAssembly

Firefox and Chrome have recently shipped support for code modules in the WebAssembly format, which provides a more efficient binary encoding for cross-compiled code than JavaScript. Experimental wasm versions are now included, but not yet used by default.

Multithreaded video decoding

Safari 10.1 has shipped support for the SharedArrayBuffer and Atomics APIs, which allow fully multithreaded code to be produced from the emscripten cross-compiler.

Experimental multithreaded versions of the VP8 and VP9 decoders are included, which can use up to 4 CPU cores to significantly increase speed on suitably encoded files (using the -slices option in ffmpeg for VP8, or -tile_columns for VP9). This works reasonably well in Safari and Chrome on Mac or Windows desktops; there are performance problems in Firefox due to deoptimization of the multithreaded code.

This actually works in iOS 10.3 as well — however, Safari on iOS seems to aggressively limit how much code can be compiled in a single web page, and the multithreading means there’s more code and it’s copied across multiple threads, often leading to much worse behavior as the code can end up running without optimization.

Future versions of WebAssembly should bring multithreading there as well, and likely with better performance characteristics regarding code compilation.

Note that existing WebM transcodes on Wikimedia Commons do not include the suitable options for multithreading, but this will be enabled on future builds.

Misc fixes

Various bits. Check out the readme and stuff. :)

What’s next for ogv.js?

Plans for future include:

  • replace the emscripten’d nestegg demuxer with Brian Parra’s jswebm
  • fix the scaling of non-exact display dimensions on Windows w/ WebGL
  • enable WebM by default
  • use wasm by default when available
  • clean up internal interfaces to…
  • …create official plugin API for demuxers & decoders
  • split the demo harness & control bar to separate packages
  • split the decoder modules out to separate packages
  • Media Source Extensions-alike API for DASH support…

Those’ll take some time to get all done and I’ve got plenty else on my plate, so it’ll probably come in several smaller versions over the next months. :)

I really want to get a plugin interface so that people who want or need other codecs, and worry less about the licensing than I do, can make plugins for them! And to make it easier to test Brian Parra’s jsvpx hand-ported VP8 decoder.

An MSE API will be the final ‘holy grail’ piece of the puzzle toward moving Wikimedia Commons’ video playback to adaptive streaming using WebM VP8 and/or VP9, with full native support in most browsers but still working with ogv.js in Safari, IE, and Edge.

Limitations of AVSampleBufferDisplayLayer on iOS

In my last post I described using AVSampleBufferDisplayLayer to output manually-uncompressed YUV video frames in an iOS app, for playing WebM and Ogg files from Wikimedia Commons. After further experimentation I’ve decided to instead stick with using OpenGL ES directly, and here’s why…

  • 640×360 output regularly displays with a weird horizontal offset corruption on iPad Pro 9.7″. Bug filed as rdar://29810344
  • Can’t get any pixel format with 4:4:4 subsampling to display. Theora and VP9 both support 4:4:4 subsampling, so that made some files unplayable.
  • Core Video pixel buffers for 4:2:2 and 4:4:4 are packed formats, and it prefers 4:2:0 to be a weird biplanar semi-packed format. This requires conversion from the planar output I already have, which may be cheap with Neon instructions but isn’t free.

Instead, I’m treating each plane as a separate one-channel grayscale image, which works for any chroma subsampling ratios. I’m using some Core Video bits (CVPixelBufferPool and CVOpenGLESTextureCache) to do texture setup instead of manually calling glTexImage2D with a raw source blob, which improves a few things:

  • Can do CPU->GPU memory copy off main thread easily, without worrying about locking my GL context.
  • No pixel format conversions, so straight memcpy for each line…
  • Buffer pools are tied to the video buffer’s format object, and get swapped out automatically when the format changes (new file, or file changes resolution).
  • Don’t have to manually account for stride != width in the texture setup!

It could be more efficient still if I could pre-allocate CVPixelBuffers with on-GPU memory and hand them to libvpx and libtheora to decode into… but they currently lack sufficient interfaces to accept frame buffers with GPU-allocated sizes.

A few other oddities I noticed:

  • The clean aperture rectangle setting doesn’t seem to be preserved when creating a CVPixelBuffer via CVPixelBufferPool; I have to re-set it when creating new buffers.
  • For grayscale buffers, the clean aperture doesn’t seem to be picked up by CVOpenGLESTextureGetCleanTexCoords. Not sure if this is only supposed to work with Y’CbCr buffer types or what… however I already have all these numbers in my format object and just pull from there. :)

I also fell down a rabbit hole researching color space issues after noticing that some of the video formats support multiple colorspace variants that may imply different RGB conversion matrices… and maybe gamma…. and what do R, G, and B mean anyway? :) Deserves another post sometime.

Drawing uncompressed YUV frames on iOS with AVSampleBufferDisplayLayer

One of my little projects is OGVKit, a library for playing Ogg and WebM media on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264 due to patent licensing concerns, but Apple doesn’t support Ogg or WebM, so we have to jump through some hoops…)

One tricky aspect of working with digital video is that video frames are usually processed, compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used in the rest of the digital display pipeline.

This means that you can’t just take the output from a video decoder and blit it to the screen — you need to know how to dig out the pixel data and recombine it into RGB first.

Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader — I actually ported it over from ogv.js‘s WebGL drawing code. But surely a system like iOS with pervasive hardware-accelerated video playback already has some handy way to draw YUV frames?

While researching how to work with system-standard CMSampleBuffer objects to replace my custom OGVVideoBuffer class, I discovered that iOS 8 and later (and macOS version something) do have such a handy output path: AVSampleBufferDisplayLayer. This guy has three special tricks:

  • CMSampleBuffer objects go in, pretty pictures on screen come out!
  • Can manage a queue of buffers, synchronizing display times to a provided clock!
  • If you pass compressed H.264 buffers, it handles decompression transparently!

I’m decompressing from a format AVFoundation doesn’t grok so the transparent decompression isn’t interesting to me, but since it claimed to accept uncompressed buffers too I figured this might simplify my display output path…

The queue system sounds like it might simplify my timing and state management, but is a bigger change to my code to make so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.

There are however two major caveats:

  • AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end up repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. :)
  • Uncompressed frames must be in a very particular format or you get no visible output and no error messages.

Specifically, it wants a CMSampleBuffer backed by a CVPixelBuffer that’s IOSurface-backed, using a bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However libtheora and libvpx produce output in traditional tri-planar format, with separate Y, U and V planes. This meant I had to create buffers in the appropriate format with appropriate backing memory, copy the Y plane, and then interleave the U and V planes into a single chroma muddle.

My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized I had a really crappy loop around every pixel. ;)

Using memcpy — a highly optimized system function — to copy the luma lines cut the time down to 3-4ms per frame. A little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame using a couple ARM-specific vector intrinsic functions, inspired by assembly code I found googling around for YUV layout conversions.

It turns out you can interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:

// Interleave 8 Cb and 8 Cr samples into 16 bytes of CbCr output.
// On ARM this is two vector loads and one interleaving store.
#if defined(__arm64) || defined(__arm)
#include <arm_neon.h>
#endif

static inline void interleave_chroma(unsigned char *chromaCbIn, unsigned char *chromaCrIn, unsigned char *chromaOut) {
#if defined(__arm64) || defined(__arm)
    // Load 8 Cb bytes and 8 Cr bytes into a pair of 64-bit NEON registers...
    uint8x8x2_t tmp = { .val = { vld1_u8(chromaCbIn), vld1_u8(chromaCrIn) } };
    // ...and store them interleaved: Cb0 Cr0 Cb1 Cr1 ...
    vst2_u8(chromaOut, tmp);
#else
    // Portable fallback: interleave one pixel pair at a time.
    chromaOut[0] = chromaCbIn[0];
    chromaOut[1] = chromaCrIn[0];
    chromaOut[2] = chromaCbIn[1];
    chromaOut[3] = chromaCrIn[1];
    chromaOut[4] = chromaCbIn[2];
    chromaOut[5] = chromaCrIn[2];
    chromaOut[6] = chromaCbIn[3];
    chromaOut[7] = chromaCrIn[3];
    chromaOut[8] = chromaCbIn[4];
    chromaOut[9] = chromaCrIn[4];
    chromaOut[10] = chromaCbIn[5];
    chromaOut[11] = chromaCrIn[5];
    chromaOut[12] = chromaCbIn[6];
    chromaOut[13] = chromaCrIn[6];
    chromaOut[14] = chromaCbIn[7];
    chromaOut[15] = chromaCrIn[7];
#endif
}

This might be even faster if copying is done on a “slice” basis during decoding, while the bits of the frame being copied are in cache, but I haven’t tried this yet.

With the more efficient copies, the AVSampleBufferDisplayLayer-based output doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter for conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload to Wikimedia Commons…

Testing in-browser video transcoding with MediaRecorder

A few months ago I made a quick test transcoding video from MP4 (or whatever else the browser can play) into WebM using the in-browser MediaRecorder API.

I’ve updated it to work in Chrome, using a <canvas> element as an intermediary recording surface as captureStream() isn’t available on <video> elements yet there.
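The shape of it is roughly this (not the exact demo code; it assumes the browser supports canvas.captureStream() and WebM in MediaRecorder, and leaves audio out):

// Relay a playing <video> through a <canvas> and record the canvas stream as WebM.
const video = document.querySelector('video');
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
const chunks = [];
let recorder;

video.addEventListener('loadedmetadata', () => {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;

  const stream = canvas.captureStream(30);   // capture the canvas at ~30fps
  recorder = new MediaRecorder(stream, { mimeType: 'video/webm;codecs=vp8' });
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const webm = new Blob(chunks, { type: 'video/webm' });
    // ... hand the blob off for download or upload
  };

  recorder.start(1000);                      // emit a chunk every second
  video.play();
  drawLoop();
});

function drawLoop() {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  if (video.ended) {
    recorder.stop();
  } else {
    requestAnimationFrame(drawLoop);
  }
}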

Live demo: https://brionv.com/misc/browser-transcode-test/capture.html

There are a couple advantages of re-encoding a file this way versus trying to do all the encoding in JavaScript, but also some disadvantages…

Pros

  • actual encoding should use much less CPU than a cross-compiled JavaScript encoder
  • less code to maintain!
  • don’t have to jump through hoops to get at raw video or audio data

Cons

  • MediaRecorder is realtime-oriented:
    • will never decode or encode faster than realtime
    • if encoding is slower than realtime, lots of frames are dropped
    • on my MacBook Pro, realtime encoding tops out around 720p30, but eg phone camera videos will often be 1080p30 these days.
  • browser must actually support WebM encoding or it won’t work (eg, won’t work in Edge unless they add it in future, and no support at all in Safari)
  • Firefox and Chrome both seem to be missing Vorbis audio recording needed for base-level WebM (but do let you mix Opus with VP8, which works…)

So to get frame-rate-accurate transcoding, and to support higher resolutions, it may be necessary to jump through further hoops and try JS encoding.

I know this can be done — there are some projects compiling the entire ffmpeg package in emscripten and wrapping it in a converter tool — but we’d have to avoid shipping an H.264 or AAC decoder for patent reasons.

So we’d have to draw the source <video> to a <canvas>, pull the RGB bits out, convert to YUV, and run through lower-level encoding and muxing… oh did I forget to mention audio? Audio data can be pulled via Web Audio, but only in realtime.

So it may be necessary to do separate audio (realtime) and video (non-realtime) capture/encode passes, then combine into a muxed stream.

Canvas, Web Audio, MediaStream oh my!

I’ve often wished that for ogv.js I could send my raw video and audio output directly to a “real” <video> element for rendering instead of drawing on a <canvas> and playing sound separately to a Web Audio context.

In particular, things I want:

  • Not having to convert YUV to RGB myself
  • Not having to replicate the behavior of a <video> element’s sizing!
  • The warm fuzzy feeling of semantic correctness
  • Making use of browser extensions like control buttons for an active video element
  • Being able to use browser extensions like sending output to ChromeCast or AirPlay
  • Disabling screen dimming/lock during playback

This last is especially important for videos of non-trivial length, especially on phones which often have very aggressive screen dimming timeouts.

Well, in some browsers (Chrome and Firefox) now you can do at least some of this. :)

I’ve done a quick experiment using the <canvas> element’s captureStream() method to capture the video output — plus a capture node on the Web Audio graph — combining the two separate streams into a single MediaStream, and then piping that into a <video> for playback. Still have to do YUV to RGB conversion myself, but final output goes into an honest-to-gosh <video> element.
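The plumbing is roughly this (canvas, frameRate, and playerGainNode stand in for ogv.js’s existing drawing canvas, target frame rate, and Web Audio output node):

// Video track from the canvas ogv.js already draws into...
const canvasStream = canvas.captureStream(frameRate);

// ...plus an audio track captured from the existing Web Audio graph.
const audioCtx = new AudioContext();
const audioDest = audioCtx.createMediaStreamDestination();
playerGainNode.connect(audioDest);

// Merge the two into a single MediaStream and hand it to a real <video>.
const combined = new MediaStream([
  ...canvasStream.getVideoTracks(),
  ...audioDest.stream.getAudioTracks()
]);

const videoElement = document.querySelector('video');
videoElement.srcObject = combined;   // some older browsers want URL.createObjectURL(combined) instead
videoElement.play();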

To my great pleasure it works! Though in Firefox I get some flickering that may be a bug; I’ll have to track it down.

Some issues:

  • Flickering on Firefox. Might just be my GPU, might be something else.
  • The <video> doesn’t have insight into things like duration, seeking, etc., so we can’t rely on the native controls or the <video> element’s API alone to act like a native <video> with a file source.
  • Pretty sure there are inefficiencies. Have not tested performance or checked if there’s double YUV->RGB->YUV->RGB going on.

Of course, Chrome and Firefox are the browsers I don’t need ogv.js for in Wikipedia’s current usage, since they already play WebM and Ogg natively. But if Safari and Edge adopt the necessary interfaces and WebRTC-related infrastructure for MediaStreams, it might become possible to use Safari’s full screen view, AirPlay mirroring, and picture-in-picture with ogv.js-driven playback of Ogg, WebM, and potentially other custom or legacy or niche formats.

Unfortunately I can’t test whether casting to a ChromeCast works in Chrome as I’m traveling and don’t have one handy just now. Hoping to find out soon! :D

JavaScript async/await fiddling

I’ve been fiddling with using ECMAScript 2015 (“ES6”) in rewriting some internals for ogv.js, both in order to make use of the Promise pattern for asynchronous code (to reduce “callback hell”) and to get cleaner-looking code with the newer class definitions, arrow functions, etc.

To do that, I’ll need to use babel to convert the code to the older ES5 version to run in older browsers like Internet Explorer and old Safari releases… so why not go one step farther and use new language features like asynchronous functions that are pretty solidly specced but still being implemented natively?

Not yet 100% sure; I like the slightly cleaner code I can get, but we’ll see how it functions once translated…

Here’s an example of an in-progress function from my buffering HTTP streaming abstraction, currently being rewritten to use Promises and support a more flexible internal API that’ll be friendlier to the demuxers and seek operations.

I have three versions of the function: one using provisional ES2017 async/await, one using ES2015 Promises directly, and one written in ES5 assuming a polyfill of ES2015’s Promise class. See the full files or the highlights of ES2017 vs ES2015:

The first big difference is that we don’t have to start with the “new Promise((resolve,reject) => {…})” wrapper. Declaring the function as async is enough.

Then we do some synchronous setup, which is the same in both versions.

Now things get different, as we perform one or two asynchronous sub-operations.

In my opinion the async/await code is cleaner:

First it doesn’t have as much extra “line noise” from parentheses and arrows.

Second, I can use a try/finally block to do the final state change only once instead of on both .then() and .catch(). Many promise libraries will provide an .always() or something but it’s not standard.

Third, I don’t have to stop and think about what the “intermittent return” means in the .then() handler after the triggerDownload call.

Here, returning a promise means that that function gets executed before moving on to the next .then() handler and resolving the outer promise, whereas not returning anything means immediate resolution of the outer promise. It ain’t clear to me without thinking about it every time I see it…

Whereas the async/await version makes it clear with the “await” keyword what’s going on.
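For illustration, here’s a boiled-down reconstruction of the two styles (not the actual ogv.js source; triggerDownload() and waitForReadable() stand in for the real async helpers):

let buffering = false;
let bytesRead = 0;

// ES2015 Promise style:
function bufferToPromise(end) {
  return new Promise((resolve, reject) => {
    buffering = true;
    triggerDownload(end).then(() => {
      if (bytesRead < end) {
        return waitForReadable();   // returning the promise delays the next .then()
      }
      // returning nothing resolves the next .then() immediately -- easy to misread
    }).then(() => {
      buffering = false;            // final state change duplicated here...
      resolve();
    }).catch((err) => {
      buffering = false;            // ...and here in .catch()
      reject(err);
    });
  });
}

// ES2017 async/await style:
async function bufferToAsync(end) {
  buffering = true;
  try {
    await triggerDownload(end);     // "await" makes the ordering explicit
    if (bytesRead < end) {
      await waitForReadable();
    }
  } finally {
    buffering = false;              // done exactly once
  }
}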

Updated: I managed to get babel up and running; here’s a github gist with expanded versions after translation to ES5. The ES5 original is unchanged; the ES2015 promise version is very slightly more verbose, and the ES2017 version becomes a state machine monstrosity. ;) Not sure if this is ideal, but it should preserve the semantics desired.

Exploring VP9 as a progressive still image codec

At Wikipedia we have long articles containing many images, some of which need a lot of detail and others which will be scrolled past or missed entirely. We’re looking into lazy-loading, alternate formats such as WebP, and other ways to balance display density vs network speed.

I noticed that VP9 supports scaling of reference frames from different resolutions, so a frame that changes video resolution doesn’t have to be a key frame.

This means that a VP9-based still image format (unlike VP8-based WebP) could encode multiple resolutions to be loaded and decoded progressively, at each step encoding only the differences from the previous resolution level.

So to load an image at 2x “Retina” display density, we’d load up a series of smaller, lower density frames, decoding and updating the display until reaching the full size (say, 640×360). If the user scrolls away before we reach 2x, we can simply stop loading — and if they scroll back, we can pick up right where we left off.

I tried hacking up vpxenc to accept a stream of concatenated PNG images as input, and it seems plausible…

Demo page with a few sample images (not yet optimized for network load; requires Firefox or Chrome):
https://media-streaming.wmflabs.org/pic9/

Compared to loading a series of intra-coded JPEG or WebP images, the total data payload to reach resolution X is significantly smaller. Compared against only loading the final resolution in WebP or JPEG, without any attempt at tuning I found my total payloads with VP9 to be about halfway between the two formats, and with tuning I can probably beat WebP.

Currently the demo loads the entire .webm file containing frames up to 4x resolution, seeking to the frame with the target density. Eventually I’ll try repacking the frames into separately loadable files which can be fed into Media Source Extensions or decoded via JavaScript… That should prevent buffering of unused high resolution frames.

Some issues:

Changing resolutions always forces a keyframe unless doing single-pass encoding with frame lag set to 1. This is not super obvious, but is neatly enforced in encoder_set_config in vp9_cx_iface.h! Use the --passes=1 --lag-in-frames=1 options to vpxenc.

Keyframes are also forced if width/height go above the “initial” width/height, so I had to start the encode with a stub frame of the largest size (solid color, so still compact). I’m a bit unclear on whether there’s any other way to force the ‘initial’ frame size to be larger, or if I just have to encode one frame at the large size…

There’s also a validity check on resized frames that forces a keyframe if the new frame is twice or more the size of the reference frame. I used smaller than 2x steps to work around this (tested with steps at 1/8, 1/6, 1/4, 1/2, 2/3, 1, 3/2, 2, 3, 4x of the base resolution).

I had to force updates of the golden & altref frames on every frame to make sure every frame was ref’d against the previous one, or the decoder would reject the output. --min-gf-interval=1 isn’t enough; I hacked vpxenc to set the flags on the frame encode to VP8_EFLAG_FORCE_GF | VP8_EFLAG_FORCE_ARF.

I’m having trouble loading the VP9 webm files in Chrome on Android; I’m not sure if this is because I’m doing something too “extreme” for the decoder on my Nexus 5x or if something else is wrong…

Scaling video playback on slow and fast CPUs in ogv.js

Video playback has different performance challenges at different scales, and mobile devices are a great place to see that in action. Nowhere is this more evident than in the iPhone/iPad lineup, where the same iOS 9.3 runs across several years worth of models with a huge variance in CPU speeds…

In ogv.js 1.1.2 I’ve got the threading using up to 3 threads at maximum utilization (iOS devices so far have only 2 cores): main thread, video decode thread, and audio decode thread. Handling of the decoded frames or audio packets is serialized through the main thread, where the player logic drives the demuxer, audio output, and frame blitting.

On the latest iPad Pro 9.7″, advertising “desktop-class performance”, I can play back the Blender sci-fi short Tears of Steel comfortably at 1080p24 in Ogg Theora:


The performance graph shows frames consistently on time (blue line is near the red target line) and a fair amount of headroom on the video decode thread (cyan) with a tiny amount of time spent on the audio thread (green) and main thread (black).

At this and higher resolutions, everything is dominated by video decode time — if we can keep up with it we’re golden, but if we get behind everything would ssllooww ddoownn badly.

On an iPad Air, two models behind, we get similar performance on the 720p24 version, at about half the pixels:


We can see the blue bars jumping up once a second, indicating sensitivity to the timing report and graph being updated once a second on the main thread, but overall still good. Audio in green is slightly higher but still ignorable.

On a much older iPad 3, another two models behind, we see a very different graph as we play back a mere 240p24 quarter-SD resolution file:


The iPad 3 has an older generation, 32-bit processor, and is in general pretty sluggish. Even at a low resolution, we have less headroom for the cyan bars of the video decode thread. Blue bars dipping below the red target line show we’re slipping on A/V sync sometimes. The green bars are much higher, indicating the audio decode thread is churning a lot harder to keep our buffers filled. Last but not least the gray bars at the bottom indicate more time spent in demuxing, drawing, etc on the main thread.

On this much slower processor, pushing audio decoding to another core makes a significant impact, saving an average of several milliseconds per frame by letting it overlap with video decoding.

The gray spikes from the main thread are from the demuxer, and after investigation they turn out to be inflated by per-packet overhead on the tiny Vorbis audio packets, such as adding timestamps to many of the packets. Ogg packs multiple small packets together into a single “page”, with only the final packet at the end of the page actually carrying a timestamp. Currently I’m using liboggz to encapsulate the demuxing, using its option to automatically calculate the missing timestamp deltas from header data in the packets… But this means every few frames the demuxer suddenly releases a burst of tiny packets, causing a 15-50ms delay on the main thread as it walks through them. On the slow end this can push a nearly late frame into late territory.

I may have further optimizations to make in keeping the main thread clear on slower CPUs, such as more efficient handling of download progress events, but overlapping the video and audio decode threads helps a lot.

On other machines like slow Windows boxes with blacklisted graphics drivers, we also benefit from firing off the next video decode before drawing the current frame — if WebGL is unexpectedly slow, or we fall back to CPU drawing, it may take a significant portion of our frame budget just to paint. Sending data down to the decode thread first means it’s more likely that the drawing won’t actually slow us down as much. This works wonders on a slow ARM-based Windows RT 8.1 Surface tablet. :)

Thoughts on Ogg adaptive streaming

So I’d like to use adaptive streaming for video playback on Wikipedia and Wikimedia Commons, automatically selecting the appropriate source format and resolution at runtime based on bandwidth and CPU availability.

For Safari, Edge, and IE users, that means figuring out how to rig a Media Source Extensions-like interface into ogv.js to let the streaming handler inject its buffered data into the demuxer and codecs instead of letting the player handle its own buffering.

It also means I have to figure out how to do adaptive stream switching for Ogg streams and Theora video, since WebM VP8 still decodes too slowly in ogv.js to rely on for deployment…

Theory vs Theora

At its base, adaptive streaming relies on the ability to feed the decoders with data from another stream without them freaking out and demanding a pause or reset. We can either read packets from a subset of a monolithic file for each source, or from a bunch of tiny segmented files.

In order to do this, generally you need to switch on video keyframe boundaries: each keyframe represents a point in the data stream where the video decoder can reset its state.

For WebM with VP8 and VP9 codecs, the decoders are pretty good at this. As long as you came in on a keyframe boundary you can just start feeding it packets at a new resolution and it’ll happily output frames at the new resolution.

For Ogg Theora, there are a few major impediments.

Ogg stream serial numbers

At the Ogg stream level: each Ogg logical bitstream gets a random serial number; those serial numbers will not match across separate encodings at different resolutions.

Ogg explicitly allows for “chaining” of complete bitstreams, where one ends and you just tack another on, but we’re not quite doing that here… We want to be able to switch partway through with minimal interruption.

For Vorbis audio, this might require some work if pulling audio+video together from combined .ogv files, but it gets simpler if there’s one .oga audio stream and separate video-only .ogv streams — we’d essentially have separate demuxer contexts for audio and video, and would not need to meddle with the audio.

For the Theora video stream this is probably ok too, since when we reach a switch boundary we also need to feed the decoder with…

Header packets

Every Theora video stream sets up start codes at the beginning of the stream in its three header packets. This means that encodings of the same video at different resolutions will have different header setup.

So, when we switch sources we’ll need to reinitialize the Theora decoder with the header packets from the target stream; then it should be safe to feed new packets into it from our arbitrary start position.

This isn’t a super exotic requirement; I’ve seen some provision for ‘start codes’ for MP4 adaptive streaming too.

Keyframe timing

More worrisome is that keyframe timing is not predictable in a Theora stream. This is actually due to the libtheora encoder internals — it allows you to specify a maximum keyframe interval, but it may decide at any time to insert a keyframe on its own if it thinks it’s more efficient to store a frame that way, at which point the interval starts counting from there instead of the last scheduled keyframe.

Since this heuristic is determined based on actual frame data, the early keyframes will appear at different times and places for renderings at different resolutions… and so will every keyframe following them.

This means you don’t have switch points that are consistent between sources, breaking the whole model!

It looks like a keyframe can be forced by changing the keyframe interval to 1 right before a desired keyframe, then changing it back to the desired value after. This would still result in getting some early keyframes at unpredictable times, but then also getting predictable ones. As long as the switchover points aren’t too frequent, that’s probably fine — just keep decoding over the extra keyframes, but only switch/segment on the predictable ones.

Streams vs split files

Another note: it’s possible to either store data as one long file per source, or to split it up into small chunk files at each keyframe boundary.

Chunk files are nice because they can be streamed easily without use of the HTTP ‘Range’ header and they’re friendly to cache layers. Long files can be easier to manage on the server, but Wikimedia ops folks have told me that the way large files are stored doesn’t always interact ideally with the caching layer and they’d be much happier with split chunk files!

A downside of chunks is that it’s harder to download a complete copy of a file at a given resolution for offline playback. But, if we split audio and video tracks we’re in a world where that’s hard anyway… Can either just say “download the full resolution source then!” Or provide a remuxer to produce combined files for download on the fly from the chunks… :)

The keyframe timing seems the ugliest issue to deal with; I may need to patch ffmpeg2theora or ffmpeg to work around it, but at least I shouldn’t have to mess with libtheora itself…