VP9 decoder hotspots in asm.js: multiplication

When using ogv.js in the old IE 11 browser to play VP9 video, one of the biggest CPU hotspots is the vpx_convolve8_c function, which applies pixel filtering in a target area.

The C looks like:

All the heavy lifting being in those two functions convolve_horiz and convolve_vert:

It looks a little oversimple with tight loops and function calls, but everything is inlined aggressively by the compiler and the inner loop unrolled. The asm.js looks something like:

(Note some long lines are cut off in the unrolled loop.)

This seems fairly optimal, but know that those multiplications are slow — they’ll be floating-point multiplication because the semantics of JavaScript’s floating-point multiply operator don’t lend themselves well to automatic consolidation into integer multiplication. And it’s already an optimization that it’s doing _that_!

Normally emscripten actually doesn’t emit a multiply operator for an integer multiplication like this — it instead emits a call to Math.imul which implements 32-bit integer multiplication correctly and, when implemented, quickly. But in IE 11 there’s no native Math.imul instruction because it’s older than that addition to the JavaScript standard…

The emscripten compiler can provide an emulated replacement for Math.imul when using the LEGACY_VM_SUPPORT option, but it’s very slow — a function call, two multiplications, some bit-shifts, and addition.

Since I know (hope?) the multiplications inside libvpx never overflow, I run a post-processing pass on the JavaScript that replaces the very slow Math.imul calls with only moderately slow floating-point multiplications. This made a significant difference to total speed, something like 10-15%, when I added it to our VP8 and VP9 decoding.

Unfortunately optimizing it further looks tricky without SIMD optimizations. The native builds of these libraries make aggressive use of SIMD (single-instruction-multiple-data) to apply these filtering steps to several pixels at once, and it makes a huge improvement to throughput.

There has been experimentation for some time in SIMD support for asm.js, which seems to be being dropped now in favor of moving it directly into WebAssembly. If/when this eventually arrives in Safari it’ll be a big improvement there — but IE 11 will never update, being frozen in time.

Revisiting AVSampleBufferDisplayLayer on iOS 11

When I tried using iOS’s AVSampleBufferDisplayLayer in OGVKit last year, I had a few problems. Most notably a rendering bug on one device, and inability to display frames with 4:4:4 subsampling (higher chroma quality).

Since the rendering path I used instead is OpenGL-based, and OpenGL is being deprecated in iOS 12… figured it might be worth another look rather than converting the shader and rendering code to Metal.

The rendering bug that was striking at 360p on some devices was fixed by Apple in iOS 11 (thanks all!), so that’s a nice improvement!

Fiddling around with the available pixel formats, I found that while 4:4:4 YCbCr still doesn’t work, 4:4:4:4 AYCbCr does work in iOS 11 and iOS 12 beta!

First I swapped out the pixel format without adjusting my data format conversions, which produced a naturally corrupt image:

Then I switched up my SIMD-accelerated sample format conversion to combine the three planes of data and a fourth fixed alpha value, but messed it up and got this:

Turned out to be that I’d only filled out 4 items of a 16×8 vector literal. Durrrr. :) Fixed that and it’s now preeeeetty:

(The sample video is a conversion of a silly GIF, with lots of colored text and sharp edges that degrade visibly at 4:2:0 and 4:2:2 subsampling. I believe the name of the original file was “yo-im-hacking-furiously-dude.gif”)

A remaining downside is that on my old 32-bit devices stuck on iOS 9 and iOS 10, the 4:2:2 and 4:4:4 samples don’t play. So either I need to keep the OpenGL code path for them, or just document it won’t work, or maybe do a runtime check and downsample to 4:2:0.

Mobile video encoding tradeoffs

I spent a little time yesterday and today poking at an old project to encode video into the WebM format we use at Wikipedia on your iPhone or iPad so you could, potentially, take video and upload it directly.

The default encoding settings were mmmuuuucccchhhh tttoooooo ssslllooowww to be practical, so I had to tune it to use a faster configuration. But to maintain quality, you have to bump up the bitrate (and thus file size) significantly.

This same tradeoff is made in the hardware video encoders in the device, too! When you’re making a regular camera recording the bitrate is actually several times higher than it would be on a typical download/stream of the same video from YouTube/Netflix/etc. You just don’t have the luxury of the extra encoding time on a modest mobile chip, especially not if you’re recording live.