ogv.js experimental AV1 decoding

The upcoming ogv.js 1.6.0 release will be the first to include experimental AV1 support, using the dav1d decoder. Thanks to ePirat for the initial work on cross-compiling the dav1d codebase with emscripten!

Performance is not great yet, but it may improve a bit from optimizations, and potentially a lot from new platform features that may come to WebAssembly in the future.

In particular, on Internet Explorer, which lacks WebAssembly, performance is very poor, but playback does work at very low resolutions on reasonably fast machines.

On my 2015 MacBook Pro (3.1 GHz 5th-gen Core i7), I can get somewhere between 360p and 480p on the “Caminandes – Llamigos” demo in Safari, while the current VP9 codec gives me 720p.

Safari has a great WebAssembly engine, giving 720p for VP9 or a solid 360p for AV1. 480p AV1 would be achievable with threading.

In IE 11, high-motion scenes in AV1 max out the CPU at only 120p, while VP9 gets away with 240p or so.

IE 11 runs several resolution steps lower, limited by its slow JavaScript engine. It will never get faster; we can only hope it will gradually be replaced.

Multithreaded WebAssembly builds are also included, thanks to emscripten fixing support for modularized threaded programs in 1.38.27. These, however, do not work in Safari, because it has not yet added back SharedArrayBuffer support after it was removed as part of Spectre mitigations.
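For anyone wondering how a page might tell whether a threaded build has a chance of running, here is a minimal sketch of a feature check. The helper name `canUseWasmThreads` is made up for illustration and is not part of ogv.js; it only assumes the standard `SharedArrayBuffer` global and the `shared` option of `WebAssembly.Memory`.

```typescript
// Hypothetical helper (not an ogv.js API): guesses whether the environment
// can run a threaded WebAssembly build.
function canUseWasmThreads(env: any = globalThis): boolean {
  // No WebAssembly at all, as in IE 11.
  if (typeof env.WebAssembly !== "object" || env.WebAssembly === null) {
    return false;
  }
  // SharedArrayBuffer is missing where it was removed for Spectre mitigations.
  if (typeof env.SharedArrayBuffer !== "function") {
    return false;
  }
  try {
    // Threaded builds need a shared memory; engines without thread support
    // throw here or hand back a non-shared buffer.
    const memory = new env.WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
    return memory.buffer instanceof env.SharedArrayBuffer;
  } catch {
    return false;
  }
}
```

A page could call `canUseWasmThreads()` once at startup and fall back to the single-threaded build when it returns `false`.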

You can test the threaded builds in Chrome and Firefox with suitable flags enabled (“Wasm threading” for Chrome and “shared memory” for Firefox). VP9 scales well to 2 or 4 threads depending on the resolution, and AV1 scales to 2-ish threads. I will continue to tune and work on this for the future day when Safari supports threading.
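To make the scaling numbers concrete, here is a rough sketch of how a player could pick a thread count from them. Everything in it is an assumption for illustration: `pickThreadCount` is a hypothetical function, not how ogv.js actually chooses, and the 720-line cutoff for using 4 VP9 threads is a guess, since the figures above only say "depending on the resolution".

```typescript
type Codec = "vp9" | "av1";

// Hypothetical heuristic, not ogv.js's real logic.
function pickThreadCount(codec: Codec, frameHeight: number, availableCores: number): number {
  // Never go below one thread, even if the core count is unknown or odd.
  const cores = Math.max(1, Math.floor(availableCores));
  // AV1 scales to about 2 threads; VP9 to 2 or 4 depending on resolution.
  // The 720-line threshold is an assumed cutoff, not a measured one.
  const usefulThreads = codec === "av1" ? 2 : frameHeight >= 720 ? 4 : 2;
  return Math.min(cores, usefulThreads);
}
```

In a browser, `navigator.hardwareConcurrency` could supply `availableCores`, for example `pickThreadCount("vp9", 720, navigator.hardwareConcurrency)`.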

Another area where WebAssembly falls short is its lack of SIMD instructions: in many places there are tight loops of arithmetic that can be parallelized with vector computation, and native builds of the decoders make extensive use of SIMD. There is some experimental support in some browsers and in emscripten, but I’m not sure how well they talk to each other or how finalized the standard is, so I haven’t tried it.

(It’s conceivable that browser engines could auto-vectorize tight loops in WebAssembly but they would probably be limited to 32-bit arithmetic at best, which wouldn’t parallelize as much as things that can work with 16-bit or 8-bit wide lanes.)
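The lane-width point is easy to see with a little arithmetic. The sketch below assumes 128-bit vectors, which is an assumption on my part about the hardware and proposal in question, and the function names and the 64×64 block size are invented for the example. It counts how many vector operations it takes to touch every value in a block at different lane widths.

```typescript
// Assumption: vectors are 128 bits wide.
const VECTOR_BITS = 128;

// How many values fit side by side in one vector at a given lane width.
function lanesPerVector(laneBits: number, vectorBits: number = VECTOR_BITS): number {
  if (laneBits <= 0 || vectorBits % laneBits !== 0) {
    throw new RangeError("lane width must evenly divide the vector width");
  }
  return vectorBits / laneBits;
}

// How many vector operations one pass over `valueCount` values needs.
function vectorOpsNeeded(valueCount: number, laneBits: number): number {
  return Math.ceil(valueCount / lanesPerVector(laneBits));
}

// Illustrative 64x64 block of samples.
const blockValues = 64 * 64;
const opsWith8BitLanes = vectorOpsNeeded(blockValues, 8);   // 16 lanes per vector
const opsWith32BitLanes = vectorOpsNeeded(blockValues, 32); // 4 lanes per vector
```

Under these assumptions, the same pass takes four times as many operations with 32-bit lanes as with 8-bit lanes, which is why auto-vectorization limited to 32-bit arithmetic would leave a lot on the table.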