ogv.js 1.6.0 released with experimental AV1 decoding

After some additional fixes and experiments I’ve tagged ogv.js 1.6.0 and released it. As usual you can use the ‘ogv’ package on npm or fetch the zip manually. This release includes various fixes (including for some weird bugs!) and performance improvements on lower-end machines. Internals have been partially refactored to aid future maintenance, and experimental AV1 decoding has been added using VideoLAN’s dav1d decoder.
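If you haven’t used it before, basic setup looks roughly like this (a minimal sketch; the exact import form and resource paths depend on your bundler setup, so check the package README):

```js
// Minimal usage sketch; paths are placeholders and the import form
// may differ depending on how you consume the 'ogv' package.
import { OGVLoader, OGVPlayer } from 'ogv';

// Point the loader at wherever the codec .js/.wasm files were copied.
OGVLoader.base = '/assets/ogv';

// OGVPlayer behaves roughly like an HTML media element.
const player = new OGVPlayer();
player.src = '/media/example.webm';
document.body.appendChild(player);
player.play();
```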

dav1d and SIMD

The dav1d AV1 decoder is now working pretty solidly, but slowly. I found that my test files were encoded at too high a quality; after dialing them back to my actual target bitrate, performance improved noticeably. So hey! Not bad. ;)

I’ve worked around a minor compiler issue in emscripten’s old “fastcomp” asm.js->wasm backend where an inner loop didn’t get unrolled, which improves decode performance by a couple percent. Upstream prefers to let the unroll be implicit, so I’m keeping this patch in a local fork for now.

Some folks working on the WebAssembly SIMD proposal have also reached out to me; it should allow speeding up some of the slow filtering operations with optimized vector code! The only browser implementation of the proposal (which remains a bit controversial) is currently Chrome, behind an experimental command-line flag, and the updated vectorization code is in the new WebAssembly compiler backend that’s integrated with upstream LLVM.

So I spent some time getting up and running on the new LLVM backend for emscripten, and found a few issues:

  • emsdk doesn’t update the LLVM download properly so you can get stuck on an old version and be very confused — this is being fixed shortly!
  • currently it’s hard to use a single sdk installation for both modes at once, and asm.js compilation requires the old backend. So I’ve temporarily disabled the asm.js builds on my simd2 work branch.
  • multithreaded builds are broken at the moment (at least modularized ones, which only just got fixed in the main compiler, so they might need additional fixes for the LLVM backend)
  • use of atomics intrinsics in a non-multithreaded build results in a validation error, whereas the old backend silently turned them into something safe. I had to patch dav1d with a “fake atomics” option to #define them away.
  • Non-SIMD builds were coming out with data corruption, which I tracked down to an optimizer bug which had just been fixed upstream the day before I reported it. ;)
  • I haven’t gotten as far as working with any of the SIMD intrinsics, because I’m getting a memory access out of bounds error when engaging the autovectorizer. I narrowed down a test case for the first error and have reported it; not sure yet whether the problem is in the compiler or in Chrome/V8.

In theory autovectorization isn’t likely to do much on its own; significant gains could be made using intrinsics, but only up to a point, since the available operations are limited and it’s hard to tell in advance what will be efficient.

Intermittent file breakages

Very rarely, some files would just break at a certain position for reasons I couldn’t explain. One turned up during AV1 testing: a particular video packet containing two frame OBUs had the first OBU header appear fine while the second was obviously corrupted. I tracked the corruption back from the codec to the demuxer, to the demuxer’s input buffer, and finally to my StreamFile abstraction used for loading data from a seekable URL.

It turned out that the offending packet straddled a boundary between HTTP requests: between the second and third megabytes of the file, each requested as a separate Range-based XMLHttpRequest and downloaded as a binary string so the data could be accessed during progress events. According to the network panel, the second and third megabytes looked fine… but the *following* request turned up as only 512 KiB. …What?
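In rough outline, each chunk was being fetched something like this (a simplified sketch, not the actual StreamFile code):

```js
// Simplified sketch of fetching one chunk as a binary string
// (illustrative only; not the actual StreamFile implementation).
function fetchChunkAsBinaryString(url, start, end, onprogress, ondone) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  // Ask for just this slice of the file.
  xhr.setRequestHeader('Range', 'bytes=' + start + '-' + end);
  // Treat the response as a "binary string": one character per byte.
  xhr.overrideMimeType('text/plain; charset=x-user-defined');
  xhr.onprogress = () => {
    // responseText can be read before the download completes,
    // which is what lets the player consume data incrementally.
    onprogress(xhr.responseText);
  };
  xhr.onload = () => ondone(xhr.responseText);
  xhr.send();
}
```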

Dumping the binary strings of the second and third megabytes, I immediately realized what was wrong:

[Screenshot of the dumped strings: “Enjoy some tasty binary strings!”]

The first requests showed 8-bit characters as expected (ASCII, control characters, and so on). The request containing the broken packet showed CJK characters, indicating the string had probably been misinterpreted as UTF-16.

It didn’t take much longer to confirm that the first two bytes of the broken request were 0xFE 0xFF, a UTF-16 Byte Order Mark. That apparently overrides the “overrideMimeType” method’s x-user-defined charset, and there’s no way to override it back. Hypothetically you could probably detect the case and swap the bytes back, but I don’t think it’s actually worth it to do full streaming reads within chunks for the player; it’s better to buffer ahead so playback stays reliable.
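If you really wanted to keep the streaming binary-string path, detection might look something like this (purely hypothetical, and it assumes the UTF-16 decode was lossless, which it may not be for arbitrary binary data; that’s part of why it isn’t worth pursuing):

```js
// Hypothetical recovery sketch; not code that shipped.
// Assumes the browser decoded the chunk as big-endian UTF-16 because of
// the 0xFE 0xFF BOM. Invalid surrogate sequences would already have been
// replaced with U+FFFD, so this cannot recover every possible file.
function recoverBytes(str) {
  if (str.length > 0 && str.charCodeAt(0) === 0xfeff) {
    // UTF-16 interpretation: each code unit packs two original bytes.
    const bytes = new Uint8Array((str.length - 1) * 2);
    for (let i = 1; i < str.length; i++) {
      const code = str.charCodeAt(i);
      bytes[(i - 1) * 2] = code >> 8;       // high byte (big-endian order)
      bytes[(i - 1) * 2 + 1] = code & 0xff; // low byte
    }
    return bytes;
  }
  // Expected x-user-defined case: one byte per character, in the low 8 bits.
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i) & 0xff;
  }
  return bytes;
}
```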

For now I’ve switched it to use ArrayBuffer XHRs instead of binary strings, which avoids the encoding problem but means data can’t be accessed until each chunk has finished downloading.
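In sketch form, the replacement chunk request looks something like this (simplified; the real StreamFile code is more involved):

```js
// Sketch of the ArrayBuffer-based chunk request (simplified).
function fetchChunkAsArrayBuffer(url, start, end) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.setRequestHeader('Range', 'bytes=' + start + '-' + end);
    // No string decoding, so no charset surprises...
    xhr.responseType = 'arraybuffer';
    // ...but the response is only available once the whole chunk arrives.
    xhr.onload = () => resolve(new Uint8Array(xhr.response));
    xhr.onerror = () => reject(new Error('chunk download failed'));
    xhr.send();
  });
}
```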

ogv.js experimental AV1 decoding

The upcoming ogv.js 1.6.0 release will be the first to include experimental AV1 support, using the dav1d decoder. Thanks to ePirat for the initial work in emscripten cross-compiling the dav1d codebase!

Performance is not great yet, but it may improve a bit from further optimization and potentially a lot from new platform features that may come to WebAssembly.

On Internet Explorer in particular, which lacks WebAssembly, performance is very poor, but playback does work at very low resolutions on reasonably fast machines.

On my 2015 MacBook Pro (3.1 GHz 5th-gen Core i7), I can get somewhere between 360p and 480p on the “Caminandes – Llamigos” demo in Safari, while the current VP9 codec gives me 720p.

Safari has a great WebAssembly engine, giving 720p for VP9 or a solid 360p for AV1. 480p AV1 would be achievable with threading.

In IE 11, high-motion AV1 scenes max out the CPU even at 120p, while VP9 gets away with 240p or so.

IE 11 runs several resolution steps lower, limited by its slow JavaScript engine. It will never get faster; we can only hope it will gradually be replaced.

Multithreaded WebAssembly builds are also included, thanks to emscripten fixing support for modularized threaded programs in 1.38.27. These do not work in Safari, however, because it has not yet restored SharedArrayBuffer support after removing it as part of its Spectre mitigations.

You can test the threaded builds in Chrome and Firefox with suitable flags enabled (“Wasm threading” for Chrome and “shared memory” for Firefox). VP9 scales well to 2 or 4 threads depending on the resolution, and AV1 scales to roughly 2 threads. I’ll continue to tune and work on this for the future day when Safari supports threading.
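If you’re experimenting, you can feature-detect whether the threaded builds are even worth attempting with a check along these lines (an illustrative sketch, not the exact detection ogv.js performs):

```js
// Rough check for WebAssembly threads support (illustrative only).
function canUseThreadedWasm() {
  if (typeof WebAssembly !== 'object') {
    return false; // no WebAssembly at all (e.g. IE 11)
  }
  if (typeof SharedArrayBuffer === 'undefined') {
    return false; // shared memory disabled (e.g. current Safari)
  }
  try {
    // Shared linear memory is the primitive the threaded builds need.
    new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
    return true;
  } catch (e) {
    return false;
  }
}
```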

Another area where WebAssembly doesn’t perform well is the lack of SIMD instructions: in many places there are tight loops of arithmetic that can be parallelized with vector computation, and native builds of the decoders make extensive use of SIMD. There is some experimental support in some browsers and in emscripten, but I’m not sure how well they talk to each other or how finalized the standard is, so I haven’t tried it yet.

(It’s conceivable that browser engines could auto-vectorize tight loops in WebAssembly but they would probably be limited to 32-bit arithmetic at best, which wouldn’t parallelize as much as things that can work with 16-bit or 8-bit wide lanes.)

A peek at ogv.js updates

Between other projects, I’ve done a little maintenance on our ogv.js WebM player shim for Safari and IE. This should go out next week as 1.6.0; it should remain compatible with 1.5.x but includes a lot of internal refactoring.

Experimental features

This release includes an experimental AV1 video decoder using the dav1d library, based on ePirat’s initial work getting it to build with emscripten. This is an initial test, and it still needs to be brought back up to date with upstream, optimized, etc.

Also included are multithreaded VP8 and VP9 decoders, brought back from the past thanks to emscripten landing updated support and browsers getting closer to re-enabling SharedArrayBuffer. Performance on 2-core and 4+-core machines is encouraging in Firefox and Chrome with flags enabled, but cannot yet be tested in Safari.

SD videos scale over 2 cores reasonably well, with a solid 50% or more decoding speed boost visible. HD can scale over 4 cores, with about 200% speed boost.

Threading must be manually opted in. The AV1 decoder is not yet threaded.
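Opting in would look something like this (the option name here is hypothetical; check the player documentation for the real flag):

```js
import { OGVPlayer } from 'ogv';

// Hypothetical opt-in; the actual option name may differ.
const player = new OGVPlayer({
  threading: true  // request the multithreaded VP8/VP9 builds
});
```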

Low-end performance work

In Internet Explorer 11, performance is weaker across the board than in other browsers because its JS engine is an old, frozen version that was never optimized for asm.js and has no WebAssembly support. In addition, machines still running IE 11 are probably more likely to be older, slower machines.

I did some optimization work which greatly improves perceived playback performance of a 120p WebM VP9/Opus file on my slowest test machine, a 1.67 GHz Atom netbook tablet from 2012 or so.

  • More aggressively selecting software YCbCr→RGB conversion when WebGL would be slower (see the sketch after this list)
  • Microoptimizations to conversion and extraction of YCbCr data from heap
  • Replaced timer-based audio packet consolidation with buffer-size-based one
  • Tuned audio communication with Flash to reduce the number of calls and per-call overhead
  • Increased number of buffered audio and video frames to survive short pauses better
  • Fix for dropping of individual frames when an adjacent decoded frame is available
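For context, the software conversion path boils down to a per-pixel loop along these lines (a simplified BT.601 sketch with 4:2:0 chroma; the tuned code in ogv.js uses lookup tables and other tricks):

```js
// Simplified BT.601 YCbCr -> RGBA conversion for one row of pixels.
// Illustrative only; the production path is more heavily optimized.
function convertRow(yRow, cbRow, crRow, rgba, width) {
  for (let x = 0; x < width; x++) {
    const y  = (yRow[x] - 16) * 1.164;
    const cb = cbRow[x >> 1] - 128; // 4:2:0: one chroma sample per 2 luma pixels
    const cr = crRow[x >> 1] - 128;
    let r = y + 1.596 * cr;
    let g = y - 0.392 * cb - 0.813 * cr;
    let b = y + 2.017 * cb;
    // Clamp to [0, 255] and write out RGBA.
    rgba[x * 4]     = r < 0 ? 0 : r > 255 ? 255 : r;
    rgba[x * 4 + 1] = g < 0 ? 0 : g > 255 ? 255 : g;
    rgba[x * 4 + 2] = b < 0 ? 0 : b > 255 ? 255 : b;
    rgba[x * 4 + 3] = 255;
  }
}
```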

I don’t think there’s a lot left to be done in optimizing VP9 decoding for IE; the largest components in low-resolution profiles are now things like vpx_convolve8_c, which would really benefit from faster integer multiplication and from just not being so darn slow. I tried a replacement function that micro-optimized the generated emscripten code to factor out common additions and bit-shifts, but it didn’t seem to make any difference; either that’s the one thing the IE optimizer already catches, or the bit-shifts are so cheap compared to the memory accesses and multiplications that it just makes no measurable difference. :P

Ah well! This is why I started producing files down to 120p, to handle the super-slow cases.

What’s important is making sure that audio plays smoothly, which it’s now better at than before. And in production we’ll probably add a notice when decoding is slow, recommending another browser…

Refactoring

I’ve also started refactoring work in preparation for future changes to support MSE-style buffering, needed for adaptive bitrate streaming.

Instead of a hodge-podge of closure functions and local variables, I’ve transitioned to using ES6 modules and classes, with a babel transform step for ES5 compatibility.
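As a rough illustration of the shape of the change (hypothetical names, not actual ogv.js internals):

```js
// Before: state lives in a closure, invisible to the debugger's scope view.
function makeDemuxer() {
  let bytesRead = 0;
  return {
    feed(chunk) {
      bytesRead += chunk.length;
      // ... parse packets ...
    }
  };
}

// After: an ES6 class with explicit instance state,
// transpiled with babel for ES5 browsers like IE 11.
class Demuxer {
  constructor() {
    this.bytesRead = 0; // retained state is visible on the instance
  }
  feed(chunk) {
    this.bytesRead += chunk.length;
    // ... parse packets ...
  }
}
```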

This helps distinguish retained state vs local variables, makes it easier to find state when debugging, and generally makes things easier to work with in the source. More cleanup still needs to be done in the larger processing functions and the various state vars, but it’s an improvement so far.