Peeking into VP8 video decoding performance

The ogv.js distribution includes Ogg Theora video decoding, which we use on Wikipedia to play back our media files in Safari, IE, and Edge; it also has an experimental mode for WebM files with VP8 video.

WebM/VP8 is better supported by many tools, and provides higher quality video at lower bandwidth usage than Ogg Theora… There are two major reasons I haven’t taken WebM out of “experimental” status:

  1. The demuxer library (nestegg) was hacked in quickly, and I still need to add support for seeking, “not crashing”, etc.
  2. Decoding VP8 video is several times slower than decoding Theora at the same resolution.

The first is a simple engineering problem; I can either hack up nestegg or replace it with something that better fits the I/O model I’m working with.

The second is an intrinsic complexity problem: although the two formats are broadly similar technologies, VP8 requires more computation to decode a frame than Theora does.

Add to this the fact that the browser environment limits some of the common CPU optimizations the C libraries rely on for parallelizable code:

  • JavaScript “Worker” threads are different from the low-level pthreads threading model used by C code, and the interfaces required to more closely emulate it are not yet available in Safari or Edge.
  • SIMD (“Single Instruction, Multiple Data”) processing is not available in Safari, and not yet production-enabled in Edge.

So, if we can’t use SIMD instructions to parallelize tiny bits of processing, and we can’t simply crank up multithreading to use a now-ubiquitous second CPU core, how can we split up the work?

The first step, I think, is to identify the data boundaries where we might hope to export data from the libvpx library before it has fully finished processing a frame.

What’s what

The VP8 decoder and bitstream format are defined in RFC 6386; decoding can be roughly divided into four stages:

  1. decode the input stream’s entropy encoding
  2. reconstruct base signal by applying motion vectors against a reference frame
  3. reconstruct residual signal by applying inverse DCT
  4. apply loop filter

The entropy decoding isn’t really a separate stage; it happens as other data gets read in and fed into each of the other stages. It’s not really parallelizable itself either.

Aiming high

So before I go researching how to optimize any of these steps, what’s actually the biggest, slowest stuff to work on?

I did a quick profiling run playing back some 480p WebM in Chrome (it has a pretty good compiler, and its profiling tools can profile just one Worker thread, which isolates the VP8 decoder nicely). I broke down the resulting function self-time list by rough category and found:

Filter: 54.40%
Motion: 22.75%
IDCT: 10.55%
Other: 12.31%

(“Other” included any function I didn’t tag, as well as general overhead.)
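
As a rough sketch of how such a tally could be automated (this isn’t the exact process I used; it assumes the profile has been exported as a flat list of { functionName, selfTime } entries, and the name patterns only approximate libvpx’s function names):

    // Sketch: bucket profiler self-times by rough decoder stage.
    var categories = {
      Filter: /loop_filter/,
      Motion: /sixtap|bilinear|copy_mem|build_inter/,
      IDCT: /idct|dequant/
    };

    function tally(entries) {
      var totals = { Filter: 0, Motion: 0, IDCT: 0, Other: 0 };
      var grandTotal = 0;

      entries.forEach(function (entry) {
        var bucket = 'Other';
        Object.keys(categories).forEach(function (name) {
          if (categories[name].test(entry.functionName)) {
            bucket = name;
          }
        });
        totals[bucket] += entry.selfTime;
        grandTotal += entry.selfTime;
      });

      Object.keys(totals).forEach(function (name) {
        console.log(name + ': ' + (100 * totals[name] / grandTotal).toFixed(2) + '%');
      });
      return totals;
    }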

Ouch! Filtering looks like a good first candidate to split out into a separate processing step.

Possible directions – staying on CPU

If we stick with the CPU, we can create further Worker threads and send them blocks of data to be filtered. In many cases, even when processing is hard to parallelize from one macroblock to the next because of data dependencies, there is natural parallelism in the color system: the Y (“luma”, or brightness) plane can be processed by one thread while the U and V (“chroma”, or color) planes are processed independently by another.

However, splitting processing between luma and chroma has a limited benefit: there’s twice as much luma data as chroma, so the luma thread still does two-thirds of the work and you save at most about 33% of the runtime here.
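
A minimal sketch of that split, assuming a hypothetical filter-worker.js and a made-up message shape (the real thing would also need to collect the filtered planes back and reassemble the frame):

    var lumaWorker = new Worker('filter-worker.js');   // hypothetical worker script
    var chromaWorker = new Worker('filter-worker.js');

    function filterFrame(frame) {
      // Transfer the underlying buffers instead of copying them.
      lumaWorker.postMessage(
        { planes: ['Y'], buffers: [frame.y.buffer] },
        [frame.y.buffer]
      );
      chromaWorker.postMessage(
        { planes: ['U', 'V'], buffers: [frame.u.buffer, frame.v.buffer] },
        [frame.u.buffer, frame.v.buffer]
      );
      // ...then wait for both workers to post their filtered planes back.
    }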

Macroblocks and subblocks

VP8 divides most processing & data into “macroblocks” of 16×16 pixels, composed of 24 “subblocks” of 4×4 pixels (that’s 16 subblocks of luma data, plus 4 each from the two lower-resolution chroma planes).
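
Spelled out as arithmetic:

    // Per 16x16 macroblock with 4:2:0 subsampling:
    var lumaSubblocks = (16 / 4) * (16 / 4);      // 16 subblocks of 4x4 luma
    var chromaSubblocks = 2 * (8 / 4) * (8 / 4);  // 4 each from the 8x8 U and V blocks
    console.log(lumaSubblocks + chromaSubblocks); // 24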

In many cases we can parallelize at the subblock level, and 24 subblocks divide evenly across 2 cores, or even 4 or 8! But the most performance-sensitive devices usually have only 2 CPU cores available, so at best we can cut that work in half (a 50% potential speedup).

Going crazy – GPU time?

Graphics processors are much more aggressively multithreaded, with multiple cores and the ability to handle many more simultaneous operations.

If we run more operations on the GPU, we might actually have a chance to run 24 subblocks all in parallel!

However there’s a lot more fuzziness involved in getting there:

  • may have to rewrite code from C into GLSL
  • have to figure out how to operate as fragment or vertex shaders
  • have to figure out how to efficiently get data in/out (see the sketch after this list)
  • oh and you can’t use WebGL from a Worker thread, so have to bounce data up to the main thread and back down to the codec
  • once all that’s done, what’s the actual performance of the GPU vs the CPU per-operation? no idea 😀
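
To make the “data in/out” part concrete, here’s a rough sketch (not working ogv.js code) of uploading a decoded luma plane as a texture and running a trivial pass-through fragment shader over it. A real loop filter would sample and blend neighboring pixels in that shader, and getting the results back out efficiently is a whole separate problem; width, height, and lumaBytes (a Uint8Array of width × height samples) are assumed to come from the decoder:

    var canvas = document.createElement('canvas');
    canvas.width = width;
    canvas.height = height;
    var gl = canvas.getContext('webgl');

    function compileShader(type, source) {
      var shader = gl.createShader(type);
      gl.shaderSource(shader, source);
      gl.compileShader(shader);
      if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
        throw new Error(gl.getShaderInfoLog(shader));
      }
      return shader;
    }

    var program = gl.createProgram();
    gl.attachShader(program, compileShader(gl.VERTEX_SHADER, [
      'attribute vec2 aPosition;',
      'varying vec2 vTexCoord;',
      'void main() {',
      '  vTexCoord = (aPosition + 1.0) * 0.5;',
      '  gl_Position = vec4(aPosition, 0.0, 1.0);',
      '}'
    ].join('\n')));
    gl.attachShader(program, compileShader(gl.FRAGMENT_SHADER, [
      'precision mediump float;',
      'uniform sampler2D uPlane;',
      'varying vec2 vTexCoord;',
      'void main() {',
      '  // A real loop filter would sample neighboring pixels and blend here;',
      '  // this just copies the input through.',
      '  gl_FragColor = texture2D(uPlane, vTexCoord);',
      '}'
    ].join('\n')));
    gl.linkProgram(program);
    gl.useProgram(program);

    // Upload the luma plane as a one-channel texture.
    var texture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.pixelStorei(gl.UNPACK_ALIGNMENT, 1);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.LUMINANCE, width, height, 0,
                  gl.LUMINANCE, gl.UNSIGNED_BYTE, lumaBytes);

    // Draw a full-screen quad so the fragment shader runs once per output pixel.
    var quad = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, quad);
    gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([
      -1, -1,   1, -1,  -1,  1,
      -1,  1,   1, -1,   1,  1
    ]), gl.STATIC_DRAW);
    var aPosition = gl.getAttribLocation(program, 'aPosition');
    gl.enableVertexAttribArray(aPosition);
    gl.vertexAttribPointer(aPosition, 2, gl.FLOAT, false, 0, 0);
    gl.viewport(0, 0, width, height);
    gl.drawArrays(gl.TRIANGLES, 0, 6);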

So, there would seem to be great scaling potential on the GPU side, but there are a lot of unknowns.

Worth investigating but we’ll see what’s really feasible…

 

1000fps no more

Different media file formats encode things like time and frame rates differently… or not at all.
 
WebM doesn’t list a frame rate; each frame is simply given a position in time. Meanwhile the older Ogg Theora codec defines a consistent, pre-defined frame rate for a stream, but allows frames to be declared as duplicates of the previous frame as an optimization.
 
At the intersection of these two, some files auto-converted from WebM to Ogg on Wikimedia’s servers end up claiming to encode a “1000 fps” video stream, where nearly all the frames are dupes and there are really only ~25-30, or at most 60, actual frames per second.
 
I had to put a hack into my ogv.js player to handle these, because actually trying to draw 1000 frames per second was kind of slow. 😉
https://github.com/brion/ogv.js/issues/349
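
The workaround boils down to something like this sketch (not the actual patch; the identifiers are made up). In Ogg Theora a duplicate frame arrives as a zero-length packet, so there’s no need to run the decoder or repaint for it:

    function onVideoPacket(packet, timestamp) {
      if (packet.byteLength === 0) {
        // Duplicate of the previous frame: keep showing what's on screen
        // and just advance the clock instead of decoding and drawing again.
        currentVideoTime = timestamp;
        return;
      }
      videoDecoder.processFrame(packet, function (ok) {
        if (ok) {
          drawFrame(videoDecoder.frameBuffer);
          currentVideoTime = timestamp;
        }
      });
    }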

ogv.js 1.1.0 alpha now on npm

ogv.js 1.1.0-alpha.0 is now available for download from npm.

Big thanks to Stephan Hesse who retooled large chunks of the build system using webpack, which brought us a lot closer to the npm package release.

ogv.js 1.1.0 is a drop-in update to 1.0; many internal classes are no longer leaked into the global namespace, but the public OGV* classes remain as they are.
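
Typical usage should keep working unchanged; roughly like this (the file URL is just an example, and OGVPlayer aims to behave like an HTMLMediaElement):

    // Load ogv.js via a <script> tag or the npm package so OGVPlayer is in scope.
    var player = new OGVPlayer();
    player.src = 'example.ogv'; // example URL
    document.body.appendChild(player);
    player.addEventListener('ended', function () {
      console.log('playback finished');
    });
    player.play();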

The internal AudioFeeder class is also available as a separate audio-feeder package on npm; more internal classes will follow, including the streaming URL reader and the WebGL-accelerated YUV canvas.

In addition to internal/build changes, this release has major fixes for seeking in Ogg files, implements the volume property, and adds support for more properties and events (not yet 100% up to spec, but closer).

After a few more days shaking this out I’ll push it up to MediaWiki’s TimedMediaHandler extension, where it’ll make it to Wikipedia and Wikimedia Commons.

Windows 10 Insider build 14316 supports VP9 & Opus in WebM (sorta)

On the ‘Insider’ build 14316 of Windows 10, WebM VP9+Opus video files can be played in Microsoft Edge! Well, indirectly via Media Source Extensions, but still. 🙂

Try it out! https://brionv.com/misc/msetest/ (primitive demo)

You may have to manually enable VP9 in about:flags, however, as it defaults to on only when hardware-accelerated, and I have no idea which GPU drivers support that on Windows yet.

(Screenshot: the VP9 setting in Edge’s about:flags.)

It’d be great if HTMLMediaElement’s and MediaSource’s methods for checking playback support could also report hardware-acceleration status. Especially on mobile, it’s often preferable to use a hardware-accelerated codec (like H.264) to save battery, even if you would get better visual quality from a more advanced codec (like VP9).
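
For comparison, here’s all the existing checks give you today: a bare boolean or a “maybe”/“probably” string, with no hint about hardware acceleration:

    // Media Source Extensions path:
    var mseSupported = window.MediaSource &&
        MediaSource.isTypeSupported('video/webm; codecs="vp9, opus"');

    // Plain <video> element path:
    var video = document.createElement('video');
    var tagSupported = video.canPlayType('video/webm; codecs="vp9, opus"');

    console.log(mseSupported);  // true or false
    console.log(tagSupported);  // "", "maybe", or "probably"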

On the other hand, in Wikipedia’s case we don’t use H.264 because of patent licensing issues, so our alternative to native software VP9 decoding is JavaScript Theora or VP8 decoding, which is going to be much harder on the CPU than nicely tuned native VP9 code.

Alliance for Open Media code drop & more hardware partners

Very exciting! The new video codec so far is mostly based on Google’s in-development VP10 (the next generation of the VP8/VP9 used in WebM), but it is being co-developed and supported by a number of other organizations:

  • CPU/GPU/SoC makers: Intel, AMD, ARM, NVidia
  • OS & machine makers: Google, Microsoft, Cisco
  • Browser makers: Mozilla, Google, Microsoft
  • Content farms: Netflix, Google (YouTube)

Microsoft is also actively working on VP8/VP9 support for Windows 10, with some limited compatibility in preview releases.

As always, Apple remains conspicuously absent. 🙁

Like the earlier VP8/VP9, the patent licenses are open and don’t have the kind of weird clauses that have tripped up MPEG LA’s H.264 and HEVC/H.265 in some quarters. (*cough* Linux *cough* Wikipedia)

Totally trying to figure out how we can get involved at this stage; making sure I can build the codec in my iOS app and JavaScript shim environments will be a great start!

Web Modules for JS + asset files: why don’t you exist yet?

Ugh, there really needs to be a sane way to package a “web” module that includes both JavaScript code and other assets (images, fonts, styles, Flash shims, and other horrors).

I’m cleaning up some code from my ogv.js JavaScript audio/video player, including breaking out the audio output and streaming URL input as separate modules for reuse… The audio output class is mostly a bit of JavaScript, but includes a Flash shim for IE 10/11 which requires bundling a .swf file and making it web-accessible.

Browserify can package up a virtual filesystem into your JS code (OK for loading a WebGL shader, maybe), but it can’t expose those files to the HTML side of things where you want to reference a file by URL, especially if you don’t want it preloaded.
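
For example, with Browserify’s brfs transform a build-time file read gets inlined into the bundle as a string literal, which is fine for a shader but never gives you a URL you can hand to the browser (the shader path here is made up):

    var fs = require('fs');
    // brfs rewrites this call at build time, embedding the file contents
    // directly into the bundled JavaScript as a string.
    var shaderSource = fs.readFileSync(__dirname + '/shaders/yuv.frag', 'utf8');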

Bower will fetch everything from the modules you specify, but it doesn’t distinguish clearly between files that should be included in the output (or what paths will be used to reference them) and random stuff that just happens to be in the repo. The docs kinda recommend using further tools to distill an actual set of output files…

Dear LazyWeb: Is there anything else in the state of the art that can make it a little easier to bundle together this sort of module into larger library or app distributions?

Windows 10 Objective-C bridge

While I was waiting for updates to download I finally took a look at Microsoft’s Objective-C ‘bridge’ for porting iOS apps to the Windows 10 platform.

It’s technically a pretty nice system; they’re using clang to compile your Objective-C code for the Universal Windows Platform as native x86 or ARM code, with an open-source implementation of (large parts of) Apple’s Foundation and UIKit APIs on top of UWP. Since you’re building native code, you also have full access to the Windows 10 APIs if you want to throw in some #ifdefs.

I suspect there are a lot fewer ‘moving parts’ in it than were in the ‘Project Astoria’ Android bridge (which would have had to deal with Java etc), so in retrospect it’s not super surprising that they kept this one while canceling the Android bridge.

I don’t know if it’ll get used much other than for games that targeted iOS first and want to port to Xbox One and Windows tablets easily, but it’s a neat idea.

Probably tricky to get something like the Wikipedia app working since that app does lots of WebView magic etc that’s probably a bad API fit… but might be fun to play with!

Wikipedia data dumps future thoughts

There’s some talk, and the beginnings of planning work, on a major overhaul of the Wikipedia/Wikimedia data dumps process.

The basic data model for the main content dumps hasn’t changed much in the 10 years or so since I switched us from raw blobs of SQL ‘INSERT’ statements to an XML data stream in order to abstract away upcoming storage schema changes… Fields have been added over the years, and there have been some changes in how the dumps are generated to partially parallelize the process, but the core giant-XML-stream model is scaling worse and worse as our data sets continue to grow.

One possibility is to switch away from the idea of producing a single data snapshot in a single or small set of downloadable files… perhaps to a model more like a software version control system, such as git.

Specific properties that I think will help (a hypothetical client sketch follows the list):

  • the master data set can change often, in small increments (so can fetch updates frequently)
  • updates along a branch are ordered, making updates easier to reason about
  • local data set can be incrementally updated from the master data set, no matter how long between updates (so no need to re-download entire .xml.bz2 every month)
  • network protocol for updates, and access to versioned storage within the data set, can be abstracted behind a common tool or library (so you don’t have to write Yet Another Hack to seek within a compressed stream or Yet Another Bash Script to wget the latest files)
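
None of this exists yet, but to make the model concrete, a client built on those properties might look something like this (every name and URL below is hypothetical):

    // Entirely hypothetical API, purely to illustrate the incremental model.
    var dumps = require('wikimedia-dumps-client'); // does not exist (yet!)

    var local = dumps.open('/data/enwiki');            // local versioned data set
    local.pull('https://dumps.example.org/enwiki')     // fast-forward to master
      .then(function (update) {
        console.log('Updated from', update.from, 'to', update.to);
        console.log(update.changedRevisions.length, 'new revisions fetched');
      });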

Some major open questions:

  • Does this model sound useful for people actually using Wikipedia data dumps in various circumstances today?
  • What help might people need in preparing their existing tools for a switch to this kind of model?
  • Does it make sense to actually use an existing VCS such as git itself? Or are there good reasons to make something bespoke that’s better-optimized for the use case or easier to embed in more complex cross-platform tools?
  • When dealing with data objects removed from the wiki databases for copyright/privacy/legal issues, does this have implications for the data model and network protocol?
    • git’s crypto-hash-based versioning tree may be tricky here
    • Do we need a way both to handle “fast-forward” updates of local to master and to be able to revert back locally (e.g. to compare current and old revisions)?
  • Technical issues in updating the master VCS from the live wikis?
    • Push updates immediately from MediaWiki hook points, or use an indirect notify+pull?
    • Does RCStream expose enough events and data already for the latter, or is something else needed to ‘push’?
    • Can update jobs for individual revisions be efficient enough or do we need more batching?

Things to think about!

Wikimedia Foundation update

Heads-up San Francisco peeps — I’ll be heading into town next week to help my fellow Wikimedia Foundation folks talk and process and plan and generally help turn what’s been an unfortunate leadership and communication crisis into a chance to really make improvements. I’ve been really impressed and inspired by the mails and posts and discussions I’ve seen internally and externally, and I’m really proud of the maturity and understanding you have all shown. Wikimedians and staffers of all stripes, y’all are awesome and we’re going to come through this stronger, both within the company and in the broader movement community.