Revisiting AVSampleBufferDisplayLayer on iOS 11

When I tried using iOS’s AVSampleBufferDisplayLayer in OGVKit last year, I had a few problems. Most notably a rendering bug on one device, and inability to display frames with 4:4:4 subsampling (higher chroma quality).

Since the rendering path I used instead is OpenGL-based, and OpenGL is being deprecated in iOS 12… figured it might be worth another look rather than converting the shader and rendering code to Metal.

The rendering bug that was striking at 360p on some devices was fixed by Apple in iOS 11 (thanks all!), so that’s a nice improvement!

Fiddling around with the available pixel formats, I found that while 4:4:4 YCbCr still doesn’t work, 4:4:4:4 AYCbCr does work in iOS 11 and iOS 12 beta!

First I swapped out the pixel format without adjusting my data format conversions, which produced a naturally corrupt image:

Then I switched up my SIMD-accelerated sample format conversion to combine the three planes of data and a fourth fixed alpha value, but messed it up and got this:

Turned out to be that I’d only filled out 4 items of a 16×8 vector literal. Durrrr. :) Fixed that and it’s now preeeeetty:

(The sample video is a conversion of a silly GIF, with lots of colored text and sharp edges that degrade visibly at 4:2:0 and 4:2:2 subsampling. I believe the name of the original file was “yo-im-hacking-furiously-dude.gif”)

A remaining downside is that on my old 32-bit devices stuck on iOS 9 and iOS 10, the 4:2:2 and 4:4:4 samples don’t play. So either I need to keep the OpenGL code path for them, or just document it won’t work, or maybe do a runtime check and downsample to 4:2:0.

Mobile video encoding tradeoffs

I spent a little time yesterday and today poking at an old project to encode video into the WebM format we use at Wikipedia on your iPhone or iPad so you could, potentially, take video and upload it directly.

The default encoding settings were mmmuuuucccchhhh tttoooooo ssslllooowww to be practical, so I had to tune it to use a faster configuration. But to maintain quality, you have to bump up the bitrate (and thus file size) significantly.

This same tradeoff is made in the hardware video encoders in the device, too! When you’re making a regular camera recording the bitrate is actually several times higher than it would be on a typical download/stream of the same video from YouTube/Netflix/etc. You just don’t have the luxury of the extra encoding time on a modest mobile chip, especially not if you’re recording live.

Limitations of AVSampleBufferDisplayLayer on iOS

In my last post I described using AVSampleBufferDisplayLayer to output manually-uncompressed YUV video frames in an iOS app, for playing WebM and Ogg files from Wikimedia Commons. After further experimentation I’ve decided to instead stick with using OpenGL ES directly, and here’s why…

  • 640×360 output regularly displays with a weird horizontal offset corruption on iPad Pro 9.7″. Bug filed as rdar://29810344
  • Can’t get any pixel format with 4:4:4 subsampling to display. Theora and VP9 both support 4:4:4 subsampling, so that made some files unplayable.
  • Core Video pixel buffers for 4:2:2 and 4:4:4 are packed formats, and it prefers 4:2:0 to be a weird biplanar semi-packed format. This requires conversion from the planar output I already have, which may be cheap with Neon instructions but isn’t free.

Instead, I’m treating each plane as a separate one-channel grayscale image, which works for any chroma subsampling ratios. I’m using some Core Video bits (CVPixelBufferPool and CVOpenGLESTextureCache) to do texture setup instead of manually calling glTeximage2d with a raw source blob, which improves a few things:

  • Can do CPU->GPU memory copy off main thread easily, without worrying about locking my GL context.
  • No pixel format conversions, so straight memcpy for each line…
  • Buffer pools are tied to the video buffer’s format object, and get swapped out automatically when the format changes (new file, or file changes resolution).
  • Don’t have to manually account for stride != width in the texture setup!

It could be more efficient still if I could pre-allocate CVPixelBuffers with on-GPU memory and hand them to libvpx and libtheora to decode into… but they currently lack sufficient interfaces to accept frame buffers with GPU-allocated sizes.

A few other oddities I noticed:

  • The clean aperture rectangle setting doesn’t seem to be preserved when creating a CVPixelBuffer via CVPixelBufferPool; I have to re-set it when creating new buffers.
  • For grayscale buffers, the clean aperture doesn’t seem to be picked up by CVOpenGLESTextureGetCleanTexCoords. Not sure if this is only supposed to work with Y’CbCr buffer types or what… however I already have all these numbers in my format object and just pull from there. :)

I also fell down a rabbit hole researching color space issues after noticing that some of the video formats support multiple colorspace variants that may imply different RGB conversion matrices… and maybe gamma…. and what do R, G, and B mean anyway? :) Deserves another post sometime.

 

 

Canvas, Web Audio, MediaStream oh my!

I’ve often wished that for ogv.js I could send my raw video and audio output directly to a “real” <video> element for rendering instead of drawing on a <canvas> and playing sound separately to a Web Audio context.

In particular, things I want:

  • Not having to convert YUV to RGB myself
  • Not having to replicate the behavior of a <video> element’s sizing!
  • The warm fuzzy feeling of semantic correctness
  • Making use of browser extensions like control buttons for an active video element
  • Being able to use browser extensions like sending output to ChromeCast or AirPlay
  • Disabling screen dimming/lock during playback

This last is especially important for videos of non-trivial length, especially on phones which often have very aggressive screen dimming timeouts.

Well, in some browsers (Chrome and Firefox) now you can do at least some of this. :)

I’ve done a quick experiment using the <canvas> element’s captureStream() method to capture the video output — plus a capture node on the Web Audio graph — combining the two separate streams into a single MediaStream, and then piping that into a <video> for playback. Still have to do YUV to RGB conversion myself, but final output goes into an honest-to-gosh <video> element.

To my great pleasure it works! Though in Firefox I have some flickering that may be a bug, I’ll have to track it down.

Some issues:

  • Flickering on Firefox. Might just be my GPU, might be something else.
  • The <video> doesn’t have insight to things like duration, seeking, etc, so can’t rely on native controls or API of the <video> alone acting like a native <video> with a file source.
  • Pretty sure there are inefficiencies. Have not tested performance or checked if there’s double YUV->RGB->YUV->RGB going on.

Of course, Chrome and Firefox are the browsers I don’t need ogv.js for for Wikipedia’s current usage, since they play WebM and Ogg natively already. But if Safari and Edge adopt the necessary interfaces and WebRTC-related infrastructure for MediaStreams, it might become possible to use Safari’s full screen view, AirPlay mirroring, and picture-in-picture with ogv.js-driven playback of Ogg, WebM, and potentially other custom or legacy or niche formats.

Unfortunately I can’t test whether casting to a ChromeCast works in Chrome as I’m traveling and don’t have one handy just now. Hoping to find out soon! :D

Alliance for Open Media code drop & more hardware partners

Very exciting! The new video codec so far is mostly based on Google’s in-development VP10 (next gen of the VP8/VP9 used in WebM) but is being co-developed and supported with a number of other orgs.

  • CPU/GPU/SoC makers: Intel, AMD, ARM, NVidia
  • OS & machine makers: Google, Microsoft, Cisco
  • Browser makers: Mozilla, Google, Microsoft
  • Content farms: Netflix, Google (YouTube)

Microsoft is also actively working on VP8/VP9 support for Windows 10, with some limited compatibility in preview releases.

As always, Apple remains conspicuously absent. :(

Like the earlier VP8/VP9 the patent licenses are open and don’t have the kind of weird clauses that have tripped up MPEGLA’s H.264 and HEVC/H.265 in some quarters. (*cough* Linux *cough* Wikipedia)

Totally trying to figure out how we can get involved at this stage; making sure I can build the codec in my iOS app and JavaScript shim environments will be a great start!

Video decoding in the JavaScript platform: “ogv.js, how U work??”

We’ve started deploying my ogv.js JavaScript video/audio playback engine to Wikipedia and Wikimedia Commons for better media compatibility in Safari, Internet Explorer and the new Microsoft Edge browser.

“It’s an older codec, but it checks out. I was about to let them through.”

This first generation uses the Ogg Theora video codec, which we started using on Wikipedia “back in the day” before WebM and MP4/H.264 started fighting it out for dominance of HTML5 video. In fact, Ogg Theora/Vorbis were originally proposed as the baseline standard codecs for HTML5 video and audio elements, but Apple and Microsoft refused to implement it and the standard ended up dropping a baseline requirement altogether.

Ah, standards. There’s so many to choose from!

I’ve got preliminary support for WebM in ogv.js; it needs more work but the real blocker is performance. WebM’s VP8 and newer VP9 video codecs provide much better quality/compression ratios, but require more CPU horsepower to decode than Theora… On a fast MacBook Pro, Safari can play back ‘Llama Drama’ in 1080p Theora but only hits 480p in VP8.

Llama drama in Theora 1080p

That’s about a 5x performance gap in terms of how many pixels we can push… For now, the performance boost from using an older codec is worth it, as it gets older computers and 64-bit mobile devices into the game.

But it also means that to match quality, we have to double the bitrate — and thus bandwidth – of Theora output versus VP8 at the same resolution. So in the longer term, it’d be nice to get VP8 — or the newer VP9 which halves bitrate again — working well enough on ogv.js.

emscripten: making ur C into JS

ogv.js’s player logic is handwritten JavaScript, but the guts of the demuxer and decoders are cross-compiled from well-supported, battle-tested C libraries.

Emscripten is a magical tool developed at Mozilla to help port large C/C++ codebases like games to the web platform. In short, it runs your C/C++ code through the well-known clang compiler, but instead of producing native code it uses a custom LLVM backend that produces JavaScript code that can run in any modern browser or node.js.

Awesome town. But what are the limitations and pain points?

Integer math

Readers with suitably arcane knowledge may be aware that JavaScript has only one numeric type: 64-bit double-precision floating-point.

This is “convenient” for classic scripting in that you don’t have to worry about picking the right numeric type, but it has several horrible, horrible consequences:

  1. When you really wanted 32-bit integers, floating-point math is going to be much slower
  2. When you really wanted 64-bit integers, floating-point math is going to lose precision if your numbers are too big… so you have to emulate with 32-bit integers
  3. If you relied on the specific behavior of 32-bit integer multiplication, you may have to use a slow polyfill of Math.imul

Luckily, because of #1 above, JavaScript JIT compilers have gone to some trouble to optimize common integer math operations. That is, JavaScript engines do support integer types and integer math, just you don’t know for sure when you have an integer at the source level.

Did I say “luckily”? :P

So this leads to one more ugly consequence:

  1. In order to force the JIT compiler to run integer math, emscripten output coerces types constantly — “(x|0)” to force to 32-bit int, or “+x” to force to 64-bit float.

This actually performs well once it’s through the JIT compiler, but it bloats the .js code that we have to ship to the browser.

The heap is an island

Emscripten provides a C-like memory model by using Typed Arrays: a single ArrayBuffer provides a heap that can be read/written directly as various integer and floating point types.

However…

Because all pointers are indexes into the heap, there’s no way for C code to reference data in an external ArrayBuffer or other structure. This is obviously an issue when your video codec needs to decode a data packet that’s been passed to it from JavaScript!

Currently I’m simply copying the input packets into emscripten’s heap in a wrapper function, then calling the decoder on the copy. This works, but the extra copy makes me sad. It’s also relatively slow in Internet Explorer, where the copy implementation using Uint8Array.set() seems to be pretty inefficient.

Getting data out can be done “zero-copy” if you’re careful, by creating a typed-array subview of the emscripten heap; this can be used for instance to upload a WebGL texture directly from the decoder. Neat!

But, that trick doesn’t work when you need to pass data between worker threads.

Workers of the JavaScript world, unite!

Parallel computing is now: these days just about everything from your high-end desktop to your low-end smartphone has at least two CPU cores and can perform multiple tasks in parallel.

Unfortunately, despite half a century of computer science research and a good decade of marketplace factors, doing parallel programming well is still a Hard Problem.

Regular JavaScript provides direct access to only a single thread of execution, which keeps things simple but can be a performance bottleneck. Browser makers introduced Web Workers to fill this gap without introducing the full complexities of shared-memory multithreading…

Essentially, each Worker is its own little JavaScript universe: the main thread context can’t access data in a Worker directly, and the Worker can’t access data from the main context. Neither can one thread cause the other to block… So to communicate between threads, you have to send asynchronous messages.

This is actually a really nice model that reduces the number of ways you can shoot yourself in the foot with multithreading!

But, it maps very poorly to C/C++ threads, where you start with shared memory and foot-shooting and try to build better abstractions on top of that.

So, we’re not yet able to make use of any multithreaded capabilities in the actual decoders. :(

But, we can run the decoders themselves in Worker threads, as long as they’re factored into separate emscripten subprograms. This keeps the main thread humming smoothly even when video decoding is a significant portion of wall-clock time, and can provide a little bit of actual parallelism by running video and audio decoding at the same time.

The Theora and VP8 decoders currently have no inherent multithreading available, but VP9 can so that’s worth looking out for in the future…

Some browser makers are working on providing an “opt-in” shared-memory threading model through an extended ‘SharedArrayBuffer’ that emscripten can make use of, but this is not yet available in any of my target browsers (Safari, IE, Edge).

Waiting for SIMD

Modern CPUs provide SIMD instructions (“Single Instruction Multiple Data”) which can really optimize multimedia operations where you need to do the same thing a lot of times on parallel data.

Codec libraries like libtheora and libvpx use these optimized instructions explicitly in key performance hotspots when compiling to native code… but how do you deal with this when compiling via JavaScript?

There is ongoing work in emscripten and by at least some browser vendors to expose common SIMD operations to JavaScript; I should be able to write suitable patches to libtheora and libvpx to use the appropriate C intrinsics and see if this helps.

But, my main targets (Safari, IE, Edge) don’t support SIMD in JS yet so I haven’t started…

GPU Madness

The obvious next thing to ask is “Hey what about the GPU?” Modern computers come with amazing high-throughput parallel-processing graphics units, and it’s become quite the rage to GPU accelerate everything from graphics to spreadsheets.

The good news is that current versions of all main browsers support WebGL, and ogv.js uses it if available to accelerate drawing and YCbCr-RGB colorspace conversion.

The bad news is that’s all we use it for so far — the actual video decoding is all on the CPU.

It should be possible to use the GPU for at least parts of the video decoding steps. But, it’s going to require jumping through some hoops…

  • WebGL doesn’t provide general-purpose compute shaders, so would have to shovel data into textures and squish computation into fragment shaders meant for processing pixels.
  • WebGL is only available on the main thread, so if decoding is done in a worker there’ll be additional overhead shipping data between threads
  • If we have to read data back from the GPU, that can be slow and block the CPU, dropping efficiency again
  • The codec libraries aren’t really set up with good GPU offloading points in them, so this may be Hard To Do.

libvpx at least has a fork with some OpenCL and RenderScript support — it’s worth investigating. But no idea if this is really feasible in WebGL.

 

In the meantime, I’ve got lots of other things to fix in Wikipedia’s video support so will be concentrating on that stuff, but will keep on improving this as the JS platform evolves!

ogv.js soft launch on Wikipedia and Wikimedia Commons

Soft launch of ogv.js on Wikipedia and Wikimedia Commons has begun! This initial deployment covers the desktop view only, so iPhones and iPads won’t get the media player yet in mobile view.

ogv.js provides a JavaScript compatibility shim for Ogg audio and video playback in Safari 6.1 and higher, IE 10/11, and Microsoft Edge browsers, which gets Wikipedia’s media files working in those browsers. (Due to patent licensing concerns, we don’t provide files in the common MP3 or MP4 H.264/AAC formats, and this has made it difficult to use media files reliably across browsers as Apple and Microsoft have not adopted the free Ogg or WebM formats.)

See list of pending fixes for additional improvements that should go out next week, after which I’ll make wider announcements.

Here, have some samples! In Firefox, Chrome, or Opera these will “just work” with native WebM playback, while in Safari/IE/Edge they will “just work” with JavaScript Ogg playback:

Curiosity’s Seven Minutes of Terror

 

Caminandes – Gran Dillama

 

Using Web Worker threading in ogv.js for smoother playback

I’ve been cleaning up MediaWiki’s “TimedMediaHandler” extension in preparation for merging integration with my ogv.js JavaScript media player for Safari and IE/Edge when no native WebM or Ogg playback is available. It’s coming along well, but one thing continued to worry me about performance: when things worked it was great, but if the video decode was too slow, it could make your browser very sluggish.

In addition to simply selecting too high a resolution for your CPU, this can strike if you accidentally open a debug console during playback (Safari, IE) or more seriously if you’re on an obscure platform that doesn’t have a JavaScript JIT compiler… or using an app’s embedded browser on your iPhone.

Or simply if the integration code’s automatic benchmark overestimated your browser speed, running too much decoding just made everything crap.

Luckily, the HTML5 web platform has a solution — Web Workers.

Workers provide a limited ability to do multithreading in JavaScript, which means the decoder thread can take as long as it wants — the main browser thread can keep responding to user input in the meantime.

The limitation is that scripts running in a Worker have no direct access to your code or data running in the web page’s main thread — you can communicate only by sending messages with ‘raw’ data types. Folks working with lots of DOM browser nodes thus can’t get much benefit, but for buffer-driven compute tasks like media decoding it’s perfect!

Threading comms overhead

My first attempt was to take the existing decoder class (an emscripten module containing the Ogg demuxer, Theora video decoder, and Vorbis audio decoder, etc) and run it in the worker thread, with a proxy object sending data and updated object properties back and forth.

This required a little refactoring to make the decoder interfaces asynchronous, taking callbacks instead of returning results immediately.

It worked pretty well, but there was a lot of overhead due to the demuxer requiring frequent back-and-forth calls — after every processing churn, we had to wait for the demuxer to return its updated status to us on the main thread.

This only took a fraction of a millisecond each time, but a bunch of those add up when your per-frame budget is 1/30 (or even 1/60) second!

 

I had been intending a bigger refactor of the code anyway to use separate emscripten modules for the demuxer and audio/video decoders — this means you don’t have to load code you won’t need, like the Opus audio decoder when you’re only playing Vorbis files.

It also means I could change the coupling, keeping the demuxer on the main thread and moving just the audio/video decoders to workers.

This gives me full speed going back-and-forth on the demuxer, while the decoders can switch to a more “streaming” behavior, sending packets down to be decoded and then displaying the frames or queueing the audio whenever it comes back, without having to wait on it for the next processing iteration.

The result is pretty awesome — in particular on older Windows machines, IE 11 has to use the Flash plugin to do audio and I was previously seeing a lot of “stuttery” behavior when the video decode blocked the Flash audio queueing or vice versa… now it’s much smoother.

The main bug left in the worker mode is that my audio/video sync handling code doesn’t properly handle the case where video decoding is consistently too slow — when we were on the main thread, this caused the audio to halt due to the main thread being blocked; now the audio just keeps on going and the video keeps playing as fast as it can and never catches up. :)

However this should be easy to fix, and having it be wrong but NOT FREEZING YOUR BROWSER is an improvement over having sync but FREEZING YOUR BROWSER. :)