I spent a little time yesterday and today poking at an old project to encode video into the WebM format we use at Wikipedia on your iPhone or iPad so you could, potentially, take video and upload it directly.
The default encoding settings were mmmuuuucccchhhh tttoooooo ssslllooowww to be practical, so I had to tune it to use a faster configuration. But to maintain quality, you have to bump up the bitrate (and thus file size) significantly.
This same tradeoff is made in the hardware video encoders in the device, too! When you’re making a regular camera recording the bitrate is actually several times higher than it would be on a typical download/stream of the same video from YouTube/Netflix/etc. You just don’t have the luxury of the extra encoding time on a modest mobile chip, especially not if you’re recording live.
In my last post I described using AVSampleBufferDisplayLayer to output manually-uncompressed YUV video frames in an iOS app, for playing WebM and Ogg files from Wikimedia Commons. After further experimentation I’ve decided to instead stick with using OpenGL ES directly, and here’s why…
640×360 output regularly displays with a weird horizontal offset corruption on iPad Pro 9.7″. Bug filed as rdar://29810344
Can’t get any pixel format with 4:4:4 subsampling to display. Theora and VP9 both support 4:4:4 subsampling, so that made some files unplayable.
Core Video pixel buffers for 4:2:2 and 4:4:4 are packed formats, and it prefers 4:2:0 to be a weird biplanar semi-packed format. This requires conversion from the planar output I already have, which may be cheap with Neon instructions but isn’t free.
Instead, I’m treating each plane as a separate one-channel grayscale image, which works for any chroma subsampling ratios. I’m using some Core Video bits (CVPixelBufferPool and CVOpenGLESTextureCache) to do texture setup instead of manually calling glTeximage2d with a raw source blob, which improves a few things:
Can do CPU->GPU memory copy off main thread easily, without worrying about locking my GL context.
No pixel format conversions, so straight memcpy for each line…
Buffer pools are tied to the video buffer’s format object, and get swapped out automatically when the format changes (new file, or file changes resolution).
Don’t have to manually account for stride != width in the texture setup!
It could be more efficient still if I could pre-allocate CVPixelBuffers with on-GPU memory and hand them to libvpx and libtheora to decode into… but they currently lack sufficient interfaces to accept frame buffers with GPU-allocated sizes.
A few other oddities I noticed:
The clean aperture rectangle setting doesn’t seem to be preserved when creating a CVPixelBuffer via CVPixelBufferPool; I have to re-set it when creating new buffers.
For grayscale buffers, the clean aperture doesn’t seem to be picked up by CVOpenGLESTextureGetCleanTexCoords. Not sure if this is only supposed to work with Y’CbCr buffer types or what… however I already have all these numbers in my format object and just pull from there. :)
I also fell down a rabbit hole researching color space issues after noticing that some of the video formats support multiple colorspace variants that may imply different RGB conversion matrices… and maybe gamma…. and what do R, G, and B mean anyway? :) Deserves another post sometime.
One of my little projects is OGVKit, a library for playing Ogg and WebM media on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264 due to patent licensing concerns, but Apple doesn’t support these formats, so we have to jump through some hoops…)
A trick with working with digital video is that video frames are usually processed, compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used in the rest of the digital display pipeline.
This means that you can’t just take the output from a video decoder and blit it to the screen — you need to know how to dig out the pixel data and recombine it into RGB first.
Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader — I actually ported it over from ogv.js‘s WebGL drawing code. But surely a system like iOS with pervasive hardware-accelerated video playback already has some handy way to draw YUV frames?
While researching working with system-standard CMSampleBuffer objects to replace my custom OGVVideoBuffer class, I discovered that iOS 8 and later (and macOS version something) do have a such handy output path: AVSampleBufferDisplayLayer. This guy has three special tricks:
CMSampleBuffer objects go in, pretty pictures on screen come out!
Can manage a queue of buffers, synchronizing display times to a provided clock!
If you pass compressed H.264 buffers, it handles decompression transparently!
I’m decompressing from a format AVFoundation doesn’t grok so the transparent decompression isn’t interesting to me, but since it claimed to accept uncompressed buffers too I figured this might simplify my display output path…
The queue system sounds like it might simplify my timing and state management, but is a bigger change to my code to make so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.
There are however two major caveats:
AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end up repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. :)
Uncompressed frames must be in a very particular format or you get no visible output and no error messages.
Specifically, it wants a CMSampleBuffer backed by a CVPixelBuffer that’s IOSurface-backed, using bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However libtheora and libvpx produce output in traditional tri-planar format, with separate Y, U and V planes. This meant I had to create buffers in appropriate format with appropriate backing memory, copy the Y plane, and then interleave the U and V planes into a single chroma muddle.
My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized I had a really crappy loop around every pixel. ;)
Using memcpy — a highly optimized system function — to copy the luma lines cut the time down to 3-4ms per frame. A little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame using a couple ARM-specific vector intrinsic functions, inspired by assembly code I found googling around for YUV layout conversions.
It turns out you can interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:
This might be even faster if copying is done on a “slice” basis during decoding, while the bits of the frame being copied are in cache, but I haven’t tried this yet.
With the more efficient copies, the AVSampleBufferDisplayLayer-based output doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter for conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload to Wikimedia Commons…
“It’s an older codec, but it checks out. I was about to let them through.”
This first generation uses the Ogg Theora video codec, which we started using on Wikipedia “back in the day” before WebM and MP4/H.264 started fighting it out for dominance of HTML5 video. In fact, Ogg Theora/Vorbis were originally proposed as the baseline standard codecs for HTML5 video and audio elements, but Apple and Microsoft refused to implement it and the standard ended up dropping a baseline requirement altogether.
Ah, standards. There’s so many to choose from!
I’ve got preliminary support for WebM in ogv.js; it needs more work but the real blocker is performance. WebM’s VP8 and newer VP9 video codecs provide much better quality/compression ratios, but require more CPU horsepower to decode than Theora… On a fast MacBook Pro, Safari can play back ‘Llama Drama’ in 1080p Theora but only hits 480p in VP8.
That’s about a 5x performance gap in terms of how many pixels we can push… For now, the performance boost from using an older codec is worth it, as it gets older computers and 64-bit mobile devices into the game.
But it also means that to match quality, we have to double the bitrate — and thus bandwidth — of Theora output versus VP8 at the same resolution. So in the longer term, it’d be nice to get VP8 — or the newer VP9 which halves bitrate again — working well enough on ogv.js.
emscripten: making ur C into JS
Awesome town. But what are the limitations and pain points?
This is “convenient” for classic scripting in that you don’t have to worry about picking the right numeric type, but it has several horrible, horrible consequences:
When you really wanted 32-bit integers, floating-point math is going to be much slower
When you really wanted 64-bit integers, floating-point math is going to lose precision if your numbers are too big… so you have to emulate with 32-bit integers
If you relied on the specific behavior of 32-bit integer multiplication, you may have to use a slow polyfill of Math.imul
Did I say “luckily”? :P
So this leads to one more ugly consequence:
In order to force the JIT compiler to run integer math, emscripten output coerces types constantly — “(x|0)” to force to 32-bit int, or “+x” to force to 64-bit float.
This actually performs well once it’s through the JIT compiler, but it bloats the .js code that we have to ship to the browser.
The heap is an island
Emscripten provides a C-like memory model by using Typed Arrays: a single ArrayBuffer provides a heap that can be read/written directly as various integer and floating point types.
Currently I’m simply copying the input packets into emscripten’s heap in a wrapper function, then calling the decoder on the copy. This works, but the extra copy makes me sad. It’s also relatively slow in Internet Explorer, where the copy implementation using Uint8Array.set() seems to be pretty inefficient.
Getting data out can be done “zero-copy” if you’re careful, by creating a typed-array subview of the emscripten heap; this can be used for instance to upload a WebGL texture directly from the decoder. Neat!
But, that trick doesn’t work when you need to pass data between worker threads.
Parallel computing is now: these days just about everything from your high-end desktop to your low-end smartphone has at least two CPU cores and can perform multiple tasks in parallel.
Unfortunately, despite half a century of computer science research and a good decade of marketplace factors, doing parallel programming well is still a Hard Problem.
This is actually a really nice model that reduces the number of ways you can shoot yourself in the foot with multithreading!
But, it maps very poorly to C/C++ threads, where you start with shared memory and foot-shooting and try to build better abstractions on top of that.
So, we’re not yet able to make use of any multithreaded capabilities in the actual decoders. :(
But, we can run the decoders themselves in Worker threads, as long as they’re factored into separate emscripten subprograms. This keeps the main thread humming smoothly even when video decoding is a significant portion of wall-clock time, and can provide a little bit of actual parallelism by running video and audio decoding at the same time.
The Theora and VP8 decoders currently have no inherent multithreading available, but VP9 can so that’s worth looking out for in the future…
Some browser makers are working on providing an “opt-in” shared-memory threading model through an extended ‘SharedArrayBuffer’ that emscripten can make use of, but this is not yet available in any of my target browsers (Safari, IE, Edge).
Waiting for SIMD
Modern CPUs provide SIMD instructions (“Single Instruction Multiple Data”) which can really optimize multimedia operations where you need to do the same thing a lot of times on parallel data.
But, my main targets (Safari, IE, Edge) don’t support SIMD in JS yet so I haven’t started…
The obvious next thing to ask is “Hey what about the GPU?” Modern computers come with amazing high-throughput parallel-processing graphics units, and it’s become quite the rage to GPU accelerate everything from graphics to spreadsheets.
The good news is that current versions of all main browsers support WebGL, and ogv.js uses it if available to accelerate drawing and YCbCr-RGB colorspace conversion.
The bad news is that’s all we use it for so far — the actual video decoding is all on the CPU.
It should be possible to use the GPU for at least parts of the video decoding steps. But, it’s going to require jumping through some hoops…
WebGL doesn’t provide general-purpose compute shaders, so would have to shovel data into textures and squish computation into fragment shaders meant for processing pixels.
WebGL is only available on the main thread, so if decoding is done in a worker there’ll be additional overhead shipping data between threads
If we have to read data back from the GPU, that can be slow and block the CPU, dropping efficiency again
The codec libraries aren’t really set up with good GPU offloading points in them, so this may be Hard To Do.
Or simply if the integration code’s automatic benchmark overestimated your browser speed, running too much decoding just made everything crap.
Luckily, the HTML5 web platform has a solution — Web Workers.
The limitation is that scripts running in a Worker have no direct access to your code or data running in the web page’s main thread — you can communicate only by sending messages with ‘raw’ data types. Folks working with lots of DOM browser nodes thus can’t get much benefit, but for buffer-driven compute tasks like media decoding it’s perfect!
Threading comms overhead
My first attempt was to take the existing decoder class (an emscripten module containing the Ogg demuxer, Theora video decoder, and Vorbis audio decoder, etc) and run it in the worker thread, with a proxy object sending data and updated object properties back and forth.
This required a little refactoring to make the decoder interfaces asynchronous, taking callbacks instead of returning results immediately.
It worked pretty well, but there was a lot of overhead due to the demuxer requiring frequent back-and-forth calls — after every processing churn, we had to wait for the demuxer to return its updated status to us on the main thread.
This only took a fraction of a millisecond each time, but a bunch of those add up when your per-frame budget is 1/30 (or even 1/60) second!
I had been intending a bigger refactor of the code anyway to use separate emscripten modules for the demuxer and audio/video decoders — this means you don’t have to load code you won’t need, like the Opus audio decoder when you’re only playing Vorbis files.
It also means I could change the coupling, keeping the demuxer on the main thread and moving just the audio/video decoders to workers.
This gives me full speed going back-and-forth on the demuxer, while the decoders can switch to a more “streaming” behavior, sending packets down to be decoded and then displaying the frames or queueing the audio whenever it comes back, without having to wait on it for the next processing iteration.
The result is pretty awesome — in particular on older Windows machines, IE 11 has to use the Flash plugin to do audio and I was previously seeing a lot of “stuttery” behavior when the video decode blocked the Flash audio queueing or vice versa… now it’s much smoother.
The main bug left in the worker mode is that my audio/video sync handling code doesn’t properly handle the case where video decoding is consistently too slow — when we were on the main thread, this caused the audio to halt due to the main thread being blocked; now the audio just keeps on going and the video keeps playing as fast as it can and never catches up. :)
However this should be easy to fix, and having it be wrong but NOT FREEZING YOUR BROWSER is an improvement over having sync but FREEZING YOUR BROWSER. :)
In cleaning it up for release, I’ve noticed some performance regressions on IE and Edge due to cleaning out old code I thought was no longer needed.
There were a fair number of folks interested in video chatting at Wikimania! A few quick updates:
An experimental ‘Schnittserver’ (‘Clip server’) project has been in the works for a while with some funding from ze Germans; currently sitting at http://wikimedia.meltvideo.com/ (uses OAuth, has a temporary SSL cert, UI is very primitive!) It is currently usable already for converting MP4 etc source footage to WebM!
The Schnittserver can also do server-side rendering of projects using the ‘melt’ format such as those created with Kdenlive and Shotcut — this allows uploading your original footage (usually in some sort of MP4/H.264 flavor) and sharing the editing project via WebM proxy clips, without generational loss on the final rendering.
Once rendered, your final WebM output can be published up to Commons.
I would love to see some more support for this project, including adding a better web front-end for managing projects/clips and even editing…
Mozilla has an in-browser media editor thing called Popcorn.js; they’re unfortunately reducing investment in the project, but there’s some talk among people working on it and on our end that Wikimedia might be interested in helping adapt it to work with the Schnittserver or some future replacement for it.
Unfortunately I missed the session with the person working on Popcorn.js, will have to catch up later on it!
Recently fixed some major sound sync bugs on slower devices, and am finishing up controls which will be used in the mobile view (when not using the full TimedMediaHandler / MwEmbedPlayer interface which we still have on the desktop).
A slightly older version of ogv.js is also running on https://ogvjs-testing.wmflabs.org/ with integration into TimedMediaHandler; I’ll update those patches with my 1.0 release next week or so.
I had a talk with Faidon about video requirements on the low-level infrastructure layer; there are some things we need to work on before we really push video:
– seeking/streaming a file with Range subsets causes requests to bypass the Varnish cache layer, potentially causing huge performance problems if there’s a usage spike!
– very large files can’t be sharded cleanly over multiple servers, which makes for further performance bottlenecks on popular files again
– VERY large files (>4G or so) can’t be stored at all; which is a problem for high-quality uploads of things like long Wikimania talks!
For derivative transcodes, we can bypass some of these problems by chunking the output into multiple files of limited length and rigging up ‘gapless playback’, as can be done for HLS or MPEG-DASH-style live streaming. I’m pretty sure I can work out how to do this in the ogv.js player (for Safari and IE) as well as in the native <video> element playback for Chrome and Firefox via Media Source Extensions. Assuming it works with the standard DASH profile for WebM, this is something we can easily make work on Android as well using Google’s ExoPlayer.
DASH playback will also make it easier to use adaptive source switching to handle limited bandwidth or CPU resources.
However we still need to be able to deal with source files which may be potentially quite large…
I’ve been cleaning up some of my old test code for running Ogg media on iOS, adding WebM support and turning it into OGVKit, a (soon-to-be) reusable library that we can use to finally add video and audio playback to our Wikipedia iPhone app.
Of course decoding VP8 or Theora video on the CPU is going to be more expensive in terms of energy usage than decoding H.264 in dedicated silicon… but how much more?
The iOS 9 beta SDK supports enhanced energy monitoring in Xcode 7 beta… let’s try it out! The diagnostic detail screen looks like so:
Whoa! That’s a little overwhelming. What’s actually going on here?
First, what’s going on here
I’ve got my OGVKit demo app playing this video “Curiosity’s Seven Minutes of Terror” found on Wikimedia Commons, on two devices running iOS 9 beta: an iPod Touch (the lowest-end currently sold iDevice) and an iPad Air (one generation behind the highest-end currently sold iDevice).
The iPod Touch is playing a modest 360p WebM transcode, while the iPad Air is playing a higher-resolution 720p WebM transcode with its beefier 64-bit CPU:
First look: the cost of networking
At first, the energy usage looks pretty high:
This however is because in addition to media playback we’re buffering umpty-ump megabytes over HTTPS over wifi — as fast as a 150 Mbps cable connection will allow.
Once the download completes, the CPU usage from SSL decoding goes down, the wifi reduces its power consumption, and our energy usage relatively flattens.
Now what’s the spot-meter look like?
Pretty cool, right!?
See approximate reported energy usage levels for all transcode formats (Ogg Theora and WebM at various resolutions) if you like! Ogg Theora is a little faster to decode but WebM looks significantly better at the bitrates we use.
Ok but how’s that compare to native H.264 playback?
Good question. I’m about to try it and find out.
Ok here’s what we got:
The native AVPlayer downloads smaller chunks more slowly, but similarly shows higher CPU and energy usage during download. Once playing only, reported CPU usage dives to a percent or two and the reported energy impact is “Zero”.
Now, I’m not sure I believe “Zero”… ;)
I suppose I’ll have to rig up some kind of ‘run until the battery dies’ test to compare how reasonable this looks for non-trivial playback times… but the ‘Low’ reportage for WebM at reasonable resolutions makes me happier than ‘Very High’ would have!
I’ve been passing the last few days feverishly working on audio/video stuff, cause it’s been driving me nuts that it’s not quite in working shape.
TL;DR: Major fixes in the works for Android, Safari (iOS and Mac), and IE/Edge (Windows). Need testers and patch reviewers.
ogv.js for Safari/IE/Edge
I’ll want to update it to work with Video.js later, but I’d love to get this version reviewed and deployed in the meantime.
Please head over to https://ogvjs-testing.wmflabs.org/ in Safari 6.1+ or IE 10+ (or ‘Project Spartan’ on Windows 10 preview) and try it out! Particularly interested in cases where it doesn’t work or messes up.
However these get really bad compression ratios, so to keep bandwidth down similar to the 360p Ogg and WebM versions I had to reduce quality and resolution significantly. Hold an iPhone at arm’s length and it’s maybe ok, but zoom full-screen on your iPad and you’ll hate the giant blurry pixels!
This should also provide a working basic audio/video experience in our Wikipedia iOS app, until such time as we integrate Ogg or WebM decoding natively into the app.
Note that it seems tricky to bulk-run new transcodes on old files with TimedMediaHandler. I assume there’s a convenient way to do it that I just haven’t found in the extension maint scripts…
In progress: mobile video fixes
Audio has worked on Android for a while — the .ogg files show up in native <audio> elements and Just Work.
But video has been often broken, with TimedMediaHandler’s “popup transforms” reducing most video embeds into a thumbnail and a link to the original file — which might play if WebM (not if Ogg Theora) but it might also be a 1080p original which you don’t want to pull down on 3G! And neither audio nor video has worked on iOS.
This patch adds a simple mobile target for TMH, which fixes the popup transforms to look better and actually work by loading up an embedded-size player with the appropriately playable transcodes (WebM, Ogg, and the MJPEG last-ditch fallback).
ogv.js is used if available and necessary, for instance in iOS Safari when the CPU is fast enough. (Known to work only on 64-bit models.)
Future: codec.js and WebM and OGVKit
For the future, I’m also working on extending ogv.js to support WebM for better quality (especially in high-motion scenes) — once that stabilizes I’ll rename the combined package codec.js. Performance of WebM is not yet good enough to deploy, and some features like seeking are still missing, but breaking out the codec modules means I can develop the codecs in parallel and keep the high-level player logic in common.
Browser infrastructure improvements like SIMD, threading, and more GPU access should continue to make WebM decoding faster in the future as well.
I’d also like to finish up my OGVKit package for iOS, so we can embed a basic audio/video player at full quality into the Wikipedia iOS app. This needs some more cleanup work still.