May 2016 – brionv

Thoughts on Ogg adaptive streaming

So I’d like to use adaptive streaming for video playback on Wikipedia and Wikimedia Commons, automatically selecting the appropriate source format and resolution at runtime based on bandwidth and CPU availability.

For Safari, Edge, and IE users, that means figuring out how to rig a Media Source Extensions-like interface into ogv.js to let the streaming handler inject its buffered data into the demuxer and codecs instead of letting the player handle its own buffering.

It also means I have to figure out how to do adaptive stream switching for Ogg streams and Theora video, since WebM VP8 still decodes too slowly in ogv.js to rely on for deployment…

Theory vs Theora

At its base, adaptive streaming relies on the ability to feed the decoders with data from another stream without them freaking out and demanding a pause or reset. We can either read packets from a subset of a monolithic file for each source, or from a bunch of tiny segmented files.

In order to do this, generally you need to switch on video keyframe boundaries: each keyframe represents a point in the data stream where the video decoder can reset its state.

For WebM with VP8 and VP9 codecs, the decoders are pretty good at this. As long as you came in on a keyframe boundary you can just start feeding it packets at a new resolution and it’ll happily output frames at the new resolution.

For Ogg Theora, there are a few major impediments.

Ogg stream serial numbers

At the Ogg stream level: each Ogg logical bitstream gets a random serial number; those serial numbers will not match across separate encodings at different resolutions.

Ogg explicitly allows for “chaining” of complete bitstreams, where one ends and you just tack another on, but we’re not quite doing that here… We want to be able to switch partway through with minimal interruption.

For Vorbis audio, this might require some work if pulling audio+video together from combined .ogv files, but it gets simpler if there’s one .oga audio stream and separate video-only .ogv streams — we’d essentially have separate demuxer contexts for audio and video, and would not need to meddle with the audio.

For the Theora video stream this is probably ok too, since when we reach a switch boundary we also need to feed the decoder with…

Header packets

Every Theora video stream sets up start codes at the beginning of the stream in its three header packets. This means that encodings of the same video at different resolutions will have different header setup.

So, when we switch sources we’ll need to reinitialize the Theora decoder with the header packets from the target stream; then it should be safe to feed new packets into it from our arbitrary start position.

This isn’t a super exotic requirement; I’ve seen some provision for ‘start codes’ for MP4 adaptive streaming too.

Keyframe timing

More worrisome is that keyframe timing is not predictable in a Theora stream. This is actually due to the libtheora encoder internals — it allows you to specify a maximum keyframe interval, but it may decide at any time to insert a keyframe on its own if it thinks it’s more efficient to store a frame that way, at which point the interval starts counting from there instead of the last scheduled keyframe.

Since this heuristic is determined based on actual frame data, the early keyframes will appear in different times and places for renderings at different resolutions… And so will every keyframe following them.

This means you don’t have switch points that are consistent between sources, breaking the whole model!

It looks like a keyframe can be forced by changing the keyframe interval to 1 right before a desire keyframe, then changing it back to the desired value after. This would result in still getting some early keyframes at unpredictable times, but then also getting predictable ones. As long as the switchover points aren’t too often, that’s probably fine — just keep decoding over the extra keyframes, but only switch/segment on the predictable ones.

Streams vs split files

Another note: it’s possible to either store data as one long file per source, or to split it up into small chunk files at each keyframe boundary.

Chunk files are nice because they can be streamed easily without use of the HTTP ‘Range’ header and they’re friendly to cache layers. Long files can be easier to manage on the server, but Wikimedia ops folks have told me that the way large files are stored doesn’t always interact ideally with the caching layer and they’d be much happier with split chunk files!

A downside of chunks is that it’s harder to download a complete copy of a file at a given resolution for offline playback. But, if we split audio and video tracks we’re in a world where that’s hard anyway… Can either just say “download the full resolution source then!” Or provide a remuxer to produce combined files for download on the fly from the chunks… :)
The keyframe timing seems the ugliest issue to deal with; may need to patch ffmpeg2theora or ffmpeg to work around it, but at least shouldn’t have to mess with libtheora itself…

WebM seeking coming in ogv.js 1.1.2

Seeking in WebM playback will finally be supported in ogv.js 1.1.2. Yay! Try it out!

This took some fancy footwork, as the demuxer library I’m using (nestegg) only provides a synchronous i/o-using interface for seeking: on the first seek, it needed to be able to first do a seek to the location of the cues in the file (they’re usually at the end in WebM, not at the beginning!), then read in the cue data, and then run a second seek to the start of a cluster.

On examining the innards of the library I found that it’s fairly safe to ‘restart’ the operation directly after the seek attempts, which saved me the trouble of patching the library code; though I may come up with a patch to more cleanly & reliably do this.

For the initial hack, I have my i/o callbacks detect that an attempt was made to seek outside the buffer range, and then when the nestegg library function fails out, the demuxer sees the seek attempt and passes it back to the caller (OGVPlayer) which is able to trigger a seek on the actual, asynchronous i/o layer. Once the new data starts arriving, we call back into the demuxer to read the cues or confirm that we’ve seeked to the right place and can continue decoding. As an additional protection against the library freaking out, I make sure that the entire cues element has been buffered before attempting to read it.

I’ve also fixed a bug that caused WebM playback to occasionally die with an out of memory error. This one was caused by not freeing packet data in the demuxer. *headdesk* At least that bug was easier. ;)

This gets WebM VP8 playback working about as well as Ogg Theora playback, except for the decoding being about 5x slower! Well, I’ve got plans for that. :D

VP8 parallelization continued

Following up on earlier musings… Found an interesting analysis by someone who did some work in this area a few years ago.

Based on the way data flows from Â macroblocks adjacent above or to the left, it is safe to parallelize multiple super blocks in diagonal lines running from top-right towards the bottom-left. This means you start with a batch of one macroblock, then a batch of two, then three…. Up to the maximum diagonal dimension (30 blocks at 480p or 68 at 1080p) then back down again to one at the far bottom right corner. Hmm, not exactly diagonal, actually half diagonal?

(Based on my reading the filter stage doesn’t need to access the above-right block which simplifies things, but I could be wrong… Intraframe prediction looks scarier though and may need the different angle cut. Anyway the half diagonal order should work for the entire set of all operations…).

This breakdown should be suitable both for CPU worker-based threading (breaking the batches into smaller sub-batches per core) and for WebGL shader based GPU work (where each large batch would issue several draw calls, each processing up to the full batch size of macroblocks).

For CPU work there could also be pipelining between the data stages, though if doing full batching that may not be necessary.

Data locality and latency

On modern computers, accessing memory is often the slowest part of a naively written calculation. If the memory you’re working with isn’t in cache, fetching it can be verrrry slow. And if you need to transfer data between the CPU and GPU things get even nastier.

Unpacking the entropy-coded data structures for an entire frames worth of macroblocks then processing them in a different order sounds like it might be expensive. But it shouldn’t be insanely so; most processing will be local to each block.

For GPU usage, data also flows mostly one way from the CPU into textures and arrays uploaded into the GPU, where random texel access should be fast. We shouldn’t need to read any of that data back to the CPU until the end of the loop filter, when we read the YUV buffers back to feed into the next frame’s predictions and output to the player.

What we do need for the GPU is to read the results of each batch’s computations back into the next batch. As long as I can render into a texture and use that texture as a source for the next call I think that all works as I need it.

Loop filter madness

The loop filter stage reduces artifacting at the edges of macroblocks and sub-blocks from motion prediction and DCT fun. Because it’s very precisely specced, and the output feeds back into the next frame’s predictions, it’s important to get this right.

per spec, each macroblock may apply up to four filter passes:

left edge

subblock left edges

top edge

subblock top edges

Along each edge, for each pixel along the edge the boundaries are tested for a threshold and a filter may or may not be applied, variously to one, two, or three pixels deep from the edge. For the macroblock edges, when the filter applies it also modifies the adjacent block data!

It’s all pretty funky, but it looks parallelizable within limits.

Ah, now to find time to research this further while still getting other stuff done. ;)

Peeking into VP8 video decoding performance

The ogv.js distribution includes Ogg Theora video decoding, which we use on Wikipedia to play back our media files in Safari, IE, and Edge, but also has an experimental mode for WebM files with VP8 video.

WebM/VP8 is better supported by many tools, and provides higher quality video at lower bandwidth usage than Ogg Theora… There are two major reasons I haven’t taken WebM out of “experimental” status:

The demuxer library (nestegg) was hacked in quickly and I need to add support for seeking, “not crashing”, etc
Decoding VP8 video is several times slower than decoding Theora at the same resolution.

The first is a simple engineering problem; I can either hack up nestegg or replace it with something that fits the i/o model I’m working with better.

The second is an intrinsic complexity problem: although the two formats are broadly similar technologies, VP8 requires more computation to decode a frameÂ than Theora does.

Add to this the fact that we have some environmental limitations on common CPU optimizations for parallelizable code in the C libraries:

JavaScript “Worker” threads are different from the low-level pthreads threading model used by C code, and the interfaces required to more closely emulate it are not yet available in Safari or Edge.
SIMD (“Same Instruction Multiple Data”) processing is not available in Safari, and not yet production-enabled in Edge.

So, if we can’t use SIMD instructions to parallelize tiny bits of processing, and we can’t simply crank up multithreading to use a now-ubiquitous second CPU core, how can we split up the work?

The first step, I think, is to determine some data boundaries where we might hope to be able to export data from the libvpx library before it’s fully done processing.

What’s what

The VP8 decoder & bitstream format is defined in RFC 6386, from which we can roughly divide decoding into four stages:

…decode input stream entropy encoding…
reconstruct base signal by applying motion vectors against a reference frame
reconstruct residual signal by applying inverse DCT
apply loop filter

The entropy encoding isn’t really a separate stage, but happens as other data gets read in and then fed into each other stage. It’s not really parallelizable itself either.

Aiming high

So before I go researching more on trying to optimize some of these steps, what’s actually the biggest, slowest stuff to work on?

I did a quick profiling run playing back some 480p WebM in Chrome (pretty good compiler, and the profiling tools support profiling just one worker thread which isolates the VP8 decoder nicely). Broke down the resulting function self-time list by rough category and found:

Filter: 54.40%
Motion: 22.75%
IDCT: 10.55%
Other: 12.31%

(“Other” included any function I didn’t tag, as well as general overhead.)

Ouch! Filtering looks like a good first application of a separate step.

Possible directions – staying on CPU

If we stick with the CPU, we can create further Worker threads and send blocks of data to be filtered. In many cases, even when processing of one macroblock to another is hard to parallelize because of data dependencies, there is natural parallelism in the color system — the Y (“luma”, or brightness) plane can be processed by one thread while the U and V (“chroma”, or color) planes can be processed independently by another.

However splitting processing between luma and chroma has a limited benefit. There’s twice as much luma data as chroma, so you save at most 33% of the runtime here.

Macroblocks and subblocks

VP8 divides most processing & data into “macroblocks” of 16×16 pixels, composed of 24 “subblocks” of 4×4 pixels (that’s 16 subblocks of luma data, plus 4 each from the two lower-resolution chroma planes).

In many cases we can parallelize at the subblock level, which divides 24 evenly into 2 cores, or even 4 or 8 cores! But the most performance-sensitive devices will usually only have 2 CPU cores available, giving us only a 50% potential speedup.

Going crazy – GPU time?

Graphics processors are much more aggressively multithreaded, with multiple cores and ability to handle many more simultaneous operations.

If we run more operations on the GPU, we might actually have a chance to run 24 subblocks all in parallel!

However there’s a lot more fuzziness involved in getting there:

may have to rewrite code from C into GLSL
have to figure out how to operate as fragment or vertex shaders
have to figure out how to efficiently get data in/out
oh and you can’t use WebGL from a Worker thread, so have to bounce data up to the main thread and back down to the codec
once all that’s done, what’s the actual performance of the GPU vs the CPU per-operation? no idea :D

So, there would seem to be great scaling potential on the GPU side, but there are a lot of unknowns.

Worth investigating but we’ll see what’s really feasible…

1000fps no more

Different media file formats encode things like time and frame rates differently… or not at all.

WebM doesn’t list a frame rate; each frame is simply given a position in time. Meanwhile the older Ogg Theora codec defines a consistent, pre-defined frame rate for a stream, but allows frames to be declared as duplicates of the previous frame as an optimization.

At the intersection of these two, some files auto-converted from WebM to Ogg on Wikimedia’s servers end up claiming to encode a “1000 fps” video stream, where nearly all the frames are dupes and there’s actually ~25-30 or at most 60 actual frames per second.

I had to put a hack into my ogv.js player to handle these, because actually trying to draw 1000 frames per second was kind of slow. ;)

https://github.com/brion/ogv.js/issues/349

ogv.js 1.1.0 alpha now on npm

ogv.js 1.1.0-alpha.0 is now available for download:

Big thanks to Stephan Hesse who retooled large chunks of the build system using webpack, which brought us a lot closer to the npm package release.

ogv.js 1.1.0 is a drop-in update to 1.0; many internal classes are no longer leaked to global namespace, but the public OGV* classes remain as they are.

The internal AudioFeeder class is also available as a separate npm audio-feeder package; more internal classes will follow including the streaming URL reader and the WebGL-accelerated YUV canvas.

In addition to internal/build changes, this release has major fixes for seeking in ogg files, implements the volume property, and adds support for more properties and events (not yet 100% up to spec, but closer).

After a few more days shaking this out I’ll push it up to MediaWiki’s TimedMediaHandler extension, where it’ll make it to Wikipedia and Wikimedia Commons.