Thoughts on Ogg adaptive streaming

So I’d like to use adaptive streaming for video playback on Wikipedia and Wikimedia Commons, automatically selecting the appropriate source format and resolution at runtime based on bandwidth and CPU availability.

For Safari, Edge, and IE users, that means figuring out how to rig a Media Source Extensions-like interface into ogv.js to let the streaming handler inject its buffered data into the demuxer and codecs instead of letting the player handle its own buffering.

It also means I have to figure out how to do adaptive stream switching for Ogg streams and Theora video, since WebM VP8 still decodes too slowly in ogv.js to rely on for deployment…

Theory vs Theora

At its base, adaptive streaming relies on the ability to feed the decoders with data from another stream without them freaking out and demanding a pause or reset. We can either read packets from a subset of a monolithic file for each source, or from a bunch of tiny segmented files.

In order to do this, generally you need to switch on video keyframe boundaries: each keyframe represents a point in the data stream where the video decoder can reset its state.

For WebM with VP8 and VP9 codecs, the decoders are pretty good at this. As long as you came in on a keyframe boundary you can just start feeding it packets at a new resolution and it’ll happily output frames at the new resolution.

For Ogg Theora, there are a few major impediments.

Ogg stream serial numbers

At the Ogg stream level: each Ogg logical bitstream gets a random serial number; those serial numbers will not match across separate encodings at different resolutions.
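To make the mismatch concrete, here is a minimal sketch of pulling the serial number out of an Ogg page header, which is how a demuxer would notice that pages from another encoding belong to a "different" logical bitstream. The field layout follows the fixed 27-byte Ogg page header; the function name `read_page_serial` is made up for illustration, and the CRC is not verified.

```python
import struct

# Fixed 27-byte Ogg page header: capture pattern, version, header type,
# granule position, bitstream serial number, page sequence number,
# CRC checksum, and the count of lacing (segment table) values that follow.
OGG_PAGE_HEADER = struct.Struct("<4sBBqIIIB")


def read_page_serial(data, offset=0):
    """Return (serial_number, total_page_length) for the page at `offset`.

    Illustrative helper only: it does not check the page CRC.
    """
    (capture, version, _flags, _granule, serial,
     _sequence, _crc, n_segments) = OGG_PAGE_HEADER.unpack_from(data, offset)
    if capture != b"OggS" or version != 0:
        raise ValueError("not an Ogg page")
    table_start = offset + OGG_PAGE_HEADER.size
    lacing = data[table_start:table_start + n_segments]
    if len(lacing) != n_segments:
        raise ValueError("truncated segment table")
    # Page body length is the sum of the lacing values.
    return serial, OGG_PAGE_HEADER.size + n_segments + sum(lacing)
```

A switching demuxer could compare the serial from each incoming page against the one it expects for the active source, and treat a change as "new logical stream, expect new headers" rather than as an error.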

Ogg explicitly allows for “chaining” of complete bitstreams, where one ends and you just tack another on, but we’re not quite doing that here… We want to be able to switch partway through with minimal interruption.

For Vorbis audio, this might require some work if pulling audio+video together from combined .ogv files, but it gets simpler if there’s one .oga audio stream and separate video-only .ogv streams — we’d essentially have separate demuxer contexts for audio and video, and would not need to meddle with the audio.

For the Theora video stream this is probably ok too, since when we reach a switch boundary we also need to feed the decoder with…

Header packets

Every Theora video stream sets up its ‘start codes’ in three header packets at the beginning of the stream: identification, comment, and setup. Since the identification header carries the frame dimensions, encodings of the same video at different resolutions will have different header setup.

So, when we switch sources we’ll need to reinitialize the Theora decoder with the header packets from the target stream; then it should be safe to feed new packets into it from our arbitrary start position.
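As a rough sketch of that switching logic: hold the header packets for every source, and when a switch has been requested, reinitialize the decoder at the first keyframe that arrives from the target source. Everything here is hypothetical scaffolding (`RecordingDecoder` just logs calls and is not the ogv.js or libtheora API); the point is only the ordering of reset versus decode.

```python
class RecordingDecoder:
    """Hypothetical stand-in for a Theora decoder; it only logs calls."""

    def __init__(self):
        self.log = []

    def reset(self, headers):
        self.log.append(("reset", tuple(headers)))

    def decode(self, packet):
        self.log.append(("decode", packet))


class SourceSwitcher:
    """Feeds one decoder from several sources, switching on keyframes."""

    def __init__(self, decoder, headers_by_source):
        self.decoder = decoder
        self.headers_by_source = headers_by_source
        self.current = None   # source currently being decoded
        self.pending = None   # source we want to switch to, if any

    def request(self, source):
        """Ask for a switch; it takes effect at that source's next keyframe."""
        if source not in self.headers_by_source:
            raise KeyError(source)
        if source != self.current:
            self.pending = source

    def feed(self, source, packet, is_keyframe):
        """Offer a packet read from `source`; return True if it was decoded."""
        if source == self.pending and is_keyframe:
            # Reinitialize with the target stream's three header packets
            # before handing the decoder its first keyframe.
            self.decoder.reset(self.headers_by_source[source])
            self.current, self.pending = source, None
        if source != self.current:
            return False
        self.decoder.decode(packet)
        return True
```

Until the target's keyframe shows up, packets from the old source keep flowing, so playback does not stall while waiting for the switch point.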

This isn’t a super exotic requirement; I’ve seen some provision for ‘start codes’ for MP4 adaptive streaming too.

Keyframe timing

More worrisome is that keyframe timing is not predictable in a Theora stream. This is actually due to the libtheora encoder internals — it allows you to specify a maximum keyframe interval, but it may decide at any time to insert a keyframe on its own if it thinks it’s more efficient to store a frame that way, at which point the interval starts counting from there instead of the last scheduled keyframe.

Since this heuristic is determined based on actual frame data, the early keyframes will appear at different times for renderings at different resolutions… And so will every keyframe following them.

This means you don’t have switch points that are consistent between sources, breaking the whole model!

It looks like a keyframe can be forced by changing the keyframe interval to 1 right before a desired keyframe, then changing it back to the desired value after. This would still produce some early keyframes at unpredictable times, but would also produce predictable ones. As long as the switchover points aren’t too frequent, that’s probably fine: just keep decoding over the extra keyframes, but only switch/segment on the predictable ones.
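A toy simulation shows why the forced keyframes rescue the model. It assumes a simplified encoder, not libtheora's actual heuristic: a keyframe is emitted on the first frame, whenever the maximum interval has elapsed since the last keyframe, at a set of "early" frames standing in for the encoder's own decisions, and optionally at forced positions on a fixed grid. All names and numbers are made up for illustration.

```python
def keyframes(total_frames, max_interval, early, forced_every=None):
    """Return the frame numbers that become keyframes in the toy model."""
    out = []
    last = None
    for n in range(total_frames):
        is_key = (
            last is None                      # first frame is always a keyframe
            or n - last >= max_interval       # interval counted from last keyframe
            or n in early                     # encoder decided on its own
            or (forced_every is not None and n % forced_every == 0)
        )
        if is_key:
            out.append(n)
            last = n
    return out


def common_switch_points(a, b):
    """Frames where both renditions have a keyframe."""
    return sorted(set(a) & set(b))


# Two renditions whose early keyframes land in different places.
low = keyframes(300, 64, early={10, 100})
high = keyframes(300, 64, early={37})
print(common_switch_points(low, high))

# Same renditions with a keyframe forced every 64 frames.
low_f = keyframes(300, 64, early={10, 100}, forced_every=64)
high_f = keyframes(300, 64, early={37}, forced_every=64)
print(common_switch_points(low_f, high_f))
```

In the unforced case the two schedules drift apart after the first early keyframe and only frame 0 lines up; with forcing, every multiple of 64 is shared, and the extra early keyframes are simply decoded through.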

Streams vs split files

Another note: it’s possible to either store data as one long file per source, or to split it up into small chunk files at each keyframe boundary.

Chunk files are nice because they can be streamed easily without use of the HTTP ‘Range’ header and they’re friendly to cache layers. Long files can be easier to manage on the server, but Wikimedia ops folks have told me that the way large files are stored doesn’t always interact ideally with the caching layer and they’d be much happier with split chunk files!
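For a sense of what the client side could look like with chunk files, here is a small sketch that maps a playback time to a chunk and builds its URL. It assumes every source is split at the same boundaries, so the chunk index carries over unchanged when the resolution changes. The URL layout and the names `chunk_for_time` and `chunk_url` are invented for this example.

```python
from bisect import bisect_right


def chunk_for_time(chunk_starts, t):
    """Index of the chunk covering time `t` in seconds.

    `chunk_starts` is a sorted list of chunk start times beginning at 0.
    """
    if t < 0:
        raise ValueError("time must be non-negative")
    return bisect_right(chunk_starts, t) - 1


def chunk_url(base, source, index):
    """Build a chunk URL under a hypothetical naming scheme."""
    return f"{base}/{source}/chunk-{index:05d}.ogv"


starts = [0.0, 10.0, 20.0, 30.0]
index = chunk_for_time(starts, 12.5)
print(chunk_url("https://example.org/video", "360p", index))
print(chunk_url("https://example.org/video", "720p", index))
```

Each chunk is a plain GET of a whole small file, which is what makes this friendly to caches and avoids the ‘Range’ header entirely.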

A downside of chunks is that it’s harder to download a complete copy of a file at a given resolution for offline playback. But if we split audio and video tracks, we’re in a world where that’s hard anyway… We can either just say “download the full resolution source then!” or provide a remuxer to produce combined files for download on the fly from the chunks… :)

The keyframe timing seems the ugliest issue to deal with; I may need to patch ffmpeg2theora or ffmpeg to work around it, but at least I shouldn’t have to mess with libtheora itself…