OGVKit – brionv

In my last post I described using AVSampleBufferDisplayLayer to outputÂ manually-uncompressed YUV video framesÂ in an iOS app, for playing WebM and Ogg files from Wikimedia Commons. After further experimentation I’ve decided to instead stick with using OpenGL ES directly, and here’s why…

640×360 output regularly displays with aÂ weird horizontal offset corruption on iPad Pro 9.7″. Bug filed asÂ rdar://29810344
Can’t get any pixel format with 4:4:4 subsampling to display. Theora and VP9 both support 4:4:4 subsampling, so thatÂ made some files unplayable.
Core Video pixel buffers for 4:2:2 and 4:4:4 are packed formats, and it prefers 4:2:0 to be a weird biplanar semi-packed format. This requires conversion from the planar output I already have, which may be cheap with NeonÂ instructions but isn’t free.

Instead,Â I’m treating each plane as a separate one-channel grayscale image, which works for any chroma subsampling ratios. I’m using some Core Video bits (CVPixelBufferPool and CVOpenGLESTextureCache) to doÂ texture setupÂ instead of manuallyÂ callingÂ glTeximage2d with a raw source blob, which improvesÂ a few things:

Can do CPU->GPU memory copy off main thread easily,Â without worrying about locking my GL context.
No pixel format conversions, so straight memcpy for each line…
Buffer pools are tied to the video buffer’s format object, and get swapped out automaticallyÂ when the formatÂ changes (new file, or file changes resolution).
Don’t have to manually account for stride != width in the texture setup!

ItÂ couldÂ be more efficient still if I could pre-allocate CVPixelBuffers with on-GPU memory and hand them to libvpx and libtheora to decode into… but they currently lack sufficient interfaces to accept frame buffers with GPU-allocated sizes.

A few other oddities I noticed:

The clean aperture rectangle setting doesn’t seem to be preserved when creating a CVPixelBuffer via CVPixelBufferPool; I have to re-set it when creating new buffers.
For grayscaleÂ buffers, the clean aperture doesn’t seem to be picked up by CVOpenGLESTextureGetCleanTexCoords. Not sure if this is only supposed to work with Y’CbCr buffer types or what… however I already have all these numbers in myÂ format object and just pull from there. :)

I also fell down a rabbit hole researching color space issues after noticing thatÂ some of the video formats support multipleÂ colorspace variants that may imply different RGB conversion matrices… and maybe gamma…. and what do R, G, and B mean anyway? :) Deserves another post sometime.

One of my little projects isÂ OGVKit, a library for playingÂ Ogg and WebM mediaÂ on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264Â dueÂ to patent licensing concerns, but Apple doesn’t support these formats, so we have to jump through some hoops…)

A trick withÂ working with digital video is that video frames are usually processed,Â compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used inÂ the rest of the digital display pipeline.

This means that you can’t just take the output from a video decoder andÂ blit it to the screenÂ — you need to know how toÂ dig out the pixel data and recombine it into RGB first.

Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader — I actually portedÂ it overÂ from ogv.js‘s WebGL drawing code.Â But surely a system like iOS with pervasive hardware-acceleratedÂ video playback already has some handy way to draw YUV frames?

While researching working with system-standard CMSampleBuffer objects to replace my custom OGVVideoBuffer class, I discovered that iOS 8 and later (and macOS version something) do have a such handy output path: AVSampleBufferDisplayLayer. This guy has threeÂ special tricks:

CMSampleBuffer objects go in, pretty pictures on screen come out!
CanÂ manageÂ a queue of buffers,Â synchronizing display times to a providedÂ clock!
If you pass compressed H.264 buffers, itÂ handles decompression transparently!

I’m decompressing from aÂ format AVFoundation doesn’t grok so the transparent decompression isn’t interesting to me, butÂ since it claimed to accept uncompressed buffers too I figured this might simplify myÂ display output path…

The queue system sounds like it might simplify my timing and state management, but is a bigger change to my code to make so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.

There are however two major caveats:

AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end upÂ repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. :)
Uncompressed frames must be in a very particular format or you get no visible output and no error messages.

Specifically, it wants a CMSampleBuffer backed by a CVPixelBufferÂ that’sÂ IOSurface-backed,Â using bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However libtheora and libvpx produce output in traditionalÂ tri-planar format, with separate Y, U and V planes.Â This meant I had toÂ createÂ buffers in appropriate format with appropriate backing memory, copy the Y plane, and thenÂ interleave the U and V planes into a single chroma muddle.

My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized IÂ had a really crappy loop around every pixel. ;)

Using memcpy — a highly optimized system function — to copy the luma lines cut the time down to 3-4ms per frame. AÂ little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame usingÂ a coupleÂ ARM-specific vectorÂ intrinsic functions, inspired by assembly code I found googling around forÂ YUV layout conversions.

It turns out you canÂ interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:

static inline void interleave_chroma(unsigned char *chromaCbIn, unsigned char *chromaCrIn, unsigned char *chromaOut) {
#if defined(__arm64) || defined(__arm)
    uint8x8x2_t tmp = { val: { vld1_u8(chromaCbIn), vld1_u8(chromaCrIn) } };
    vst2_u8(chromaOut, tmp);
#else
    chromaOut[0] = chromaCbIn[0];
    chromaOut[1] = chromaCrIn[0];
    chromaOut[2] = chromaCbIn[1];
    chromaOut[3] = chromaCrIn[1];
    chromaOut[4] = chromaCbIn[2];
    chromaOut[5] = chromaCrIn[2];
    chromaOut[6] = chromaCbIn[3];
    chromaOut[7] = chromaCrIn[3];
    chromaOut[8] = chromaCbIn[4];
    chromaOut[9] = chromaCrIn[4];
    chromaOut[10] = chromaCbIn[5];
    chromaOut[11] = chromaCrIn[5];
    chromaOut[12] = chromaCbIn[6];
    chromaOut[13] = chromaCrIn[6];
    chromaOut[14] = chromaCbIn[7];
    chromaOut[15] = chromaCrIn[7];
#endif
}

ThisÂ mightÂ be even faster if copying is doneÂ on a “slice” basis during decoding, while the bits of the frame being copied are in cache, but I haven’t tried this yet.

With the more efficient copies, the AVSampleBufferDisplayLayer-based outputÂ doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter forÂ conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload toÂ Wikimedia Commons…

Tag: OGVKit

Limitations of AVSampleBufferDisplayLayer on iOS

Drawing uncompressed YUV frames on iOS with AVSampleBufferDisplayLayer