Drawing uncompressed YUV frames on iOS with AVSampleBufferDisplayLayer

One of my little projects is OGVKit, a library for playing Ogg and WebM media on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264 due to patent licensing concerns, but Apple doesn’t support these formats, so we have to jump through some hoops…)

A trick with working with digital video is that video frames are usually processed, compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used in the rest of the digital display pipeline.

This means that you can’t just take the output from a video decoder and blit it to the screen — you need to know how to dig out the pixel data and recombine it into RGB first.

Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader — I actually ported it over from ogv.js‘s WebGL drawing code. But surely a system like iOS with pervasive hardware-accelerated video playback already has some handy way to draw YUV frames?

While researching working with system-standard CMSampleBuffer objects to replace my custom OGVVideoBuffer class, I discovered that iOS 8 and later (and macOS version something) do have a such handy output path: AVSampleBufferDisplayLayer. This guy has three special tricks:

  • CMSampleBuffer objects go in, pretty pictures on screen come out!
  • Can manage a queue of buffers, synchronizing display times to a provided clock!
  • If you pass compressed H.264 buffers, it handles decompression transparently!

I’m decompressing from a format AVFoundation doesn’t grok so the transparent decompression isn’t interesting to me, but since it claimed to accept uncompressed buffers too I figured this might simplify my display output path…

The queue system sounds like it might simplify my timing and state management, but is a bigger change to my code to make so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.

There are however two major caveats:

  • AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end up repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. 🙂
  • Uncompressed frames must be in a very particular format or you get no visible output and no error messages.

Specifically, it wants a CMSampleBuffer backed by a CVPixelBuffer that’s IOSurface-backed, using bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However libtheora and libvpx produce output in traditional tri-planar format, with separate Y, U and V planes. This meant I had to create buffers in appropriate format with appropriate backing memory, copy the Y plane, and then interleave the U and V planes into a single chroma muddle.

My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized I had a really crappy loop around every pixel. 😉

Using memcpy — a highly optimized system function — to copy the luma lines cut the time down to 3-4ms per frame. A little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame using a couple ARM-specific vector intrinsic functions, inspired by assembly code I found googling around for YUV layout conversions.

It turns out you can interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:

static inline void interleave_chroma(unsigned char *chromaCbIn, unsigned char *chromaCrIn, unsigned char *chromaOut) {
#if defined(__arm64) || defined(__arm)
    uint8x8x2_t tmp = { val: { vld1_u8(chromaCbIn), vld1_u8(chromaCrIn) } };
    vst2_u8(chromaOut, tmp);
#else
    chromaOut[0] = chromaCbIn[0];
    chromaOut[1] = chromaCrIn[0];
    chromaOut[2] = chromaCbIn[1];
    chromaOut[3] = chromaCrIn[1];
    chromaOut[4] = chromaCbIn[2];
    chromaOut[5] = chromaCrIn[2];
    chromaOut[6] = chromaCbIn[3];
    chromaOut[7] = chromaCrIn[3];
    chromaOut[8] = chromaCbIn[4];
    chromaOut[9] = chromaCrIn[4];
    chromaOut[10] = chromaCbIn[5];
    chromaOut[11] = chromaCrIn[5];
    chromaOut[12] = chromaCbIn[6];
    chromaOut[13] = chromaCrIn[6];
    chromaOut[14] = chromaCbIn[7];
    chromaOut[15] = chromaCrIn[7];
#endif
}

This might be even faster if copying is done on a “slice” basis during decoding, while the bits of the frame being copied are in cache, but I haven’t tried this yet.

With the more efficient copies, the AVSampleBufferDisplayLayer-based output doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter for conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload to Wikimedia Commons…