Drawing uncompressed YUV frames on iOS with AVSampleBufferDisplayLayer

One of my little projects is OGVKit, a library for playing Ogg and WebM media on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264 due to patent licensing concerns, but Apple doesn’t support Ogg or WebM, so we have to jump through some hoops…)

One tricky thing about working with digital video is that video frames are usually processed, compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used in the rest of the digital display pipeline.

This means that you can’t just take the output from a video decoder and blit it to the screen: you need to know how to dig out the pixel data and recombine it into RGB first.
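To make the "recombine it into RGB" step concrete, here is a minimal sketch of converting a single Y’CbCr sample to RGB. It assumes video-range BT.601 coefficients in fixed point, which is an assumption on my part (the right matrix depends on the stream), and the function names are made up for illustration.

```c
#include <stdint.h>

/* Clamp an intermediate value into the 0..255 range of an 8-bit channel. */
static inline uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one Y'CbCr sample to 8-bit RGB.
 * Assumption: video-range BT.601 (Y in 16..235, Cb/Cr centered on 128).
 * Coefficients are scaled by 256: 298 ~ 1.164, 409 ~ 1.596,
 * 100 ~ 0.391, 208 ~ 0.813, 516 ~ 2.018. The +128 rounds to nearest. */
static void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3]) {
    int c = (int)y - 16;
    int d = (int)cb - 128;
    int e = (int)cr - 128;
    rgb[0] = clamp_u8((298 * c + 409 * e + 128) / 256);           /* red   */
    rgb[1] = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256); /* green */
    rgb[2] = clamp_u8((298 * c + 516 * d + 128) / 256);           /* blue  */
}
```

A shader does the same arithmetic per fragment, just with floating-point matrix math instead of integers.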

Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader. I actually ported it over from ogv.js’s WebGL drawing code. But surely a system like iOS with pervasive hardware-accelerated video playback already has some handy way to draw YUV frames?

While researching how to replace my custom OGVVideoBuffer class with system-standard CMSampleBuffer objects, I discovered that iOS 8 and later (and macOS version something) do have such a handy output path: AVSampleBufferDisplayLayer. This guy has three special tricks:

CMSampleBuffer objects go in, pretty pictures on screen come out!
Can manage a queue of buffers, synchronizing display times to a provided clock!
If you pass compressed H.264 buffers, it handles decompression transparently!

I’m decompressing from a format AVFoundation doesn’t grok, so the transparent decompression isn’t interesting to me, but since it claimed to accept uncompressed buffers too I figured this might simplify my display output path…

The queue system sounds like it might simplify my timing and state management, but it’s a bigger change to my code, so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.

There are however two major caveats:

AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end up repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. :)
Uncompressed frames must be in a very particular format or you get no visible output and no error messages.

Specifically, it wants a CMSampleBuffer backed by a CVPixelBuffer that’s IOSurface-backed, using a bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However, libtheora and libvpx produce output in the traditional tri-planar format, with separate Y, U and V planes. This meant I had to create buffers in the appropriate format with appropriate backing memory, copy the Y plane, and then interleave the U and V planes into a single chroma muddle.
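To show what that layout change amounts to, here is a deliberately simple, unoptimized sketch of converting a tri-planar 4:2:0 frame into the bi-planar layout. It works on plain memory buffers rather than real CVPixelBuffer objects, and the function name, parameters, and stride handling are all illustrative assumptions, not OGVKit’s actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert tri-planar 4:2:0 (separate Y, Cb, Cr planes) to bi-planar 4:2:0
 * (a Y plane plus one plane of interleaved Cb,Cr pairs).
 * Strides are in bytes and may be larger than the visible width. */
static void planar_to_biplanar(const uint8_t *y_in, size_t y_in_stride,
                               const uint8_t *cb_in, const uint8_t *cr_in,
                               size_t c_in_stride,
                               uint8_t *y_out, size_t y_out_stride,
                               uint8_t *cbcr_out, size_t cbcr_out_stride,
                               size_t width, size_t height) {
    /* Luma samples are unchanged; only the row padding may differ. */
    for (size_t row = 0; row < height; row++) {
        for (size_t x = 0; x < width; x++) {
            y_out[row * y_out_stride + x] = y_in[row * y_in_stride + x];
        }
    }

    /* Chroma is subsampled by 2 in both directions; round up for odd sizes. */
    size_t c_width = (width + 1) / 2;
    size_t c_height = (height + 1) / 2;
    for (size_t row = 0; row < c_height; row++) {
        const uint8_t *cb = cb_in + row * c_in_stride;
        const uint8_t *cr = cr_in + row * c_in_stride;
        uint8_t *out = cbcr_out + row * cbcr_out_stride;
        for (size_t x = 0; x < c_width; x++) {
            out[2 * x] = cb[x];     /* Cb goes in the even bytes */
            out[2 * x + 1] = cr[x]; /* Cr goes in the odd bytes  */
        }
    }
}
```

With a real pixel buffer, the destination pointers and strides would presumably come from CVPixelBufferGetBaseAddressOfPlane and CVPixelBufferGetBytesPerRowOfPlane, between CVPixelBufferLockBaseAddress and CVPixelBufferUnlockBaseAddress calls.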

My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized I had a really crappy loop around every pixel. ;)

Using memcpy (a highly optimized system function) to copy the luma lines cut the time down to 3-4ms per frame. A little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame using a couple of ARM-specific vector intrinsic functions, inspired by assembly code I found googling around for YUV layout conversions.
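As a rough sketch of the memcpy step, the luma copy can go one row at a time, because the source and destination rows may be padded to different strides. The function name and parameters below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a luma plane with memcpy instead of a per-pixel loop.
 * Strides are in bytes; width is the number of visible bytes per row. */
static void copy_luma_plane(uint8_t *dst, size_t dst_stride,
                            const uint8_t *src, size_t src_stride,
                            size_t width, size_t height) {
    if (dst_stride == width && src_stride == width) {
        /* No padding on either side: the whole plane is one contiguous run. */
        memcpy(dst, src, width * height);
        return;
    }
    for (size_t row = 0; row < height; row++) {
        /* Copy only the visible bytes, leaving destination padding alone. */
        memcpy(dst + row * dst_stride, src + row * src_stride, width);
    }
}
```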

It turns out you can interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:

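#if defined(__arm64) || defined(__arm)
#include <arm_neon.h>
#endif

// Interleave 8 Cb samples and 8 Cr samples into 16 bytes of CbCr output.
static inline void interleave_chroma(unsigned char *chromaCbIn, unsigned char *chromaCrIn, unsigned char *chromaOut) {
#if defined(__arm64) || defined(__arm)
    // Two 8-byte vector loads, then a single interleaving store.
    uint8x8x2_t tmp = { .val = { vld1_u8(chromaCbIn), vld1_u8(chromaCrIn) } };
    vst2_u8(chromaOut, tmp);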
#else
    chromaOut[0] = chromaCbIn[0];
    chromaOut[1] = chromaCrIn[0];
    chromaOut[2] = chromaCbIn[1];
    chromaOut[3] = chromaCrIn[1];
    chromaOut[4] = chromaCbIn[2];
    chromaOut[5] = chromaCrIn[2];
    chromaOut[6] = chromaCbIn[3];
    chromaOut[7] = chromaCrIn[3];
    chromaOut[8] = chromaCbIn[4];
    chromaOut[9] = chromaCrIn[4];
    chromaOut[10] = chromaCbIn[5];
    chromaOut[11] = chromaCrIn[5];
    chromaOut[12] = chromaCbIn[6];
    chromaOut[13] = chromaCrIn[6];
    chromaOut[14] = chromaCbIn[7];
    chromaOut[15] = chromaCrIn[7];
#endif
}
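Since that function handles exactly 8 chroma samples per call, a full row needs a driver loop plus something for the leftover samples when the chroma width isn’t a multiple of 8. Here is one way that could look; interleave8 is a portable stand-in for the 8-sample step, and both names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for an 8-sample interleave step (hypothetical helper). */
static inline void interleave8(const uint8_t *cb, const uint8_t *cr, uint8_t *out) {
    for (int i = 0; i < 8; i++) {
        out[2 * i] = cb[i];
        out[2 * i + 1] = cr[i];
    }
}

/* Interleave one row of `count` Cb and Cr samples into 2 * count output bytes. */
static void interleave_chroma_row(const uint8_t *cb, const uint8_t *cr,
                                  uint8_t *out, size_t count) {
    size_t x = 0;
    /* Bulk of the row: full blocks of 8 samples. */
    for (; x + 8 <= count; x += 8) {
        interleave8(cb + x, cr + x, out + 2 * x);
    }
    /* Tail: remaining samples one at a time, so we never read or write
     * past the end of the row. */
    for (; x < count; x++) {
        out[2 * x] = cb[x];
        out[2 * x + 1] = cr[x];
    }
}
```

If the buffers are allocated with rows padded to a multiple of 8 chroma samples, the tail loop never runs and could be dropped.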

The copying might be even faster if done on a “slice” basis during decoding, while the bits of the frame being copied are still in cache, but I haven’t tried this yet.

With the more efficient copies, the AVSampleBufferDisplayLayer-based output doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter for conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload to Wikimedia Commons…