Limitations of AVSampleBufferDisplayLayer on iOS

In my last post I described using AVSampleBufferDisplayLayer to output manually-uncompressed YUV video frames in an iOS app, for playing WebM and Ogg files from Wikimedia Commons. After further experimentation I’ve decided to instead stick with using OpenGL ES directly, and here’s why…

  • 640×360 output regularly displays with a weird horizontal offset corruption on iPad Pro 9.7″. Bug filed as rdar://29810344
  • Can’t get any pixel format with 4:4:4 subsampling to display. Theora and VP9 both support 4:4:4 subsampling, so that made some files unplayable.
  • Core Video pixel buffers for 4:2:2 and 4:4:4 are packed formats, and it prefers 4:2:0 to be a weird biplanar semi-packed format. This requires conversion from the planar output I already have, which may be cheap with Neon instructions but isn’t free.

Instead, I’m treating each plane as a separate one-channel grayscale image, which works for any chroma subsampling ratios. I’m using some Core Video bits (CVPixelBufferPool and CVOpenGLESTextureCache) to do texture setup instead of manually calling glTeximage2d with a raw source blob, which improves a few things:

  • Can do CPU->GPU memory copy off main thread easily, without worrying about locking my GL context.
  • No pixel format conversions, so straight memcpy for each line…
  • Buffer pools are tied to the video buffer’s format object, and get swapped out automatically when the format changes (new file, or file changes resolution).
  • Don’t have to manually account for stride != width in the texture setup!

It could be more efficient still if I could pre-allocate CVPixelBuffers with on-GPU memory and hand them to libvpx and libtheora to decode into… but they currently lack sufficient interfaces to accept frame buffers with GPU-allocated sizes.

A few other oddities I noticed:

  • The clean aperture rectangle setting doesn’t seem to be preserved when creating a CVPixelBuffer via CVPixelBufferPool; I have to re-set it when creating new buffers.
  • For grayscale buffers, the clean aperture doesn’t seem to be picked up by CVOpenGLESTextureGetCleanTexCoords. Not sure if this is only supposed to work with Y’CbCr buffer types or what… however I already have all these numbers in my format object and just pull from there. :)

I also fell down a rabbit hole researching color space issues after noticing that some of the video formats support multiple colorspace variants that may imply different RGB conversion matrices… and maybe gamma…. and what do R, G, and B mean anyway? :) Deserves another post sometime.

 

 

Drawing uncompressed YUV frames on iOS with AVSampleBufferDisplayLayer

One of my little projects is OGVKit, a library for playing Ogg and WebM media on iOS, which at some point I want to integrate into the Wikipedia app to fix audio/video playback in articles. (We don’t use MP4/H.264 due to patent licensing concerns, but Apple doesn’t support these formats, so we have to jump through some hoops…)

A trick with working with digital video is that video frames are usually processed, compressed, and stored using the YUV (aka Y’CbCr) colorspace instead of the RGB used in the rest of the digital display pipeline.

This means that you can’t just take the output from a video decoder and blit it to the screen — you need to know how to dig out the pixel data and recombine it into RGB first.

Currently OGVKit draws frames using OpenGL ES, manually attaching the YUV planes as separate textures and doing conversion to RGB in a shader — I actually ported it over from ogv.js‘s WebGL drawing code. But surely a system like iOS with pervasive hardware-accelerated video playback already has some handy way to draw YUV frames?

While researching working with system-standard CMSampleBuffer objects to replace my custom OGVVideoBuffer class, I discovered that iOS 8 and later (and macOS version something) do have a such handy output path: AVSampleBufferDisplayLayer. This guy has three special tricks:

  • CMSampleBuffer objects go in, pretty pictures on screen come out!
  • Can manage a queue of buffers, synchronizing display times to a provided clock!
  • If you pass compressed H.264 buffers, it handles decompression transparently!

I’m decompressing from a format AVFoundation doesn’t grok so the transparent decompression isn’t interesting to me, but since it claimed to accept uncompressed buffers too I figured this might simplify my display output path…

The queue system sounds like it might simplify my timing and state management, but is a bigger change to my code to make so I haven’t tried it yet. You can also tell it to display one frame at a time, which means I can use my existing timing code for now.

There are however two major caveats:

  • AVSampleBufferDisplayLayer isn’t available on tvOS… so I’ll probably end up repackaging the OpenGL output path as an AVSampleBufferDisplayLayer lookalike eventually to try an Apple TV port. :)
  • Uncompressed frames must be in a very particular format or you get no visible output and no error messages.

Specifically, it wants a CMSampleBuffer backed by a CVPixelBuffer that’s IOSurface-backed, using bi-planar YUV 4:2:0 pixel format (kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
or kCVPixelFormatType_420YpCbCr8BiPlanarFullRange). However libtheora and libvpx produce output in traditional tri-planar format, with separate Y, U and V planes. This meant I had to create buffers in appropriate format with appropriate backing memory, copy the Y plane, and then interleave the U and V planes into a single chroma muddle.

My first super-naive attempt took 10ms per 1080p frame to copy on an iPad Pro, which pretty solidly negated any benefits of using a system utility. Then I realized I had a really crappy loop around every pixel. ;)

Using memcpy — a highly optimized system function — to copy the luma lines cut the time down to 3-4ms per frame. A little loop unrolling on the chroma interleave brought it to 2-3ms, and I was able to get it down to about 1ms per frame using a couple ARM-specific vector intrinsic functions, inspired by assembly code I found googling around for YUV layout conversions.

It turns out you can interleave 8 pixels at a time in three instructions using two vector reads and one write, and I didn’t even have to dive into actual assembly:

static inline void interleave_chroma(unsigned char *chromaCbIn, unsigned char *chromaCrIn, unsigned char *chromaOut) {
#if defined(__arm64) || defined(__arm)
    uint8x8x2_t tmp = { val: { vld1_u8(chromaCbIn), vld1_u8(chromaCrIn) } };
    vst2_u8(chromaOut, tmp);
#else
    chromaOut[0] = chromaCbIn[0];
    chromaOut[1] = chromaCrIn[0];
    chromaOut[2] = chromaCbIn[1];
    chromaOut[3] = chromaCrIn[1];
    chromaOut[4] = chromaCbIn[2];
    chromaOut[5] = chromaCrIn[2];
    chromaOut[6] = chromaCbIn[3];
    chromaOut[7] = chromaCrIn[3];
    chromaOut[8] = chromaCbIn[4];
    chromaOut[9] = chromaCrIn[4];
    chromaOut[10] = chromaCbIn[5];
    chromaOut[11] = chromaCrIn[5];
    chromaOut[12] = chromaCbIn[6];
    chromaOut[13] = chromaCrIn[6];
    chromaOut[14] = chromaCbIn[7];
    chromaOut[15] = chromaCrIn[7];
#endif
}

This might be even faster if copying is done on a “slice” basis during decoding, while the bits of the frame being copied are in cache, but I haven’t tried this yet.

With the more efficient copies, the AVSampleBufferDisplayLayer-based output doesn’t seem to use more CPU than the OpenGL version, and using CMSampleBuffers should allow me to take output from the Ogg and WebM decoders and feed it directly into an AVAssetWriter for conversion into MP4… from there it’s a hop, skip and a jump to going the other way, converting on-device MP4 videos into WebM for upload to Wikimedia Commons…