Following up on earlier musings… Found an interesting analysis by someone who did some work in this area a few years ago.
Based on the way data flows from macroblocks adjacent above or to the left, it is safe to parallelize multiple super blocks in diagonal lines running from top-right towards the bottom-left. This means you start with a batch of one macroblock, then a batch of two, then three…. Up to the maximum diagonal dimension (30 blocks at 480p or 68 at 1080p) then back down again to one at the far bottom right corner. Hmm, not exactly diagonal, actually half diagonal?
(Based on my reading the filter stage doesn’t need to access the above-right block which simplifies things, but I could be wrong… Intraframe prediction looks scarier though and may need the different angle cut. Anyway the half diagonal order should work for the entire set of all operations…).
This breakdown should be suitable both for CPU worker-based threading (breaking the batches into smaller sub-batches per core) and for WebGL shader based GPU work (where each large batch would issue several draw calls, each processing up to the full batch size of macroblocks).
For CPU work there could also be pipelining between the data stages, though if doing full batching that may not be necessary.
Data locality and latency
On modern computers, accessing memory is often the slowest part of a naively written calculation. If the memory you’re working with isn’t in cache, fetching it can be verrrry slow. And if you need to transfer data between the CPU and GPU things get even nastier.
Unpacking the entropy-coded data structures for an entire frames worth of macroblocks then processing them in a different order sounds like it might be expensive. But it shouldn’t be insanely so; most processing will be local to each block.
For GPU usage, data also flows mostly one way from the CPU into textures and arrays uploaded into the GPU, where random texel access should be fast. We shouldn’t need to read any of that data back to the CPU until the end of the loop filter, when we read the YUV buffers back to feed into the next frame’s predictions and output to the player.
What we do need for the GPU is to read the results of each batch’s computations back into the next batch. As long as I can render into a texture and use that texture as a source for the next call I think that all works as I need it.
Loop filter madness
The loop filter stage reduces artifacting at the edges of macroblocks and sub-blocks from motion prediction and DCT fun. Because it’s very precisely specced, and the output feeds back into the next frame’s predictions, it’s important to get this right.
per spec, each macroblock may apply up to four filter passes:
subblock left edges
subblock top edges
Along each edge, for each pixel along the edge the boundaries are tested for a threshold and a filter may or may not be applied, variously to one, two, or three pixels deep from the edge. For the macroblock edges, when the filter applies it also modifies the adjacent block data!
It’s all pretty funky, but it looks parallelizable within limits.
Ah, now to find time to research this further while still getting other stuff done. 😉