Parallelizing PNG compression, part 3: what’s in a PNG anyway?

While I’m waiting to figure out how to build and test my patch to GNOME Shell to speed up screenshotting, let’s look at the other problem — speeding up the compression of large PNG images on multicore systems.

What’s in a PNG anyway?

The PNG image format is composed of a series of “tagged chunks”: sections of a file with a 4-byte length, a 4-byte type tag, a data payload, and a CRC-32 covering the tag and payload. The only one we care about here is the IDAT chunk, which contains the compressed image data.
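Just to make that layout concrete, here’s a minimal sketch of serializing a single chunk. The write_chunk helper is hypothetical (my naming, not a libpng API); handily, zlib’s crc32() computes the same CRC-32 that PNG uses.

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h> /* crc32() matches PNG's CRC-32 */

/* Hypothetical helper: serialize one PNG chunk into `out`.
 * Layout: 4-byte big-endian length, 4-byte type tag, the payload,
 * then a CRC-32 covering the type tag and payload.
 * Returns the number of bytes written (len + 12). */
static size_t write_chunk(uint8_t *out, const char type[4],
                          const uint8_t *data, uint32_t len)
{
    uint32_t crc;
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + 4, type, 4);
    memcpy(out + 8, data, len);
    crc = (uint32_t)crc32(0L, (const Bytef *)type, 4);
    crc = (uint32_t)crc32(crc, data, len);
    out[8 + len]  = (uint8_t)(crc >> 24);
    out[9 + len]  = (uint8_t)(crc >> 16);
    out[10 + len] = (uint8_t)(crc >> 8);
    out[11 + len] = (uint8_t)crc;
    return (size_t)len + 12;
}
```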

From the perspective of the compressor, there are a few operations that need to be done, in a pipeline:

  1. Pack input pixels into the correct format.
  2. Apply one of five filter types to each scanline, which can improve the compressibility of the bytes (type 0, “None”, leaves them untouched).
  3. Send the filter type byte and the filtered bytes into the ‘deflate’ compressor.
  4. Write compressed bytes to output as one or more IDAT chunks.

There’s no particular benefit to using multiple IDAT chunks unless you’re aggressively streaming output; each extra chunk just adds 12 bytes of overhead (length, type tag, and CRC) at the boundary.

libpng’s pixel format transformations were surprisingly expensive in the profiling run I did (>10% of samples!), so this step could probably be improved in its own right. I haven’t yet looked into that.

Filters are applied byte-wise on the packed pixel data, and are described in detail in the PNG spec. They require the previous line as input (or a zeroed-out line for the first row!). Five filter modes are defined, and encoders are recommended to use heuristics to guess which mode will compress best (say, by trying them all). I haven’t yet dived into the libpng source to see which algorithm it uses, but profiling shows the filter application step taking about the same CPU time as the deflate compression step (~27% of CPU samples).
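For a taste of what a filter actually does, here’s a minimal sketch of the Paeth filter (type 4) as the spec defines it; the function names are mine.

```c
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor from the PNG spec: pick whichever of
 * left (a), above (b), or upper-left (c) is closest to a + b - c. */
static uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c)
{
    int p = (int)a + (int)b - (int)c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Filter one scanline of `len` bytes: subtract the predictor from
 * each byte, modulo 256. `bpp` is bytes per complete pixel; `prev`
 * is the previous scanline, all zeros for the first row. */
static void filter_paeth(uint8_t *out, const uint8_t *cur,
                         const uint8_t *prev, size_t len, size_t bpp)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t a = (i >= bpp) ? cur[i - bpp] : 0;
        uint8_t b = prev[i];
        uint8_t c = (i >= bpp) ? prev[i - bpp] : 0;
        out[i] = (uint8_t)(cur[i] - paeth_predict(a, b, c));
    }
}
```

One heuristic the spec suggests for picking a mode is to filter the line every way and keep whichever gives the minimum sum of absolute values, on the theory that smaller residuals compress better; I don’t know yet whether that’s what libpng actually does.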

The filter mode byte and the filtered data are then pumped into the deflate compressor, implemented by the standard zlib library.

So what’s in a deflate?

Now we’re at the fun part. That filtering stuff sounded really easy to parallelize by breaking up the input into line-wise chunks, but what about the compression?

A compressed deflate stream, as PNG uses it, is zlib-wrapped: a 2-byte header followed by a series of blocks, either uncompressed or compressed. Uncompressed (“stored”) blocks have a count of bytes to copy (with a maximum of 65,535), and compressed blocks are composed of Huffman-coded symbol lists. Oh and for fun, compressed blocks do not have to be aligned on byte boundaries. They’re actually bit streams! And they have no maximum size… Block boundaries depend on the input data and how well the compressor identifies redundancy. Then at the end there’s an adler32 checksum of the input data.
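Sketched out (RFC 1950 for the wrapper, RFC 1951 for deflate itself), the stream inside IDAT looks roughly like this:

```c
/*
 * zlib-wrapped deflate stream, as stored in IDAT:
 *
 *   +-----+-----+~~~~~~~~~~~~~~~~~~~~~+----+----+----+----+
 *   | CMF | FLG | deflate blocks ...  |      adler32      |
 *   +-----+-----+~~~~~~~~~~~~~~~~~~~~~+----+----+----+----+
 *    2-byte header                      4-byte checksum
 *
 * Each deflate block starts with a 3-bit header: BFINAL (1 bit,
 * set only on the last block) and BTYPE (2 bits: 00 = stored,
 * 01 = fixed Huffman codes, 10 = dynamic Huffman codes).
 * Only stored blocks are byte-aligned; everything else is a
 * bit stream.
 */
```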

The easiest way to get parallelism is to divide the input data (the filtered pixels) into chunks of some size that’s large enough not to hurt compression much, but not so large that there aren’t enough chunks to fill the CPU cores. Compress each chunk separately, then stitch the resulting output back together, fixing up the headers and checksums.

By “easy” I meant “slightly tricky”, but not totally impossible. :)

A few tricks to this:

  • To ensure each chunk’s output ends on a byte boundary, pass Z_SYNC_FLUSH (or Z_FINISH on the last chunk) to deflate(). If necessary this adds an empty uncompressed block to the output, which resyncs to a byte boundary (see the sketch after this list).
  • Blocks can refer to prior input data within a “sliding window” of up to 32KiB, gaining a little extra compression across blocks. This can be preserved by using deflateSetDictionary to pass the previous 32KiB of input as the dictionary. (But be warned, with the zlib wrapper that changes the header bytes you have to skip over!)
  • The adler32 checksums of each chunk’s input can be re-combined using adler32_combine, or could be calculated separately from the deflate call.
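Here’s a minimal sketch of how those tricks fit together for one chunk of input. compress_slice is my own hypothetical function, not pigz or libpng code, and it uses raw deflate (negative windowBits) so each slice carries no zlib header at all; the 2-byte header and combined checksum get written by hand at stitch time.

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical: compress one slice of the filtered image data as an
 * independent raw deflate stream, suitable for running on its own
 * thread. `dict` is up to 32KiB of the previous slice's *input*,
 * preserving cross-boundary matches. Assumes `out_cap` is big
 * enough (e.g. sized via deflateBound()). Returns 0 on success. */
static int compress_slice(const uint8_t *in, size_t in_len,
                          const uint8_t *dict, size_t dict_len,
                          int is_last,
                          uint8_t *out, size_t out_cap, size_t *out_len)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    /* windowBits = -15 requests raw deflate: no per-slice header
     * or trailer to splice around when concatenating. */
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    if (dict_len > 0)
        deflateSetDictionary(&zs, dict, (uInt)dict_len);

    zs.next_in = (Bytef *)in;
    zs.avail_in = (uInt)in_len;
    zs.next_out = out;
    zs.avail_out = (uInt)out_cap;

    /* Z_SYNC_FLUSH pads with an empty stored block so the slice
     * ends on a byte boundary; Z_FINISH marks the final block. */
    int ret = deflate(&zs, is_last ? Z_FINISH : Z_SYNC_FLUSH);
    *out_len = out_cap - zs.avail_out;
    deflateEnd(&zs);
    return (ret == Z_OK || ret == Z_STREAM_END) ? 0 : -1;
}
```

Stitching is then: write a standard 2-byte zlib header (0x78 0x9C matches the default settings), concatenate the slices in order, and append the adler32 of the whole input, built per-slice with adler32(1L, slice, len) and merged with adler32_combine().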

All this stuff is done by the excellent standalone tool pigz, but it’s not available as a library (yet!).

Parallelizing that PNG

The one example I’ve found of someone attempting this specifically for PNG is png-parallel, an experiment which seems to have been abandoned some time ago. I made a fork and fixed it up to build on Fedora 28 and more or less work, and it looks promising so far.

The actual code is a bit hacky, and it uses OpenMP for threading (which is probably a dependency nightmare), but with a few tweaks I got it building, producing valid PNG files, and scaling pretty well up to 8 threads and beyond!

There are two limitations which prevent me from showing any benchmarks yet:

  • the null filter is used instead of running the adaptive filter heuristics, so it both runs faster than it should and compresses less well than it should
  • it uses high bit depth if ImageMagick is configured for 16-bit channel depth by default, whereas most files being processed will only be 8 bits deep, so it takes longer and compresses worse than it should

The really hacky part is that it reaches into libpng internals. :) That may or may not be necessary with current APIs, so I’ll poke at it some more later to see if I can make it work more cleanly.

Next steps

I think it’d be useful to have a library that ties into libpng and handles just the IDAT chunk, compressing it up parallel-style. It’d use standard ol’ pthreads, so you’d have to link that too.

I have to read up a bit more on libpng to see how easy it is to plug in custom image writing “correctly”, or whether I should “just” redo the entire library. ;) In which case I might do it in Rust. :P

Stay tuned for our next post, dear readers, about building and patching GNOME Shell!