Parallelizing PNG compression, part 3: what’s in a PNG anyway?

While I’m waiting to figure out how to build and test my patch to GNOME Shell to speed up screenshotting, let’s look at the other problem — speeding up the compression of large PNG images on multicore systems.

What’s in a PNG anyway?

The PNG image format is composed of a series of “tagged chunks”: sections of a file with a type tag, a length, a data payload, and a CRC. The only one that we care about here is the IDAT chunk, which contains the compressed image data.

From the perspective of the compressor, there are a few operations that need to be done, in a pipeline:

  1. Pack input pixels into the correct format.
  2. Optionally apply one of 5 filters which can improve the compressibility of the bytes.
  3. Send the filter ID and the bytes into the ‘deflate’ compressor.
  4. Write compressed bytes to output as one or more IDAT chunks.

There’s no particular benefit to using multiple IDAT chunks unless you’re aggressively streaming output, in which case each extra chunk boundary costs you 12 bytes of overhead (the length, type, and CRC fields).
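As a concrete illustration, here’s a minimal sketch of writing one chunk, using stock zlib’s crc32(). The layout comes straight from the PNG spec, but the helper itself is mine:

    #include <stdint.h>
    #include <stdio.h>
    #include <zlib.h>   /* for crc32() */

    /* Write one PNG chunk: 4-byte big-endian length, 4-byte type code,
     * payload, then a CRC-32 over the type and payload. Each chunk
     * boundary therefore costs 12 bytes of overhead. */
    static void write_chunk(FILE *out, const char type[4],
                            const uint8_t *data, uint32_t len)
    {
        uint8_t be_len[4] = { len >> 24, len >> 16, len >> 8, len };
        fwrite(be_len, 1, 4, out);
        fwrite(type, 1, 4, out);
        fwrite(data, 1, len, out);

        uint32_t crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, (const Bytef *)type, 4);
        crc = crc32(crc, data, len);
        uint8_t be_crc[4] = { crc >> 24, crc >> 16, crc >> 8, crc };
        fwrite(be_crc, 1, 4, out);
    }

A screenshot’s compressed data would then go out as write_chunk(out, "IDAT", compressed, compressed_len), either once or split over several calls.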

libpng’s pixel format transformations were surprisingly expensive in the profiling run I did (over 10% of samples!), and this step can probably be improved in its own right. I haven’t yet looked into that.

Filters are applied byte-wise on the packed pixel data and are described in detail in the PNG spec. They require the previous line as input (or an all-zero line for the first line!). Five filter types are defined (None, Sub, Up, Average, and Paeth), and encoders are recommended to use heuristics to guess which type will compress best, say by trying them all. I haven’t yet dived into the libpng source for the algorithm it uses, but profiling shows the filter application step taking about the same CPU time as the deflate compression step (~27% of CPU samples).
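To make that concrete, here’s a minimal sketch of the simplest nontrivial one, the “Up” filter, which encodes each byte as its difference from the byte directly above it. The function and naming are mine, not libpng’s:

    #include <stddef.h>
    #include <stdint.h>

    /* PNG "Up" filter (type 2): each output byte is the current byte
     * minus the byte at the same position on the previous scanline,
     * wrapping mod 256. For the first scanline, prev points to an
     * all-zero line. */
    static void filter_up(uint8_t *out, const uint8_t *cur,
                          const uint8_t *prev, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            out[i] = cur[i] - prev[i];
    }

The heuristic the spec suggests is to filter each line all five ways and keep the output with the smallest sum of absolute values, on the theory that runs of small numbers compress better.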

The filter mode byte and the filtered data are then pumped into the deflate compressor, implemented by the standard zlib library.

So what’s in a deflate?

Now we’re at the fun part. That filtering stuff sounded really easy to parallelize by breaking up the input into line-wise chunks, but what about the compression?

A zlib-wrapped deflate stream consists of a 2-byte header followed by a series of blocks, either uncompressed (“stored”) or compressed. Stored blocks have a count of bytes to copy (with a maximum size of 65,535 bytes), while compressed blocks are composed of Huffman-coded symbol lists. Oh, and for fun, compressed blocks do not have to be aligned on byte boundaries. They’re actually bit streams! And they have no maximum size… Block boundaries depend on the input data and how well the compressor identifies redundancy. Then at the end there’s an adler32 checksum of the input data.
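Just to show how un-byte-friendly this is, here’s a toy that decodes the very first block header of a raw deflate stream. It only works for the first block, because every later block header generally starts mid-byte, which is exactly the problem:

    #include <stdint.h>
    #include <stdio.h>

    /* Deflate packs bits LSB-first. In the first byte of a raw deflate
     * stream, bit 0 is BFINAL (last-block flag) and bits 1-2 are BTYPE
     * (0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman). */
    static void peek_first_block(const uint8_t *raw)
    {
        printf("BFINAL=%d BTYPE=%d\n", raw[0] & 1, (raw[0] >> 1) & 3);
    }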

The easiest way to get parallelism is to divide the input data (the filtered pixels) into chunks of some size that’s large enough not to hurt compression much, but not so large that you don’t have enough chunks to fill the CPU cores. Compress each chunk separately, then stitch the resulting output back together, fixing up the headers and checksums.

By “easy” I meant “slightly tricky”, but not totally impossible. :)

A few tricks to this (sketched in code after the list):

  • To ensure each chunk’s output ends on a byte boundary, pass the Z_SYNC_FLUSH option (or Z_FINISH, on the last chunk) to deflate. If necessary this adds an empty uncompressed block to the output, which resyncs to a byte boundary.
  • Blocks can refer to prior input data in a “sliding window” of up to 32 KiB, to gain a little extra compression across chunks. This can be handled by using deflateSetDictionary to pass the previous 32 KiB of input as the dictionary. (But be warned: that changes the header bytes you have to skip over!)
  • The adler32 checksums of each chunk’s input can be re-combined using adler32_combine, or could be calculated separately from the deflate call.
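Putting those tricks together, here’s a rough sketch of what one worker thread might do with stock zlib. To be clear, this is my own illustration rather than pigz’s actual code; error checking is elided and the output buffer is assumed big enough (deflateBound() can size it):

    #include <string.h>
    #include <zlib.h>

    #define WINDOW 32768   /* deflate's maximum sliding window: 32 KiB */

    /* Compress one chunk of filtered scanline data as raw deflate
     * (negative windowBits = no zlib header/trailer), primed with the
     * previous chunk's tail so matches can cross the boundary.
     * Returns the number of bytes written to out. */
    static size_t compress_chunk(const unsigned char *in, size_t in_len,
                                 const unsigned char *dict, size_t dict_len,
                                 unsigned char *out, size_t out_cap,
                                 int is_last)
    {
        z_stream zs;
        memset(&zs, 0, sizeof zs);
        deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY);

        /* Prime the window with (at most) the last 32 KiB of the
         * previous chunk's input. */
        if (dict_len > WINDOW) {
            dict += dict_len - WINDOW;
            dict_len = WINDOW;
        }
        if (dict_len > 0)
            deflateSetDictionary(&zs, dict, (uInt)dict_len);

        zs.next_in   = (Bytef *)in;
        zs.avail_in  = (uInt)in_len;
        zs.next_out  = out;
        zs.avail_out = (uInt)out_cap;

        /* Z_SYNC_FLUSH pads the output to a byte boundary with an
         * empty stored block; Z_FINISH marks the true end of stream. */
        deflate(&zs, is_last ? Z_FINISH : Z_SYNC_FLUSH);

        size_t produced = out_cap - zs.avail_out;
        deflateEnd(&zs);
        return produced;
    }

The main thread then writes the 2-byte zlib header, concatenates each worker’s output in order, and appends the final checksum, which it gets by having each worker run adler32() over its own input and merging the results with adler32_combine().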

All this stuff is done by the excellent standalone tool pigz, but it’s not available as a library (yet!).

Parallelizing that PNG

The one example I’ve found of someone attempting this specifically for PNG is png-parallel, an experiment that seems to have been abandoned some time ago. I made a fork, fixed it up to build on Fedora 28 and more or less work, and it looks promising so far.

The actual code is a bit hacky, and it uses OpenMP for threading, which is probably a dependency nightmare, but with a few tweaks I got it building, producing valid PNG files, and scaling pretty well up to 8 threads and beyond!

There are two limitations which prevent me from showing any benchmarks yet:

  • the null filter is used instead of the adaptive filter heuristics, so it both runs faster and compresses less well than it should
  • it uses high bit depth if ImageMagick is configured for 16-bit channel depth by default, whereas most files being processed will be 8 bits deep, so it takes longer and compresses worse than it should

The really hacky part is that it reaches into libpng internals. :) That may or may not be necessary with current APIs, so I’ll do some more poking at it later to see if I can make it work more cleanly.

Next steps

I think it’d be useful to have a library that ties into libpng and does just the IDAT chunk, compressing it up parallel-style. It’d use standard ol’ pthreads, so you’d have to link that too.
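Something shaped roughly like this, where every name is hypothetical since none of it exists yet:

    /* Entirely hypothetical API sketch -- nothing here exists yet. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct png_parallel_writer png_parallel_writer;

    /* Create a writer with the given worker count (0 = one per core). */
    png_parallel_writer *png_parallel_writer_new(size_t threads);

    /* Feed packed scanlines; filtering and deflate run on the workers,
     * and finished IDAT data comes back in order. */
    int png_parallel_write_rows(png_parallel_writer *w,
                                const uint8_t *rows, size_t n_rows,
                                size_t row_bytes);

    /* Flush remaining work and hand back the complete IDAT payload. */
    int png_parallel_finish(png_parallel_writer *w,
                            uint8_t **idat, size_t *idat_len);

    void png_parallel_writer_free(png_parallel_writer *w);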

I have to read up a bit more on libpng to see how easy it is to plug in custom image writing “correctly”, or whether I should “just” redo the entire library. ;) In which case I might do it in Rust. :P

Stay tuned for our next post, dear readers, about building and patching GNOME Shell!

Parallelizing PNG compression, part 2: bilingual GNOME Shell

In the first part of our adventure, I regaled you with the tale of slow screenshots on a Linux GNOME desktop with the extremes of a slow-but-multicore CPU and a very high resolution display.

The problem was twofold:

  1. GNOME Shell’s screenshot tool didn’t show audiovisual feedback until after the 7680×2160 image was saved as a PNG file, which was surprisingly slow.
  2. Saving PNG files is single-threaded, wasting 7/8 of available CPU power.

With a solution at hand for each:

  1. Split the screenshot into two asynchronous operations: capture and compression. Show audiovisual feedback after capture and run the effects in parallel with the compression.
  2. Parallelize the filter and compression steps of PNG image writing across multiple threads.

Simply moving the audiovisual feedback earlier in the screenshot operation in GNOME Shell will eliminate the “feel” of the delay, solving most of my specific problem! But PNG writing can totally be parallelized too, and we’ll get to that in later blog posts. :)

GNOME Shell: How Does It Work?

GNOME Shell is a strange beast that takes a central role in the 3.x generation of the GNOME Desktop experience. It serves as the desktop compositor / window manager, a sort of intermediary for drawing all your windows, and it runs the interactive title bars, system menus, taskbar/sidebar, application selector, etc. It’s written in a combination of C (using GLib’s GObject system) and JavaScript. And it can have plugins and stuff.

But the part we care about is its screenshot service, which is what’s triggered when we hit PrintScreen!

This is divided into two parts: a C component which uses the low-level libraries to read a portion of the screen and save it to a PNG, and a JavaScript component which exports a D-Bus service wrapping a call to the C component and adding the “camera flash” effect for audiovisual feedback to the user.

Break it down

The C ScreenshotService object’s various methods (for full screen, window, area…) schedule an actual screen capture for the next paint event in the compositor, and then return control to JS, using a GTask to manage the asynchronous return callback.

On the paint event, the pixbuf is captured on the main thread (maybe 0.1s in this case), then a task thread is started to do the PNG compression and write it to a file. The main thread continues on, keeping the slow PNG stuff (1.25s in this extreme case!) from blocking the UI!

When the file is written out, the task calls back to the main thread and into the JavaScript callback, where it asks the service for the final filename and starts the audiovisual feedback.
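In GLib terms, the pattern looks roughly like this. This is a minimal sketch of the general GTask idiom rather than the actual gnome-shell code, and the function names are placeholders:

    #include <gio/gio.h>

    /* Runs on a worker thread: the slow PNG compress-and-write step. */
    static void
    write_png_in_thread (GTask        *task,
                         gpointer      source_object,
                         gpointer      task_data,
                         GCancellable *cancellable)
    {
        /* ... compress the captured pixels and write them to disk ... */
        g_task_return_boolean (task, TRUE);
    }

    /* Called on the main thread once the capture exists. */
    static void
    start_png_write (GObject *service, GAsyncReadyCallback callback,
                     gpointer user_data)
    {
        GTask *task = g_task_new (service, NULL, callback, user_data);
        g_task_run_in_thread (task, write_png_in_thread);
        g_object_unref (task);  /* the running task holds its own ref */
    }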

Build it back up

For fun, I’ve started on a provisional patch which separates the single capture+compress operation into two separate operations. Now, the JS first calls the appropriate screenshot method to do the capture.

When the callback comes back, we’ve got a capture but nothing has been compressed or written to disk — the perfect time to start modifying the screen by starting a visual flash effect! If the capture succeeded, it’s also the time to call a new method to compress & write the previously-captured image to disk.

This seems relatively straightforward to do by untangling and re-tangling some code. Everything’s already wrapped in an asynchronous GTask, and it’s mostly a matter of separating out, duplicating, and de-duplicating a couple of bits.

Patch is in progress, but I haven’t built or tested it yet because I’m still figuring out how to test a custom version of the latest development GNOME Shell on a desktop running a current release OS. :)

And that adventure, dear readers, will be the subject of our next post!

Parallelizing PNG compression, part 1

I’ve noticed that on my Linux workstation (an older CPU with a current graphics card and dual 4K monitors) taking screenshots by pressing “PrintScreen” in the GNOME desktop seems really slow: specifically, there is a long delay of a second or more between tapping the key and the start of the audiovisual “camera flash” feedback.

Profiling with the “perf” tool and checking the actual source for gnome-shell confirmed three things:

  1. gnome-shell doesn’t start the camera flash effect until after the screenshot is saved to disk as a PNG file, meaning the delay is mostly up to the PNG image compression.
  2. PNG compression is slow, and single-threaded! It takes about 1.25s to save a 7680×2160 desktop screenshot PNG on this machine, and 0.75s on my faster laptop. A top-of-the-line modern workstation can probably hit 0.5s, but that’s still a longish delay for feedback on a keyboard event.
  3. The two biggest contributors to CPU time are filtering the output pixels before compression (part of libpng), and the actual “deflate” compression (part of zlib).

I’ve divided the problem up into two areas to work on, one of which looks easy and one of which looks tractable but is a little more work to do right:

  1. Patch gnome-shell to split the screenshot provider into separate “capture” and “write” steps, then start the camera flash effect after capture to run in parallel with the compression. On most machines the screenshot will already be compressed before the visual effect finishes, making it seem instantaneous! On corner cases like mine, it’ll still feel fast, but the PNG may not be available until after the flash finishes.
  2. Create a library to write PNGs using multithreading [for the actual filter/compression stages] (either as a helper for libpng, or standalone), and figure out how to get it used in various places. (My screenshot case would be gdk-pixbuf.)

The patch for gnome-shell is mostly written but I haven’t yet figured out how to build and test the latest gnome-shell on Fedora 28. :) Will poke at that in a bit.

For parallelizing the actual compression, I found a number of tools to do multithreaded gzip/deflate wrapping around regular zlib, and one old half-finished experiment at specifically doing so for PNG saving. Learn more about them in our upcoming blog post part 2! (ooh! exciting!)