Parallelizing PNG, part 8: Rust macros for constant specialization

In my last posts I covered profiling and some tips for optimizing inner loops in Rust code while working on a multithreaded PNG encoder. Rust’s macro system is another powerful tool for simplifying your code, and sometimes awesomeizing your performance…

Rust has a nice system of generics, where types and functions can be specialized based on type and lifetime parameters. For instance, a Vec<u8> and a Vec<u32> both use the same source code describing the Vec structure and all its functions, but at compile time any actual Vec<T> variants get compiled separately, with as efficient a codebase as possible. So this gives a lot of specialization-based performance already, and should be your first course of action for most things.

Unfortunately you can’t vary generics over constant values, like integers, which turns out to sometimes be something you really wish you could have!

In mtpng, the PNG image filtering code needs to iterate through rows of bytes and refer back to bytes from a previous pixel. This requires an offset that’s based on the color type and bit depth. From our last refactoring of the main inner loop it looked like this, using Rust’s iterator system:

let len = out.len();
for (dest, cur, left, up, above_left) in
    izip!(&mut out[bpp ..],
          &src[bpp ..],
          &src[0 .. len - bpp],
          &prev[bpp ..],
          &prev[0 .. len - bpp]) {
    *dest = func(*cur, *left, *up, *above_left);

When bpp is a variable argument to the function containing this loop, everything works fine — but total runtime is a smidge lower if I replace it with a constant value.

Nota bene: In fact, I found that the improvements from this macro hack got smaller and smaller as I made other optimizations, to the point that it’s now saving only a single instruction per loop iteration. :) But for times when it makes a bigger difference, keep reading! Always profile your code to find out what’s actually slower or faster!

Specializing macros

The filter functions look something like this, without the specialization goodies I added:

fn filter_paeth(bpp: usize, prev: &[u8], src: &[u8], dest: &mut [u8]) {
    dest[0] = Filter::Paeth as u8;

    filter_iter(bpp, &prev, &src, &mut dest[1 ..],
	            |val, left, above, upper_left| -> u8
        val.wrapping_sub(paeth_predictor(left, above, upper_left))

The filter_iter function is the wrapper func from previous post; it runs the inner loop and calls the closure with the actual filter (inlining the function call out to make it zippy in release builds). Rust + LLVM do a great job of optimizing things up already, and this is quite fast — especially since moving to iterators.

But if we need to specialize on something we can’t express as a type constraint on the function definition… macros are your friend!

The macro-using version of the function looks very similar, with one addition:

fn filter_paeth(bpp: usize, prev: &[u8], src: &[u8], dest: &mut [u8]) {
    filter_specialize!(bpp, |bpp| {
        dest[0] = Filter::Paeth as u8;

        filter_iter(bpp, &prev, &src, &mut dest[1 ..],
                    |val, left, above, upper_left| -> u8
            val.wrapping_sub(paeth_predictor(left, above, upper_left))

The “!” on “filter_specialize!” indicates it’s a macro, not a regular function, and makes for a lot of fun! commentary! about! how! excited! Rust! is! like! Captain! Kirk! ;)

Rust macros can be very powerful, from simple token replacement up to and including code plugins to implement domain-specific languages… we’re doing something pretty simple, accepting a couple expressions and wrapping them up differently:

macro_rules! filter_specialize {
    ( $bpp:expr, $filter_closure:expr ) => {
            match $bpp {
                // indexed, greyscale@8
                1 => $filter_closure(1),
                // greyscale@16, greyscale+alpha@8 
                2 => $filter_closure(2),
                // truecolor@8
                3 => $filter_closure(3),
                // truecolor@8, greyscale+alpha@16
                4 => $filter_closure(4),
                // truecolor@16
                6 => $filter_closure(6),
                // truecolor+alpha@16
                8 => $filter_closure(8),
                _ => panic!("Invalid bpp, should never happen."),

The “macro_rules!” bit defines the macro using the standard magic, which lets you specify some token types to match and then a replacement token stream.

Here both $bpp and $filter_closure params are expressions — you can also take identifiers or various other token types, but here it’s easy enough.

Note that unlike C macros, you don’t have to put a “\” at the end of every line, or carefully put parentheses around your parameters so they don’t explode. Neat!

However you should be careful about repeating things. Here we could save $filter_closure in a variable and use it multiple times, but since we’re specializing inline versions of it that’s probably ok.

Note that things like match, if, and function calls can all inline constants at compile time! This means each invocation uses the exact-bpp inlined variant of the function.

Down on the assembly line

Looking at the “Paeth” image filter, which takes three byte inputs from different pixels… here’s a fragment from the top of the inner loop where it reads those pixel byte values:

movdqu (%rcx,%r9,1),%xmm1     ; read *left
movdqu (%rsi,%rbx,1),$xmm3    ; read *up
movdqu (%rsi,%r9,1),%xmm5     ; read *above_left

(Note we got our loop unrolled and vectorized by Rust’s underlying LLVM optimizer “for free” so it’s loading and operating on 16 bytes at a time here.)

Here, %rcx points to the current row and %rsi to the previous row. %rbx contains the loop iterator index, and %r9 has a copy of the index offset by -bpp (in this case -4, but the compiler doesn’t know that) to point to the previous pixel.

The version with a constant bytes-per-channel is able to use the fixed offset directly in x86’s addressing scheme:

movdqu (%rcx,%rbx,1),%xmm1    ; read *left
movdqu (%rsi,%rbx,1),%xmm5    ; read *above_left
movdqu 0x4(%rsi,%rbx,1),%xmm3 ; read *up

Here, %rbx has the previous-pixel index, and there’s no need to maintain the second indexer.

That doesn’t seem like an improvement there — it’s the same number of instructions and as far as I know it’s “free” to use a constant offset in terms of instructions per cycle. But it is faster. Why?

Well, let’s go to the end of the loop! The variable version has to increment both indexes:

add $0x10,%r9   ; %r9 += 16
add $0x10,%rbx  ; %rbx += 16
cmp %r9,%r14    ; if %r9 > len - (len % 16)
jne f30         ; then continue loop

but our constant version only has to update one.

add $0x10,%rbx ; %rbx += 16
cmp %rbx,%r8   ; if %rbx > len - (len % 16)
jne 2092       ; then continue loop

Saved one instruction in an inner loop. It’s one of the cheapest instructions you can do (adding to a register is a single cycle, IIRC), so it doesn’t save much. But it adds up on very large files to …. a few ms here and there.

The improvement was bigger earlier in the code evolution, when I was using manual indexing into the slices. :)

Macro conclusions

  • Always profile to see what’s slow, and always profile to see if your changes make a difference.
  • Use generics to vary functions and types when possible.
  • Consider macros to specialize code in ways you can’t express in the generics system, but check that the compiled output does and performs how you want!

I may end up removing the filter specialization macro since it’s such a small improvement now and it costs in code size. :) But it’s a good trick to know for when it’s helpful!

Next: interfacing Rust code with C!