Parallelizing PNG, part 6: Rust slices and loop optimization

Looking at interesting patterns and bottlenecks I discover working on a multithreaded PNG encoder in Rust

A common pattern in low-level code is passing around references to data buffers that are owned somewhere higher up the call chain. In C you either send a pointer and a length as a pair of parameters, or you send a pointer and have a convention like NULL-termination, and then you carefully read and write only within that region…

Take a slice

In Rust, you use a data type called a “slice”, which is a pointer+length pair but with safer semantics and some really nice syntactic sugar. :) A slice can be … sliced … out of any contiguous structure like a fixed-size array, a resizable vector, or another slice.
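For instance, a minimal sketch of slicing all three kinds of structures (the variable names are made up for illustration):

```rust
fn main() {
    let array = [1u8, 2, 3, 4, 5];       // fixed-size array
    let vector = vec![10u8, 20, 30, 40]; // resizable vector

    let a: &[u8] = &array[1..4]; // slice of an array
    let v: &[u8] = &vector[..2]; // slice of a vector
    let s: &[u8] = &a[1..];      // slice of another slice

    assert_eq!(a, &[2, 3, 4]);
    assert_eq!(v, &[10, 20]);
    assert_eq!(s, &[3, 4]);
}
```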

There are two big safety improvements over C pointers:

  • Rust’s compile-time borrow checker ensures that while a slice borrows the underlying data, no other code can mutate or free it out from under you, so you can’t end up with a dangling reference. (Imagine having a slice into a vector, and then the vector reallocates due to an append! Can’t happen.)
  • Access via index (my_slice[i]) is bounds-checked at runtime. Much like dereferencing a null pointer in C, accessing a slice out of bounds will probably* kill your process.

[Update: *Dereferencing null in C is “undefined behavior” and sometimes doesn’t crash, depending on the system and whether you’ve installed signal handlers. Rust’s “panics” are more defined in how they behave, and can in some cases be caught and recovered from. But by default, either is bad for you if you don’t handle it! ;)]
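A quick runnable sketch of that catchability, using the standard library’s std::panic::catch_unwind (the vector contents here are just toy data):

```rust
use std::panic;

fn main() {
    // Silence the default panic message so the caught panic is quiet.
    panic::set_hook(Box::new(|_| {}));

    let v = vec![1, 2, 3];
    // An out-of-bounds index panics; catch_unwind turns that panic
    // into an Err instead of killing the process.
    let result = panic::catch_unwind(|| v[99]);
    assert!(result.is_err());
}
```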

“But wait!” I hear you say. “Bounds checks at runtime are sslloowww!” Well there’s ways around that. :)

Bounds checkin’

So what is indexing into a slice anyway? The index value is a “usize” (an unsigned, pointer-sized integer), which behind the scenes is added to the underlying pointer to produce the final pointer. So we can think of my_slice[i] = x as doing something like this behind the scenes:

*my_slice.as_mut_ptr().add(i) = x;

With the bounds check, it looks something like:

if i < my_slice.len() {
    *my_slice.as_mut_ptr().add(i) = x;
} else {
    panic!("Out of bounds");
}

Note that you don’t need to check for i >= 0 because it’s an unsigned type!

But what about in a loop? Won’t that check slow tight loops down?

for i in 0 .. my_slice.len() {
    if i < my_slice.len() {
        *my_slice.as_mut_ptr().add(i) = x;
    } else {
        panic!("Out of bounds");
    }
}

That looks kind of redundant, right? Isn’t the loop already checking that i < my_slice.len() on every iteration? In fact it is… and in an optimized build, the optimizer can prove the check always passes and remove it entirely!

Don’t be afraid to let the optimizer do your work for you. Rust’s default immutability and ownership semantics mean that many patterns like this improve dramatically in an optimized build, while the code stays straightforward to read, easy to refactor, and fast.

Iterators

Using a for loop with an index range isn’t always considered good style in Rust, both because of those bounds checks and because iterators are far more flexible, since they work with data structures other than slices too.

An iterator version of that little loop would start out as:

for elem in my_slice.iter_mut() {
    *elem = x;
}

You call iter_mut() to get an iterator of mutable references, or iter() for immutable ones. Each pass through the loop gives you a reference to an element, which you can read or write as appropriate.

For a slice that essentially compiles down to the same as the for loop with an index range, but without needing the intermediate check even in unoptimized builds.
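If you do still need the index inside an iterator loop, enumerate() hands it to you alongside each element. A small sketch with made-up data:

```rust
fn main() {
    let mut buf = [0u8; 4];
    // enumerate() pairs each element with its index, so you get both
    // without a separate indexed lookup (and its bounds check).
    for (i, elem) in buf.iter_mut().enumerate() {
        *elem = (i as u8) * 2;
    }
    assert_eq!(buf, [0, 2, 4, 6]);
}
```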

Cheating

You can also use the “unsafe” get_unchecked and get_unchecked_mut functions to get a reference to an indexed value without the bounds check! But you have to wrap the call in an “unsafe” block, because Rust makes you label stuff like that. :D

for i in 0 .. my_slice.len() {
    unsafe {
        *(my_slice.get_unchecked_mut(i)) = x;
    }
}

Multiple slices and the optimizer

In mtpng I found a case where I had to use indexing instead of iterators because I was working with multiple slices in sync, which introduced several bounds checks.

I found that adding validation checks that the lengths were all the same actually made all the bounds checks disappear, doubling the speed of the tight loop and improving overall encode speed by over 25%.

Without the validation, the function looked something like this:

fn filter_iter<F>(bpp: usize, prev: &[u8], src: &[u8], out: &mut [u8], func: F)
    where F : Fn(u8, u8, u8, u8) -> u8
{
    for i in 0 .. bpp {
        let zero = 0u8;
        out[i] = func(src[i], zero, prev[i], zero);
    }
    for i in bpp .. out.len() {
        out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
    }
}

With the checks added at the top, before the inner loop:

fn filter_iter<F>(bpp: usize, prev: &[u8], src: &[u8], out: &mut [u8], func: F)
    where F : Fn(u8, u8, u8, u8) -> u8
{
    assert!(out.len() >= bpp);
    assert!(prev.len() == out.len());
    assert!(src.len() == out.len());

    for i in 0 .. bpp {
        let zero = 0u8;
        out[i] = func(src[i], zero, prev[i], zero);
    }
    for i in bpp .. out.len() {
        out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
    }
}

[Update: using the assert! macro is better style than manually calling panic! in your high-level code. Note that assert! checks are always present in both debug and release builds; use the debug_assert! macro for checks that aren’t necessary for safety or performance.]

At runtime those extra checks at the top should never trigger, because all three slices are the same length and bpp is never larger than the length. But the optimizer didn’t know that! Making the invariant explicit in the code, instead of just hoping it was right, lets the optimizer turn all of this:

for i in bpp .. out.len() {
    if i < src.len() {
        if i < prev.len() {
            if i < out.len() {
                out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
            } else {
                panic!("Out of bounds");
            }
        } else {
            panic!("Out of bounds");
        }
    } else {
        panic!("Out of bounds");
    }
}

Into this with no bounds checks:

for i in bpp .. out.len() {
    out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
}

Pretty neat, right?
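To sanity-check the asserted version, here’s a runnable sketch driving it with PNG’s “Sub” filter (current byte minus the byte bpp to the left); the byte values are made-up test data, not real image rows:

```rust
// Same asserted filter_iter as above.
fn filter_iter<F>(bpp: usize, prev: &[u8], src: &[u8], out: &mut [u8], func: F)
    where F: Fn(u8, u8, u8, u8) -> u8
{
    assert!(out.len() >= bpp);
    assert!(prev.len() == out.len());
    assert!(src.len() == out.len());

    for i in 0..bpp {
        out[i] = func(src[i], 0, prev[i], 0);
    }
    for i in bpp..out.len() {
        out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
    }
}

fn main() {
    let prev = [10u8, 20, 30, 40];
    let src = [5u8, 15, 25, 35];
    let mut out = [0u8; 4];
    // PNG's "Sub" filter ignores the previous row entirely.
    filter_iter(1, &prev, &src, &mut out, |cur, left, _up, _up_left| {
        cur.wrapping_sub(left)
    });
    assert_eq!(out, [5, 10, 10, 10]);
}
```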

zip and izip!

Update: The above case can also be rewritten with iterators by “zipping” multiple iterators together.

If you only have two iterators, you can use the standard library’s zip method; if you have more, you can use the izip! macro from the itertools crate.
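For the two-iterator case, a minimal sketch with the standard library’s zip (toy data, not the mtpng filter code):

```rust
fn main() {
    let src = [1u8, 2, 3, 4];
    let mut out = [0u8; 4];
    // zip walks both slices in lock step and stops at the shorter one,
    // so no per-element bounds checks are needed.
    for (dest, s) in out.iter_mut().zip(src.iter()) {
        *dest = s + 1;
    }
    assert_eq!(out, [2, 3, 4, 5]);
}
```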

This ends up with code that can be a bit verbose, but it optimizes cleanly:

let len = out.len();
for (dest, cur, left, up, above_left) in
    izip!(&mut out[bpp ..],
          &src[bpp ..],
          &src[0 .. len - bpp],
          &prev[bpp ..],
          &prev[0 .. len - bpp]) {
    *dest = func(*cur, *left, *up, *above_left);
}

[Update: I was able to confirm that careful use of izip! slightly outperforms indexing plus voodoo assertions, removing another instruction or two per inner loop iteration. If you can write sanely that way, it works nicely! Won’t work if you need random access to the various slices, but for this kind of lock-step iteration it’s perfect.]

Debug vs release builds

The Rust compiler and the Cargo package manager default to unoptimized debug builds unless you tell them to make a release build.

This sounds good, except the entire Rust standard library is built on the same patterns of safe, clean code that optimizes well… For mtpng I’m seeing a 50x slowdown in unoptimized debug builds versus optimized release builds. Yeeeooooowwwwch!

Note that you can change the optimization level for your own debug builds in your project’s Cargo.toml file, which can help; you can crank it all the way up to a release build’s optimization level or leave it somewhere in the middle.
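For example, in Cargo.toml (opt-level 2 here is just one possible middle ground):

```toml
# Optimize debug builds somewhat while keeping debug assertions on.
[profile.dev]
opt-level = 2
```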