Taking Advantage of Auto-Vectorization in Rust

Recently on a project I wrote some audio processing code in Rust. In the past I’ve used C++ to write audio processing code for situations where performance was critical. I wanted to take that C++ optimisation experience and see what is possible using Rust.

We’re going to take one small piece of audio processing code and take a look at how we can optimize it in Rust. Along the way we’re going to learn about optimisation using Single Instruction Multiple Data CPU instructions, how to quickly check the assembler output of the compiler, and simple changes we can make to our Rust code to produce faster programs.

Mixing Mono to Stereo

The function we’re going to optimize is taking a mono audio signal, represented as a vector of floating point values, and mixing it into a stereo audio signal containing left and right samples interleaved.

The simple Rust function takes two slices, a source mono signal and a mutable destination stereo signal

pub fn mix_mono_to_stereo(dst: &mut [f32], src: &[f32], gain_l: f32, gain_r: f32) {
    for i in 0..src.len() {
        dst[i * 2 + 0] = src[i] * gain_l;
        dst[i * 2 + 1] = src[i] * gain_r;
    }
}

The processing is very simple, but audio signals typically contain 48000 samples per second. So we are going to want this loop to be able to run as quickly as possible.

If we put this version of our function into a benchmarking tool we can get idea of how long it takes.

  Average time to process 100,000 samples
Initial Version 77.67 us

At this point we don’t know if we can make it faster, but we have our starting number.

SIMD for Optimization

One of the tools available when optimizing are Single Instruction Multiple Data (SIMD) CPU instructions. They differ from typical CPU instructions by the fact they are able to operate on four values at once, in a much shorter time than it takes to run the corresponding scalar instruction four times.

Each sample we process in our mixing loop is operated on in an identical way to all other samples. This makes our mixing loop a perfect candidate for optimization using SIMD instructions. We will be able to process four samples at once for the same CPU cost as a single sample.

One way to make sure our loop is implemented using SIMD intructions is to write it by hand using instrinsic functions. These are simple functions that map directly to CPU instructions, and are a way of guaranteeing what instructions will be generated by the compiler.

Rust includes a crate core::arch::x86_64 that provides functions the will be familiar to anyone who’s worked with intrinsics in C. All the intrinsic functions are marked unsafe as they work outside the memory safety provided by default in Rust. As the name of the crate suggests these intrinsics functions map to instructions for x86_64 CPU’s. Code that uses them will not compile on other platforms.

pub fn mix_mono_to_stereo_intrinsics_rust(dst: &mut [f32], src: &[f32], gain_l: f32, gain_r: f32) {
    assert_eq!(src.len() % 4, 0);
    assert_eq!(dst.len(), src.len() * 2);
    unsafe {
        let src_ptr = src.as_ptr();
        let dst_ptr = dst.as_mut_ptr();

        // create SIMD variables with each element set to the same value
        // mul_l = |  gain_l |  gain_l |  gain_l |  gain_l |
        // mul_r = |  gain_r |  gain_r |  gain_r |  gain_r |
        let mul_l = _mm_set1_ps(gain_l);
        let mul_r = _mm_set1_ps(gain_r);

        // process the source samples in blocks of four
        let mut i = 0;
        while i < src.len() {
            // load 4 of our source samples
            // input = | src(i + 0) | src(i + 1) | src(i + 2) | src(i + 3) |
            let input = _mm_loadu_ps(src_ptr.add(i));

            // multiply each of the four input values by the left and right volumes
            // we now have two variables containing four output values is each
            // out_l = | src(i + 0) * gain_l | src(i + 1) * gain_l | src(i + 2) * gain_l | src(i + 3) * gain_l |
            // out_r = | src(i + 0) * gain_r | src(i + 1) * gain_r | src(i + 2) * gain_r | src(i + 3) * gain_r |
            let out_l = _mm_mul_ps(input, mul_l);
            let out_r = _mm_mul_ps(input, mul_r);

            // re-arrange the output values so that each left-right pair is next to each other
            // out_lo = | src(i + 0) * gain_l | src(i + 0) * gain_r | src(i + 1) * gain_l | src(i + 1) * gain_r |
            // out_hi = | src(i + 2) * gain_l | src(i + 2) * gain_r | src(i + 3) * gain_l | src(i + 3) * gain_r |
            let out_lo = _mm_unpacklo_ps(out_l, out_r);
            let out_hi = _mm_unpackhi_ps(out_l, out_r);

            // write the four output samples (8 f32 values) to our destination memory
            _mm_storeu_ps(dst_ptr.add(2 * i + 0), out_lo);
            _mm_storeu_ps(dst_ptr.add(2 * i + 4), out_hi);

            i += 4;
        }
    }
}

If we put this version into our benchmarking tests we can see the performance improvement.

  Average time to process 100,000 samples
Initial Version 77.67 us
Rust Intrinsics 25.781 us

We’ve gotten a 3x performance improvement. But there are downsides. We’ve added a lot more complexity to our function, introduced unsafe code, and our performance gains are limited to x86_64 platforms.

Is there another option for taking advantage of the performance gains from SIMD?

Auto-Vectorization

Auto-Vectorization refers to the compiler being able to take a loop, and generate code that uses SIMD instructions to process multiple iterations of the loop at once.

Not every loop is able to be vectorized. There may not be a way to express the code in the loop using the available SIMD instructions on the target CPU. Also the compiler has to be able to prove that the SIMD version has exactly the same behavior as the scalar version. In Rust this includes obeying all the type and memory safety requirements. However those same type and memory requirements of Rust can also aid the compiler in being able to auto-vectorize.

First Attempt at Taking Advantage of Auto-Vectorization in Rust

Our first attempt at auto-vectorization is to take our simple safe Rust function

pub fn mix_mono_to_stereo(dst: &mut [f32], src: &[f32], gain_l: f32, gain_r: f32) {
    for i in 0..src.len() {
        dst[i * 2 + 0] = src[i] * gain_l;
        dst[i * 2 + 1] = src[i] * gain_r;
    }
}

we can then use the very helpful Godbolt compiler explorer site to preview what the assembler output of this Rust code is.

(This is generated from the 1.43 compiler with the -O argument to turn on optimizations)

example::mix_mono_to_stereo_1:
        push    rax
        test    rcx, rcx
        je      .LBB0_5
        mov     r8, rsi
        xor     esi, esi
.LBB0_2:
        cmp     rsi, r8
        jae     .LBB0_6
        movss   xmm2, dword ptr [rdx + 2*rsi]
        movaps  xmm3, xmm2
        mulss   xmm3, xmm0
        movss   dword ptr [rdi + 4*rsi], xmm3
        lea     rax, [rsi + 1]
        cmp     rax, r8
        jae     .LBB0_8
        mulss   xmm2, xmm1
        movss   dword ptr [rdi + 4*rsi + 4], xmm2
        add     rsi, 2
        add     rcx, -1
        jne     .LBB0_2
.LBB0_5:
        pop     rax
        ret
.LBB0_6:
        lea     rdi, [rip + .L__unnamed_1]
        jmp     .LBB0_7
.LBB0_8:
        lea     rdi, [rip + .L__unnamed_2]
        mov     rsi, rax
.LBB0_7:
        mov     rdx, r8
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

We don’t have to be know what every CPU instructions does or what value is contained in every register to be able to get an idea of how the compiler has turned our Rust code into machine code.

Looking through the disassembly we can see six sections, marked by the labels example::mix_mono_to_stereo_1, .LBB0_2, and .LBB0_5, to .LBB0_8.

The first section has a label with the name of our function, and is called the function prelude. It contains code to deal with our function arguments and setting up the stack. It also contains a check that if the length of the src slice is zero we can exit the function straight away.

.LBBO_2 is the contents of our loop. You can see the last two instructions are a cmp which is a comparison of two number, followed by jb which is a jump to the label .LBB0_2 based on the results of the comparison. This is how we can spot a loop in assembly.

If we examine the rest of the instructions in the .LBBO_2 block we can see instructions that show the compiler has only generated code to process a single input value each iteration of the loop. In particular we can see the instruction mulss which is short for Multiply Scalar Single-precision-float. When we wrote our SIMD version by hand we use the _mm_mul_ps instrinsics which generates the mulps instruction which is short for Multiply Packed Single-precision-float. As a naming convention SSE instructions that operate all four elements at once end in ps and those that only operate on the first element end in ss.

There is some more code in the loop body that gives us a hint about is stopping the compiler from generating code to process four values at once. On lines 9 and 16 we see a cmp followed by a jae which is a another jump instructions. The labels for the jump are .LBB0_6 and .LBB0_8 which we haven’t looked at yet. If we look there, along with label .LBB0_7 which contains a call to the function core::panicking::panic_bounds_check we can determine that this is code to panic if we write outside the bounds of our destination slice.

What’s stopping the compiler from vectorizing our loop is that it wants to check the slice bounds on every single write to the destination slice. So to do that it has to process only a single input value each iteration.

But we didn’t see any bounds checks in the input slice. That’s because in the Rust code our loop uses the range i in 0..src.len(). With that loop statement the compiler is able to known that we never read out of the bounds of our loop slice.

Second Attempt

Can we prove to compiler that all our writes to the destination slice are within bounds? Our second attempt looks like

pub fn mix_mono_to_stereo_2(dst: &mut [f32], src: &[f32], gain_l: f32, gain_r: f32) {
    let dst_known_bounds = &mut dst[0..src.len() * 2];
    for i in 0..src.len() {
        dst_known_bounds[i * 2 + 0] = src[i] * gain_l;
        dst_known_bounds[i * 2 + 1] = src[i] * gain_r;
    }
}

Here we create a new slice with bounds that are known to the compiler when it’s processing this function. Hopefully it will be able to prove based on this new slice being exactly twice as long as the source slice, no access using the index i * 2 + 0 or i * 2 + 1 could ever be out of bounds.

Unfortunately if we check the assembly output of the compiler it looks very similar to our first attempt. The loop body still uses the mulss to perform a single multiply at a time, and there is still a check on the bounds for each access of the destination slice.

Because the compiler has produced almost the same assembly we won’t benchmark this function.

Third Attempt

The way we index the destination slice must be too complicated for the compiler to prove that all accesses are in bounds.

It’s worth taking a step back and thinking about what the destination indexing code is expressing: The destination slice is twice as long as the input slice, where even indexed value represent the left channel of the signal, and odd indexed values represent the right channel.

That’s a lot of implicit assumptions about the structure of the slice. Let’s use the type system to make it more explicit to both the compiler and any future readers of our code who aren’t familiar with digital audio conventions.

#[repr(C)]
pub struct StereoSample {
    l: f32,
    r: f32,
}

#[repr(transparent)]
pub struct MonoSample(f32);

pub fn mix_mono_to_stereo_3(dst: &mut [StereoSample], src: &[MonoSample], gain_l: f32, gain_r: f32) {
    let dst_known_bounds = &mut dst[0..src.len()];
    for i in 0..src.len() {
        dst_known_bounds[i].l = src[i].0 * gain_l;
        dst_known_bounds[i].r = src[i].0 * gain_r;
    }
}

If we check the assembler output we can immediately see there is a lot more assembly generated. So much that I won’t break it all down. But scanning through we can see there are blocks containing the mulps, unpcklps and unpckhps instructions we used in the hand written SIMD version. This shows that the compiler has been able to vectorize our loop to process four samples at a time using SIMD.

If we examine the structure of the assembler blocks we can get an rough idea of what it has actually generated. It goes beyond our simple hand written version, and includes multiple stages for processing all samples.

The stages go in order:

  1. If the number of samples to process is odd, process a single sample
  2. Process two samples at time until the number of samples remaining is a multiple of four
  3. Process sixteen samples at time using loop unrolling and SIMD until the number of samples remaining is less than sixteen
  4. Process four samples at time using SIMD until no samples remain

If we add this version to our benchmarks we get

  Average time to process 100,000 samples
Initial Version 77.67 us
Rust Intrinsics 25.781 us
Rust Auto-Vectorization 25.535 us

So we’ve successfully been able to match the performance of the hand written intrinsics with simple Rust code.

We can also check the assembler output for an AArch64 target and see that we’re able to also produce code using Arm’s SIMD instruction. This is another advantage over using hand written SIMD which must be re-written for each target architecture.

Conclusion

By taking advantage of the Rust type system and slice bounds safety we are able to write simple code that produces optimized SIMD output equal to that written by hand.

Addendum 17-05-2020

We started the article with some non-idiomatic Rust code. It’s something more typical of C or C++, which is the way my brain still thinks about solving problems after many years of working in those languages.

However it was a good space to stay while we explored the effect that different code changes had on the assembler generated by the compiler. The C language’s origins as a “portable assembly” means that it’s easy to see the relationship between the input and the output of a compiler.

If we take our understanding of what allows the compiler to autovectorise the code we can ask the following question. Iterators never generate out-of-bounds accesses to a slice, so if we use them to write the loop will we get auto-vectorization?

The first version uses the chunks_exact_mut function to combine the left and right samples so they can be iterated in step with the source samples.

pub fn mix_mono_to_stereo_iter_1(dst: &mut [f32], src: &[f32], gain_l: f32, gain_r: f32) {
    for (dst_sample, src_sample) in dst.chunks_exact_mut(2).zip(src.iter()) {
         dst_sample[0] = src_sample * gain_l;
         dst_sample[1] = src_sample * gain_r;
    }
}

The assembler output shows vectorization, but the loop hasn’t been unrolled as aggressivly and at most only processes eight samples per iteration.

If we re-use our StereoSample and MonoSample types the code doesn’t need to worry about chunks.

pub fn mix_mono_to_stereo_iter_2( dst: &mut [StereoSample], src: &[MonoSample], gain_l: f32, gain_r: f32, ) {
    for (dst_sample, src_sample) in dst.iter_mut().zip(src.iter()) {
        dst_sample.l = src_sample.0 * gain_l;
        dst_sample.r = src_sample.0 * gain_r;
    }
}

The assembler output is same as our earlier third attempt in the vectorization and unrolling.

Both versions have similar benchmark results to our earlier final result.