Optimizing for AVX2

How we target CPUs to maximize speed across the software in our stack.

Thomas Daede
Vimeo Engineering Blog

--

Video engineer at Vimeo here. I previously worked on the AV1 video standard and started the rav1e open-source video encoder. As someone who specializes in the low-level details of video compression, I’m responsible for making sure Vimeo’s encoder software makes the most of our resources.

At Vimeo, our transcoder is based around many pieces of software that take advantage of special CPU instruction set extensions that have been added to each newer generation of x86_64 CPUs. These include SSE4, AVX, and AVX2. Utilizing these instructions enables video and audio codecs to operate on more than one pixel at a time (or more than one audio sample), greatly increasing speed.

Because support for these instruction sets in x86_64 is optional, most software has to be built in one of three ways: one, to match the minimum CPU of all possible users; two, to include multiple copies of the program built for a few different baselines of support; or three, to choose different code paths dynamically at runtime. Because simply assuming a modern CPU can cut out a lot of users (CPUs sold as recently as 2020 didn’t include AVX2), and building multiple copies multiplies build time and program size, most video software tends to use the dynamic method. However, at Vimeo, we know ahead of time what exact CPUs our virtual machine images will run on and can target them appropriately. Different software in our stack can take advantage of this in various ways.
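To make the difference between a compile-time baseline and runtime detection concrete, here is a minimal Rust sketch. It is not taken from any of the encoders discussed here, and the function names compiled_in_features and runtime_features are hypothetical. It lists which instruction sets the compiler was allowed to assume, and which ones the CPU actually reports when the program runs.

```rust
/// Instruction sets the compiler was allowed to assume for this build.
/// (Hypothetical helper, for illustration only.)
fn compiled_in_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "sse4.1") {
        features.push("sse4.1");
    }
    if cfg!(target_feature = "avx") {
        features.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    features
}

/// Instruction sets the CPU reports at runtime (x86 and x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if is_x86_feature_detected!("sse4.1") {
        features.push("sse4.1");
    }
    if is_x86_feature_detected!("avx") {
        features.push("avx");
    }
    if is_x86_feature_detected!("avx2") {
        features.push("avx2");
    }
    features
}

/// On other architectures there is nothing x86-specific to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_features() -> Vec<&'static str> {
    Vec::new()
}

fn main() {
    println!("assumed at compile time: {:?}", compiled_in_features());
    println!("detected at runtime:     {:?}", runtime_features());
}
```

A default x86_64 build will typically show an empty or short compile-time list and a longer runtime list. The gap between the two is what a target-specific build can close.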

Just keep in mind that, when it comes to measuring CPU time accurately, modern CPUs have many components that can be affected by the state of other parts of the computer or OS. This means that a single run can't measure small changes accurately. Fortunately, a tool called hyperfine can help. It runs the target program multiple times and accumulates statistics for a more accurate picture. For each of the following projects, the tests are run with hyperfine, and the plot_whisker.py script included with hyperfine is used to create the graphs.
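The idea of repeating a measurement and summarizing it can be sketched in a few lines of Rust. This is only an illustration of the principle, not hyperfine's implementation. The names time_runs and summarize, the warmup count, and the toy workload are all assumptions made for the example.

```rust
use std::time::Instant;

/// Summary statistics over repeated runs, in seconds.
struct Stats {
    mean: f64,
    stddev: f64,
    min: f64,
    max: f64,
}

/// Runs `work` a few times untimed (warmup), then times `runs` repetitions.
fn time_runs<F: FnMut()>(mut work: F, warmup: usize, runs: usize) -> Vec<f64> {
    for _ in 0..warmup {
        work();
    }
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_secs_f64()
        })
        .collect()
}

/// Computes mean, sample standard deviation, minimum, and maximum.
fn summarize(samples: &[f64]) -> Stats {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance (n - 1 in the denominator); zero for a single sample.
    let variance = if samples.len() > 1 {
        samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Stats { mean, stddev: variance.sqrt(), min, max }
}

fn main() {
    // Toy workload standing in for an encoder run.
    let samples = time_runs(
        || {
            let total: u64 = (0..1_000_000u64).map(|x| x.wrapping_mul(x)).sum();
            std::hint::black_box(total);
        },
        3,
        10,
    );
    let s = summarize(&samples);
    println!(
        "mean {:.6}s, stddev {:.6}s, min {:.6}s, max {:.6}s",
        s.mean, s.stddev, s.min, s.max
    );
}
```

Comparing two builds then means comparing their means relative to the spread, rather than comparing two single timings.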

Optimizing for rav1e

rav1e is the open-source AV1 video encoder used by Vimeo. It is written in Rust, and has dynamic CPU detection to call optimized functions written in intrinsics or assembly. There is no mechanism to remove the dynamic CPU detection for the assembly, so the jump tables used to call particular functions can’t be optimized away.
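The general shape of this kind of dispatch can be sketched as follows. This is a simplified illustration and not rav1e's actual code: the function names, the single-entry "table", and the use of a plain Rust loop in place of hand-written assembly are all assumptions. The point is that the call goes through a function pointer chosen once at runtime, which the compiler can't see through.

```rust
use std::sync::OnceLock;

/// Signature shared by every implementation in the (one-entry) table.
type SumFn = fn(&[i32]) -> i32;

/// Baseline version, compiled for the lowest supported CPU.
fn sum_generic(data: &[i32]) -> i32 {
    data.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

/// Same loop, but the compiler may use AVX2 inside this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2_impl(data: &[i32]) -> i32 {
    data.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

/// Safe wrapper so the AVX2 version fits the table's signature.
#[cfg(target_arch = "x86_64")]
fn sum_avx2(data: &[i32]) -> i32 {
    // Safety: select_sum only hands out this function after
    // runtime detection has confirmed AVX2 support.
    unsafe { sum_avx2_impl(data) }
}

/// Picks the best implementation for the CPU we are running on.
fn select_sum() -> SumFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return sum_avx2;
        }
    }
    sum_generic
}

/// The table entry, filled in on first use.
static SUM: OnceLock<SumFn> = OnceLock::new();

/// Public entry point: every call goes through the function pointer.
fn sum(data: &[i32]) -> i32 {
    let f = *SUM.get_or_init(select_sum);
    f(data)
}

fn main() {
    let data: Vec<i32> = (1..=100).collect();
    println!("sum = {}", sum(&data));
}
```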

rav1e also makes heavy use of Rust’s iterators, which can be autovectorized to use more powerful instruction sets. In this case, though, there is no dynamic CPU detection: the compiler can produce only one version of each function, and that version is limited to the lowest CPU support level.

A target-specific build is enabled through RUSTFLAGS. For example, to target a modern Intel or AMD CPU:

RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release

Alternatively, a target CPU can be specified:

RUSTFLAGS="-C target-cpu=znver3" cargo build --release

This additionally informs the optimizer about characteristics of that particular CPU. For example, consider the following function:

This shifts a 32-bit value right, with round-to-nearest (for example, when taking a fixed-point value and rounding it to the nearest integer). When we call it as follows:

the compiler generates the following assembly:

Each instruction here operates on four 32-bit values at once. The compiler also decided to unroll the loop four times, so each loop iteration actually consumes 16 values. Watch what happens when we instead target a CPU with AVX2:

It looks nearly the same, but notice that the registers are now called ymm rather than xmm. Each instruction now operates on eight 32-bit values at once, so each loop iteration consumes 32 values. The loop needs half as many iterations, which can make this function up to twice as fast!

Sometimes it takes more effort to make sure a function autovectorizes correctly. The examples above came from a pull request to improve that particular function.
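For readers who want to try this themselves, a rounding right shift of the kind discussed above can be sketched like this. The names round_shift and round_shift_slice, the exact signature, and the sample values are assumptions for illustration and may differ from the code in rav1e. The sketch also assumes that bit is less than 31 and that adding the rounding offset doesn't overflow.

```rust
/// Shifts `value` right by `bit` with round-to-nearest, by adding half
/// of the divisor (2^bit / 2) before shifting.
/// Assumes bit < 31 and that the addition does not overflow.
#[inline]
fn round_shift(value: i32, bit: u32) -> i32 {
    (value + ((1 << bit) >> 1)) >> bit
}

/// Applies the rounding shift to every element of a slice.
/// A simple loop like this is a good candidate for autovectorization.
fn round_shift_slice(values: &mut [i32], bit: u32) {
    for v in values.iter_mut() {
        *v = round_shift(*v, bit);
    }
}

fn main() {
    // Fixed-point values with 2 fractional bits, rounded to integers.
    let mut coeffs = [5, 6, 7, 8, -5, 1000, 1023, 1024];
    round_shift_slice(&mut coeffs, 2);
    println!("{:?}", coeffs);
}
```

Building this with and without a target-feature or target-cpu setting in RUSTFLAGS and inspecting the generated assembly is a quick way to see whether the loop vectorizes, and how wide the registers are.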

Testing on Zen 1 (a CPU without fast AVX2) shows a measurable improvement (see Figure 1).

Figure 1. This graph shows multiple percentage-point decreases in runtime for enabling AVX2, and a similar benefit to additionally targeting the znver3 architecture.

This nets about a 5 percent runtime decrease. This might not sound like much, but it’s quite good for not having to write a single line of code. It also potentially enables you to use AV1 for more videos, resulting in a better experience for the user.

Optimizing for Opus

In the case of libopus, the build system will normally output the contents of Figure 2 on x86_64.

Figure 2. Output of Opus build system with default options, including runtime CPU detection.

This means that libopus is building code using all of these instruction set extensions, but it must also check for them at runtime and use jump tables to call the appropriate routines.

Targeting a particular architecture is as easy as specifying it in CFLAGS:

CFLAGS="-g -O2 -march=znver3" ../configure

See Figure 3 for the output.

Figure 3. Output of Opus build system with CPU architecture specified, obviating the need for runtime CPU detection.

Be careful to keep the default -O2 in CFLAGS, because setting CFLAGS replaces the defaults. Otherwise you’ll end up with an unoptimized build!

libopus’s configure scripts are smart enough to determine that, because this targeted CPU supports all the instruction sets, it doesn’t need any runtime CPU detection at all. Figure 4 compares the default build, a build optimized for our CPU, and a build with the instruction sets enabled but otherwise tuned for a generic CPU:

Figure 4. This graph shows a very small performance loss for libopus when enabling AVX2, and a very small performance win when also enabling znver3 optimizations.

The resulting change is very small but measurable — about 1 percent. Interestingly, the binary targeting AVX2 but without the machine-specific targeting performs worse — something to investigate in the future.

Optimizing for x264

Unlike the previous two codebases, x264 has no way to bypass the CPU indirection. It also disables autovectorization because of a long history of bugs both in compilers and in x264 itself; autovectorization plays especially poorly with undefined behavior in C. In addition, as a mature codebase, most of its performance-critical code has been written in assembly. Therefore, a speed-up isn’t expected, and indeed, there is no real difference between x264’s defaults, enabling the extra instruction sets, and tuning for the particular CPU (see Figure 5).

Figure 5. This graph shows approximately equal performance for x264 regardless of the targeted CPU.

In summary

Video and audio compression is heavily reliant on the SIMD instruction sets provided by modern CPUs. Most libraries take advantage of this automatically, but it’s very important to verify that this is the case. In addition, some encoders can go further if they know the CPU at compile time, which provides a free performance win when the target CPU is known ahead of time. In particular, modern languages can increasingly take advantage of autovectorization, which saves development time but is reliant on compile-time settings. Testing the effect of those settings is easy with hyperfine, and changing them is an easy way to gain extra performance and reduce costs.

Care to join us?

Check out Jobs at Vimeo for our current offerings.
