VMAF FTW

Vimeo has long been known for the high level of video quality that we deliver to our viewers. But how do we verify that? Looking at videos and listening to customer feedback is helpful, but that’s anecdotal evidence that covers only a very small sample of all the videos uploaded to our platform. Wouldn’t it be great if there were some technique that could measure the quality of our videos automatically and give us a metric of how well we’re doing?

Being able to measure video quality objectively is so useful that work on such metrics stretches back to the earliest days of digital video. The current best performer is VMAF, developed by Netflix. This metric combines several previously known metrics via a learned model to produce a composite score that outperforms its individual components. In addition, it's one of the few metrics designed to produce useful results when comparing videos encoded at multiple resolutions.

In this post, we’ll go over the basics of Vimeo’s encoding pipeline, and how we integrated VMAF computation into it in an efficient way. We’ll also look at several ways to collect and analyze the resulting data, in order to continuously improve the quality of video that Vimeo delivers.

Looking at the video encoding pipeline at Vimeo

At Vimeo, when you upload a video, we perform four basic video processing steps:

  • Decoding. Your video is separated from its audio and decompressed from the uploaded format into raw pixel data.
  • Scaling. We encode your video into multiple profiles, each a combination of resolution and other encoder settings, to adapt to viewers' devices and bandwidth limitations (a rough sketch of such a profile ladder appears after this list). If the target profile requires it, the video is scaled.
  • Conversion/tonemapping. If needed, the video is converted into a color space widely supported by browsers. In addition, to support SDR devices, HDR videos may need to be tonemapped into an SDR color space at this step.
  • Encoding. Finally, the video is compressed into one of several possible formats, depending on the profile.
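To make the idea of a profile concrete, here's a minimal sketch of what a transcode ladder might look like. The names, resolutions, codecs, and bitrate caps are illustrative only (the 550 kbit/s figure for 360p echoes a number mentioned later in this post); Vimeo's real profiles differ.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """One rendition in the transcode ladder (illustrative values only)."""
    name: str
    width: int
    height: int
    max_bitrate_kbps: int  # the upper cap discussed later in this post
    codec: str

# A hypothetical ladder; the real profiles, caps, and codecs differ.
LADDER = [
    Profile("240p",   426,  240,  300, "h264"),
    Profile("360p",   640,  360,  550, "h264"),
    Profile("540p",   960,  540, 1200, "h264"),
    Profile("720p",  1280,  720, 2500, "h264"),
    Profile("1080p", 1920, 1080, 5000, "h264"),
]
```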

Ideally, we’d like to add a fifth step to our pipeline: computing VMAF scores. For doing the actual computation, let’s use [libvmaf](https://github.com/Netflix/vmaf/tree/master/libvmaf), an open-source library also released by Netflix. However, we also need to find the right way to integrate it into our stack.

The so-called easy way

The two inputs needed for libvmaf are the raw pixel frames of the original video and the frames of the encoded video. The most straightforward way to compute VMAF, then, is to take the original video and encoded video, decode them to raw pixels, and feed them to libvmaf. This is what is done in other tools such as the AreWeCompressedYet video codec testing framework, as shown in Figure 1.

Figure 1. A simple VMAF calculation pipeline.
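As a rough sketch of this offline approach, here is how you might drive it with ffmpeg's libvmaf filter, assuming an ffmpeg build with libvmaf support. The file names and log path are placeholders, input ordering conventions have varied across ffmpeg versions, and this is not the pipeline we ended up using.

```python
import subprocess

# Offline VMAF comparison via ffmpeg's libvmaf filter. In recent ffmpeg
# builds, the first input is the distorted video and the second is the
# reference; check your build's documentation.
cmd = [
    "ffmpeg",
    "-i", "encoded.mp4",   # distorted (the encode we want to score)
    "-i", "source.mp4",    # reference (the original upload)
    "-lavfi", "libvmaf=log_path=vmaf.json:log_fmt=json",
    "-f", "null", "-",     # discard the video output; we only want the log
]
subprocess.run(cmd, check=True)
```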

However, it’s a bit more complicated than that. Remember that our encoding step also might require color space conversion, scaling, and tonemapping. We must replicate that exactly when reading the source file again, to match the encoded video. So in practice, we get a pipeline like in Figure 2.

Figure 2. A VMAF calculation pipeline that includes a conversion step.
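Continuing the sketch above, the reference path now has to repeat whatever the encode saw. The scaler, its flags, and the pixel format below are placeholders that only approximate the production conversion step, and an HDR source would also need tonemapping in this chain.

```python
import subprocess

# The reference (second input) is scaled and converted before libvmaf so it
# matches what the encoder actually consumed.
filtergraph = (
    "[1:v]scale=640:360:flags=lanczos,format=yuv420p[ref];"
    "[0:v][ref]libvmaf=log_path=vmaf_360p.json:log_fmt=json"
)
subprocess.run([
    "ffmpeg",
    "-i", "encoded_360p.mp4",  # distorted 360p encode
    "-i", "source.mp4",        # original-resolution upload
    "-lavfi", filtergraph,
    "-f", "null", "-",
], check=True)
```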

Besides the added complexity of ensuring that our scaling and tonemapping match the version used when the video was encoded, this approach also requires fetching the source video and encoded video from cloud storage and re-decoding the source video. For these reasons, we chose another method for Vimeo.

Metrics during encode

Our solution to this problem isn't to compute the metrics after encoding but to compute them inline during the encoding process. After all, during the normal encoding process, we've already fetched the source video, decoded it, and converted it. We might as well compute metrics at that point too! The new process looks like Figure 3.

Figure 3. Computing VMAF during encode.

Using reconstruction from the encoder

We’ve now saved an entire decode and scaling step, but we can do better. The decode step is actually unnecessary, because modern encoders already know the raw pixel values that the decoder will produce. This is because video codecs predict pixel data from previous frames, so video encoders must internally generate and store decoded frames. x264 and x265 already have APIs to fetch this data, known as the reconstruction, and we added an equivalent API to rav1e. This lets us skip the decoding step entirely, as shown in Figure 4.

Figure 4. Using the encoder reconstruction for VMAF computation.
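Here is a minimal sketch of what the inline loop looks like once the reconstruction comes straight from the encoder. The classes are hypothetical stand-ins for the encoder wrapper and the libvmaf binding, not real APIs; the point is only that nothing has to be re-decoded.

```python
# Hypothetical stand-ins; the real pipeline wraps x264/x265/rav1e and libvmaf.

class Encoder:
    """Pretends to encode a frame and hand back the encoder's own
    reconstruction of it. Real encoders return both only after a delay."""
    def encode(self, frame):
        packet = b"\x00"        # placeholder bitstream data
        reconstruction = frame  # placeholder: what a decoder would output
        return packet, reconstruction

class VmafScorer:
    """Collects (reference, distorted) frame pairs for scoring."""
    def __init__(self):
        self.pairs = []
    def push(self, reference, distorted):
        self.pairs.append((reference, distorted))

def encode_with_metrics(frames):
    """frames are already decoded, scaled, and converted for this profile."""
    encoder, scorer = Encoder(), VmafScorer()
    packets = []
    for frame in frames:
        packet, recon = encoder.encode(frame)
        packets.append(packet)
        # No extra decode: the reconstruction is exactly what a decoder
        # would produce for this packet.
        scorer.push(frame, recon)
    return packets, scorer
```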

Unfortunately, it’s not quite that simple. One issue is encoder lookahead. Most encoders, when given an input frame, don’t produce encoded output right away; there’s a lag of up to around 70 frames so that the encoder can plan its bitrate allocation. Therefore, we need to hold onto decoded source frames and match them up with the reconstructed frames from the encoder when they eventually come out. We do that by attaching a pointer to the decoded source frame to the encoder’s input data and extracting it from the output, relying on the encoder to keep track of it for us. Figure 5 shows the addition of this queue.

Figure 5. Using a queue to handle encoder delay.
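A sketch of that bookkeeping, reusing the hypothetical scorer above and assuming each encoder output can be matched back to its input frame number, which is effectively what attaching per-frame user data to the encoder input gives you:

```python
class MetricsQueue:
    """Holds decoded source frames until their reconstructions appear."""
    def __init__(self, scorer):
        self.pending = {}   # input frame number -> decoded source frame
        self.scorer = scorer

    def on_input(self, frame_number, source_frame):
        # Called when a frame is sent to the encoder.
        self.pending[frame_number] = source_frame

    def on_output(self, frame_number, reconstruction):
        # Called when the encoder emits a packet and its reconstruction,
        # typically dozens of frames later because of lookahead.
        source_frame = self.pending.pop(frame_number)
        self.scorer.push(source_frame, reconstruction)
```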

The downside of this approach is significant extra RAM usage to hold all of these in-flight frames. That usage is currently acceptable at Vimeo, but there are future improvements that could reduce it.

There’s something else to consider as well. One quirk of libvmaf is that frames must be inserted in order. This is because VMAF has a temporal component, and for libvmaf to compute scores correctly, it keeps track of the previous frame in the order that frames were inserted. Our implementation up to this point is enough for encoding with rav1e: although AV1 encoders can reorder frames, they do so with internal codec primitives, so the final output comes out in order. However, for x264 and x265, we must go a step further and add an additional reordering queue, which consumes a small amount of extra memory. Figure 6 shows the previous lookahead queue combined with a reordering queue.

Figure 6. Using a queue to handle reordering as well.
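A sketch of that extra buffer, again with hypothetical names: reconstructions arriving in coding order are held until the next display-order frame is available, so the scorer always sees pairs in display order.

```python
class ReorderQueue:
    """Re-emits (source, reconstruction) pairs in display order."""
    def __init__(self, scorer):
        self.buffer = {}        # display frame number -> (source, recon)
        self.next_to_emit = 0
        self.scorer = scorer

    def push(self, frame_number, source_frame, reconstruction):
        self.buffer[frame_number] = (source_frame, reconstruction)
        # Flush every contiguous frame starting from the next one due.
        while self.next_to_emit in self.buffer:
            src, rec = self.buffer.pop(self.next_to_emit)
            self.scorer.push(src, rec)
            self.next_to_emit += 1
```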

VMAF for adaptive streaming

VMAF has another trick up its sleeve that we can take advantage of. Unlike earlier video quality metrics, VMAF was designed to compare videos of different resolutions. Up to now, we’ve been computing VMAF against the scaled source video: for example, if the user uploaded 1080p but we’re encoding the 360p version, the VMAF score tells us how well our encode matches the scaled-down 360p source.

While this is incredibly useful by itself, it would also be nice to compare versions of different resolutions on the same scale, say to get an idea of how much 360p degrades quality compared to 540p. To do this, we must compare against the source resolution, not the encoded resolution. Besides including an original-resolution frame pointer with the data we send to the encoder, we must also scale the reconstruction up to match before passing it to libvmaf. See Figure 7 for the resulting pipeline.

Figure 7. Scaling steps for adaptive streaming.

Note that format conversion and tonemapping are still required. Because these steps are performed by an optimized conversion pipeline (powered by zimg), it’s simplest to run that pipeline a second time, without the scaling step. Finally, the reconstruction is scaled up between the encoder and the reorder queue.
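A sketch of where that upscale fits, with a hypothetical `upscale` helper standing in for the zimg-based resampler; the converted but unscaled source frame rides along with the encoder input just as before.

```python
def upscale(frame, width, height):
    """Hypothetical stand-in for a zimg-style resampler."""
    return frame  # real code would resample the pixel planes

def score_at_source_resolution(source_frame, reconstruction,
                               source_width, source_height, scorer):
    # The reconstruction is at the encode resolution (say, 360p); bring it
    # back up to the source resolution before handing the pair to libvmaf.
    rec_full = upscale(reconstruction, source_width, source_height)
    scorer.push(source_frame, rec_full)
```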

Complexities, complexities

Because of the added computational complexity of VMAF (especially when scaled to original resolution), we currently sample VMAF on a random subset of all encodes. The results are logged to a time-series database, which enables us to create Grafana dashboards to show results over time. The results are additionally logged into a BigQuery database, which allows more complex queries to be performed. Let’s take a look at some of the more complex queries below.

VMAF at encode resolution

To start with, let’s look at the results computed at the encode resolution; that is, comparing each encode against the already-scaled input. In theory, we should get about the same quality across all of our profiles no matter the resolution, and indeed we do, as Figure 8 shows.

Figure 8. VMAF scores by profile.

Pretty nifty! However, this doesn’t tell the whole story. The average is only one way to look at a dataset (see Figure 9).

Figure 9. VMAF score percentiles.
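As a hedged example of the kind of BigQuery query that could sit behind a percentile breakdown like Figure 9 (the table and column names here are made up, not the real schema):

```python
# Percentiles of VMAF per profile, using BigQuery's APPROX_QUANTILES.
# `project.dataset.vmaf_results`, `profile`, and `vmaf_score` are
# hypothetical names.
PERCENTILES_BY_PROFILE = """
SELECT
  profile,
  APPROX_QUANTILES(vmaf_score, 100)[OFFSET(5)]  AS p05,
  APPROX_QUANTILES(vmaf_score, 100)[OFFSET(25)] AS p25,
  APPROX_QUANTILES(vmaf_score, 100)[OFFSET(50)] AS p50,
  APPROX_QUANTILES(vmaf_score, 100)[OFFSET(95)] AS p95
FROM `project.dataset.vmaf_results`
GROUP BY profile
ORDER BY profile
"""
```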

Now the profiles start to differentiate themselves a bit. We can see that some of the profiles, like 360p, 1440p, and 4K, have worse outliers than the rest. Why is this? The answer can be found in the tradeoffs we have to make for encoding.

Ideally, we’d target every profile at every resolution to be the same quality, regardless of what the video content looks like. However, in practice, some video content can take a much higher bitrate than others to look just as good. If we allowed huge variations in video bitrate, it would make the job of the bandwidth controller in the video player much more difficult. Sure, your internet is fast enough to play 720p now, but what if the next minute of video suddenly spikes to double the bitrate?

In order to keep this under control, we also impose an upper bitrate cap on each profile. When we hit that cap, video quality starts to decrease. In fact, if we look closer at the 360p profile and plot the actual bitrate versus the VMAF score (see Figure 10), we can pick out the knee where the bitrate cap is hit (at 550 kbit/s) and the VMAF score plummets.

Figure 10. The VMAF score scatterplot for the 360p profile.

While we can’t remove the bitrate cap entirely due to the aforementioned bandwidth controller, we can at least normalize it across the profiles. Figure 11 shows our updated 360p profile, with an increased bitrate cap.

Figure 11. VMAF scores before and after the profile update.

Note how only the lowest percentiles are affected. The median remains constant.

VMAF for adaptive streaming

For the adaptive streaming case, where we scale the decoded video back up to the source resolution, we don’t get similar VMAF scores for the different renditions. In fact, we’d expect a quite low VMAF score for our lowest resolution profile (240p), and then upwards from there, in a nice curve. And that’s exactly what we get (see Figure 12).

Figure 12. Scaled VMAF scores across profiles.

Note how the worst profile (240p) is quite poor and the improvements are rapid, but by the highest resolutions, the improvements are significantly smaller. This is exactly what we expect: bumping from 1440p to 4K is a much smaller improvement visually than going from 240p to 360p.

However, analyzing in this way wastes the potential of using VMAF scaled to the source resolution. The big advantage is that now, VMAF scores are comparable across different renditions of the same video. This enables many interesting new queries. For example, you can query for cases where a 720p rendition scores better than the 1080p one (see Figure 13).

Figure 13. A SQL query for finding cases where a lower resolution might outperform a higher one.
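Since the original image behind Figure 13 isn't reproduced here, this is a hedged sketch of what such a query could look like, assuming a hypothetical one-row-per-rendition schema keyed by video id (not Vimeo's real schema):

```python
# Find ladder inversions: videos whose 720p rendition scores higher than
# the 1080p rendition when both are measured at the source resolution.
LADDER_INVERSIONS = """
SELECT
  lo.video_id,
  lo.vmaf_scaled AS vmaf_720p,
  hi.vmaf_scaled AS vmaf_1080p
FROM `project.dataset.vmaf_results` AS lo
JOIN `project.dataset.vmaf_results` AS hi
  ON hi.video_id = lo.video_id
WHERE lo.profile = '720p'
  AND hi.profile = '1080p'
  AND lo.vmaf_scaled > hi.vmaf_scaled
"""
```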

Obviously, we’d like this never to happen. The video ladder should always increase, and, in fact, most of the time it does. However, when we ran this query at Vimeo, we did find a few exceptions! Further investigation showed that these were due to some apps internally scaling from, say, 720p to 1080p before uploading to Vimeo. In these cases, there is no additional information to represent at the higher resolution; the additional bitrate helps, but not always. This issue can be corrected at the source, but there may be future work that we can do to detect such cases and adjust our transcode ladder accordingly.

In sum

Integrating objective metrics like VMAF into our production encoding system gave us a huge amount of extra insight into the performance of our transcoding. An efficient implementation lets us use as large a sample as possible. From simple regression testing to complex queries, it has been an incredibly useful tool.

This post only scratches the surface of what is possible with the data. Do you have any interesting ideas of your own to try with objective video quality metrics? We’d love to talk with you!
