Riding the dragon

Logo for the Falkor project, a white and grey dog-dragon, stylized with macro blocks.

There’s a new sheriff down Vimeo way, if by sheriff you mean Falkor, our next-gen transcoding infrastructure (and favorite fantasy 1980s dog dragon). It’s faster and more reliable than what came before, and its cloud-native nature brings us into the future with style. I’m here to tell you all about it.

First, some history

Falkor is replacing a transcoding stack called Tron, which dates all the way back to 2013. Tron has the following characteristics:

It outputs progressive MP4 with both audio and video.
It downloads the whole source file locally, transcodes it to the desired profile, and uploads the result to cloud storage.

Tron was designed for the pre-cloud Vimeo, back when we were running our own data center (although it does take advantage of Spot instances, which help us to optimize costs). These days we’re in the cloud with Google Cloud.

Even though Tron is ten years old, we haven’t completely retired it yet. It still handles some edge cases that Falkor can’t manage; more on that in a little while.

Why Falkor?

The earliest seeds of the new infrastructure can be traced to 2011, before Tron was even a thing. But we started working on it in earnest in late 2019 with several goals in mind.

To input from the original source file

This is a continuation of what we did with Tron. Using the original source file as the input of transcodes whenever possible maximizes the quality of the output. Otherwise we’re stuck with mezzanine files; these are already transcoded, and given that transcoding is a lossy process, using the mezzanine as a source for other transcodes results in worse video quality compared to using the original source directly.

To have parallelized and distributed transcoding

Parallelized and distributed transcoding cuts the video into smaller chunks that then get transcoded on our servers. When all the transcodes are done, the chunks are combined to create the final output (see Figure 1). This makes transcoding faster and more resilient to errors.

A flowchart showing what distributed transcoding looks like: take the input, chunk it, transcode each chunk in parallel, concatenate all the transcoded chunks into a final output.

Figure 1. A parallelized and distributed transcoding process.

We wanted our new infrastructure to use the much cheaper but ephemeral Spot instances that extended the lifespan of good old Tron. These instances don’t come with guaranteed capacity, but they cost us significantly less, often half the price or less, and help us to react to quick changes in available resources. In this context, using Spot instances means that some transcoding work is canceled midway. But with parallelized and distributed transcoding, only a small part of the video needs to be retried, not the entire video.

Further, in the cloud, service providers including Google make you pay for your instances by the second (after the first minute). Running one instance for an hour or ten instances for six minutes each costs roughly the same, yet the parallel transcoding takes substantially less time.

To be cloud-native

Tron was designed for the pre-cloud Vimeo, and it sits in the cloud rather uncomfortably as built, so this one was an obvious choice. If we’re going to be in the cloud, we might as well take full advantage of what our cloud provider offers us. That means less infrastructure to manage ourselves, although we also want to avoid vendor lock-in where we can. If we rely too much on the intricacies of our current cloud vendor, it becomes harder to migrate to another cloud vendor if we ever decide to do that in the future.

To leverage Spot instances as much as possible

As I mentioned before, this reduces costs without appreciably impacting the transcoding time.

To store audio and video separately and have fragmented MP4 output

Separate audio and video outputs give us easier access to the audio and video streams.

If audio is extracted and stored only once instead of getting muxed, or merged with video, with every rendition, we make our packager’s job easier and save on storage costs. (Traditionally audio and video were stored in the same file, so the audio was effectively duplicated for each video quality or rendition.)

With fragmented MP4, the video is stored in a way that makes cutting a file into chunks easier, which also makes our packager’s job easier.

This does come with a tradeoff, though, in that it makes serving progressive files more difficult, since we have to mux them on the fly.

To be developed and deployed with our standard tooling

Using a similar set of tools (language, libraries, orchestration, and so on) as the team’s other services, along with the rest of the company, helps us to make sure that our infrastructure is more maintainable, that we can leverage work from other services and teams, and that we reduce complexity overall.

Tron relies on deprecated technologies such as Python 2.

Falkor from 20,000 feet

To assist in any high-level Falkor discussion, I’ve put together a couple of visuals. Figure 2 gives an overview of Falkor’s components and their interactions with other services.

High-level overview of Falkor: a client sends a request for a new transcode to the VideoAPI, which sends jobs to the FalkorAPI, which fans out the work to various workers (analyzers, audio/video transcoders, composers).

Figure 2. Falkor components.

Figure 3 shows the flow of a Falkor job from start to finish.

A detailed flowchart of a Falkor transcode, from the client to VideoAPI to FalkorAPI to the various workers, and then eventually back to the client when everything is done.

Figure 3. Falkor flowchart.

Here’s the Falkor transcode flow in more detail.

Step 1

A client tells our video API to transcode a video with a list of profile sets. The exact list of profile sets might depend on the use case. For example, not all videos get AV1.

Step 2

Our video API does some checks, gets the source location for the video, and asks the Falkor API to run an analysis job. This returns metadata such as duration, codecs, frame rate, whether the video is in HDR, and so on, which is put in cloud storage for reuse in subsequent transcoding jobs.

Step 3

The video API receives the metadata from the analysis job and determines which audio and video transcode profiles should be run: which resolutions, in HDR or not, and so on. Each of these profiles gets its own new Falkor API job.

Audio jobs are run on an audio transcode worker, which transcodes the source’s audio without downloading it locally and uploads the transcode to cloud storage.

Video jobs are a little more complicated. Based on the user’s uploaded source video’s index and other metadata, the Falkor API determines where a video can be split, ideally into chunks of roughly one minute in duration. If the video can’t be split, we fall back to transcoding it as a single chunk in Tron (for now; details below). Each of these chunks is transcoded in parallel by various video transcode workers, which get the required byte range from the source file for their assigned portion of the video, and then upload the result to cloud storage.

Once all the chunks have been processed, the Falkor API creates a final compose task that’s handled by a compose worker. Its job is to generate the video’s header, including moov and SIDX, based on the headers of the various chunks, and then concatenate this header with all the chunks and store the reassembled video in its final destination. With our cloud provider (more on that below), the concatenation is just an API call to cloud storage.

Step 4

Finally, upon completion, the Falkor API tells the video API that the job is done. The video API adds the new audio or video file to our video management system and informs the client of the good news.

The notifications for each individual job let the client decide how to act based on business logic. For example, the client can allow a video to be playable as soon as one of its H.264 video components and one of its AAC audio components are ready and in hand; or the client can hold off on playback a little longer and wait for all transcodes to be done. In the meantime, it can trigger extra processing tasks, such as the generation of thumbnail images.
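The first of those policies could look something like the following Go sketch on the client side. The Component type and the playable function are hypothetical; they only illustrate the rule of “one H.264 video plus one AAC audio,” not our actual client code.

```go
package main

import "fmt"

// Component is a finished transcode reported by a job notification.
// (Hypothetical shape, for illustration only.)
type Component struct {
	Kind  string // "video" or "audio"
	Codec string // for example "h264", "hevc", "aac", "opus"
}

// playable reports whether at least one H.264 video component and at least
// one AAC audio component are ready.
func playable(ready []Component) bool {
	var hasVideo, hasAudio bool
	for _, c := range ready {
		switch {
		case c.Kind == "video" && c.Codec == "h264":
			hasVideo = true
		case c.Kind == "audio" && c.Codec == "aac":
			hasAudio = true
		}
	}
	return hasVideo && hasAudio
}

func main() {
	ready := []Component{{Kind: "audio", Codec: "aac"}}
	fmt.Println(playable(ready)) // audio only so far

	ready = append(ready, Component{Kind: "video", Codec: "h264"})
	fmt.Println(playable(ready)) // both components are in hand
}
```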

Let’s get technical

From a stack perspective, all of this is running in Google Cloud, on Kubernetes (GKE), in three U.S. regions. For queues, we use Pub/Sub. Falkor itself is written in Go, and the transcoders are written in C.

Falkor also makes use of Quickset, our job scheduler, which enables us to do two things that help keep our costs down:

It packs tasks efficiently into the workers’ available CPU and memory to minimize CPU idleness while still keeping some room in reserve for bursts.
It autoscales Kubernetes nodes and schedules tasks based on Spot-instance priority, which means that we fall back on non-Spot instances only when we really need to.

But for Quickset to distribute tasks efficiently, all the tasks on a given queue need to last about the same amount of time and use about the same amount of resources. To help bring this about, we enqueue tasks on different queues. Analysis tasks are queued by source size, because we don’t have a better approximation of the work involved. Audio tasks are queued by duration and codec; because we don’t chunk audio, durations vary wildly. Video tasks are queued by rendition and codec, since the durations of the video chunks are constant, at about one minute each, as I mentioned before.

How we rolled it out

Carefully, that’s how. We wanted to be able to iterate quickly, with the least amount of disruption for our users.

We started by sending only a small percentage of our H.264 240p transcodes to the new infrastructure, for the following reasons:

The rendition wasn’t exposed to our users in either the UI or the API, only to our own player or external playlists, so in case of an issue the impact would be minimal.
We could start getting traffic and tweaking the scaling without having to worry too much about user impact.
It gave us time to build and integrate a pipeline of fragmented audio and video data to recombined progressive files. To learn more about this, watch the London Video Tech talk from April 13, 2021.

We did some minor tweaking to fix a handful of bugs and sort out some scaling issues. Then, once 100 percent of 240p was going through the new infrastructure, we started sending it audio in AAC and Opus format. That gave us the chance to let it handle some traffic.

We then moved on to H.264 1080p, since this rendition enabled us to verify more easily that the visual quality was what we expected. This is our most user-viewed resolution, so we’d get feedback in a hurry if something was off. We had tested it over and over again internally, but whenever you’re dealing with user-generated content, there are always fun new edge cases.

After 1080p, we were confident about the new infrastructure both in terms of scale and output quality, so we gave it all the other H.264 renditions at once: 4K, 2K, 720p, 360p, and so on, with 360º videos and HEVC for HDR10 and Dolby Vision soon afterward.

Even as I write this, there are still more transcodes that we haven’t yet migrated:

Videos with a variable frame rate source. This is an edge case that we didn’t want to handle immediately, so we could more easily identify frame rate-related issues later on.
AV1. There weren’t any blockers here. We just didn’t want to do too much at once. We aren’t using AV1 except for a very small number of videos — our Staff Picks, as it turns out — and this format requires a little more effort on our end to sort out.
Sources that aren’t easily seekable on a network. For those, we need to fall back to downloading the source on disk first.

How it went for us

I’m going to say that it went well. That said, on top of a few video-related bugs, for which we sent patches upstream where appropriate, we ran into a few infrastructure issues.

First, we had to run the API and the workers in separate Kubernetes clusters, because GKE Ingress wouldn’t work if there were more than 1,000 nodes in the cluster. It’s worth noting that this isn’t a limitation anymore.

Second, with Google Cloud’s VPC-native cluster, each pod gets its own IP address, and because we’d have many, many pods, we didn’t want to peer this cluster with the rest of Vimeo’s infrastructure, since we’d have used a sizable portion of the 10.x.x.x internal IP range. We set up Cloud NAT to still be able to talk to other parts of the infrastructure, such as our observability services.

Third, a few parts of our state machine didn’t handle duplicate messages well. Google Cloud’s Pub/Sub has at-least-once delivery guarantees, but not exactly-once. We had to rewrite a few chunks of our code to be more resilient, and it’s now something we have in mind when writing new code.
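To show what being resilient to duplicates can mean, here’s a minimal Go sketch of an idempotent handler. It’s an in-memory toy under stated assumptions, not our state machine: the ChunkDone message and the Tracker type are hypothetical, and a real service would keep this state somewhere durable.

```go
package main

import (
	"fmt"
	"sync"
)

// ChunkDone is a hypothetical message saying that one chunk of a job finished.
type ChunkDone struct {
	JobID string
	Chunk int
}

// Tracker records finished chunks per job. Finished chunks are kept in a set,
// so a redelivered message can't be counted twice.
type Tracker struct {
	mu       sync.Mutex
	expected map[string]int
	done     map[string]map[int]bool
	composed map[string]bool
}

func NewTracker() *Tracker {
	return &Tracker{
		expected: map[string]int{},
		done:     map[string]map[int]bool{},
		composed: map[string]bool{},
	}
}

// Expect registers how many chunks a job has.
func (t *Tracker) Expect(jobID string, chunks int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.expected[jobID] = chunks
	t.done[jobID] = map[int]bool{}
}

// Handle returns true exactly once per job, when its last missing chunk
// arrives. Duplicates and messages for unknown jobs return false.
func (t *Tracker) Handle(m ChunkDone) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	want, ok := t.expected[m.JobID]
	if !ok || t.composed[m.JobID] {
		return false
	}
	t.done[m.JobID][m.Chunk] = true
	if len(t.done[m.JobID]) < want {
		return false
	}
	t.composed[m.JobID] = true
	return true
}

func main() {
	tr := NewTracker()
	tr.Expect("job-1", 2)
	msgs := []ChunkDone{{"job-1", 0}, {"job-1", 0}, {"job-1", 1}, {"job-1", 1}}
	for _, m := range msgs {
		fmt.Printf("chunk %d: start compose = %v\n", m.Chunk, tr.Handle(m))
	}
}
```

The handler answers “should I start the compose task now?” with yes only once, no matter how many times each message is delivered.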

Fourth, for availability, we run in three regions of the United States, which has some egress cost implications.

What we gained from it

Costs are down, and speed is up. Falkor is significantly cheaper than Tron would be under similar conditions. And we still have knobs to tweak to optimize it further.

Also, while Falkor comes with some overhead, which means that short videos can spend as much if not more time in the transcoding pipeline as they did under the old regime, longer videos are ready for playback much faster. Our users are happier, so it’s worth the tradeoff.