How We Built AI-Powered Subtitles at Vimeo

In the world of text, translation is largely viewed as a solved problem. If you take a transcript and feed it into a modern LLM, the output is often indistinguishable from human work. It catches nuances, fixes grammar, and flows fairly naturally.

But building a translation product for video adds a layer of complexity that pure text generation ignores: time.

At Vimeo, we leverage LLMs to generate subtitles for our users. And while getting the words right is the first hurdle, the actual product experience relies on something else entirely: Synchronization.

The core technical challenge isn’t just “What does this sentence mean in Japanese?” It is “How do we fit this meaning into the exact same three-second window as the original English speaker?”

Figure 1. Source and translation must stay in sync, line by line.

The conflict: Fluency vs. the timeline

Our subtitle system operates on a strict logic: it expects every line of source text to have a corresponding translated line to ensure proper timing. If the speaker talks from 0:05 to 0:08, the system expects text to fill that specific slot.
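
To make the constraint concrete, here is a minimal sketch of what a time-slotted subtitle track looks like. The Cue fields and names are illustrative assumptions, not our actual schema:

    from dataclasses import dataclass

    @dataclass
    class Cue:
        """One subtitle slot: a fixed time window plus the text shown in it."""
        start: float  # seconds, e.g. 5.0
        end: float    # seconds, e.g. 8.0
        text: str

    # The translated track must reuse the source cue boundaries one-for-one:
    # cue N in English and cue N in the target language share the same window.
    source_track = [
        Cue(5.0, 8.0, "Um, you know, I think that we're gonna get..."),
        Cue(8.0, 11.0, "...we're gonna remove a lot of barriers."),
    ]
    # A valid translation has exactly len(source_track) cues with identical timings.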

The problem is that LLMs are designed to be smart. When they detect a speaker stumbling over their words or using filler like um or you know, they tend to clean it up, merging fragmented thoughts into single, grammatically correct sentences (how aggressively they do this varies by model).

While this improves readability, it breaks the playback mechanics. Imagine a speaker says: “Um, you know, I think that we’re gonna get… [pause] …we’re gonna remove a lot of barriers.”

A traditional translation maps this one-to-one. But an LLM produces a single, fluent Japanese sentence: “We will be able to remove many barriers.”

Figure 2 shows what happens to the video player.

Figure 2. The blank screen bug: the LLM merged both lines into one, leaving the second slot empty.

Because the LLM effectively solved the sentence in the first line, the second time slot is left empty. To the viewer, the subtitles just disappear while the person keeps talking.

It gets harder: The geometry of language

This synchronization challenge becomes even more acute when you step outside of English or the Romance languages.

Different languages don’t just use different words; they organize thought in fundamentally different orders — often in ways that defy a linear timeline. We started calling this the geometry of language.

We identified specific linguistic patterns that consistently break time-syncing.

1. The German verb bracket (Satzklammer)

English is generally linear: “We are arming the partners with resources.” The action (arming) happens early in the sentence.

German, however, frequently places the conjugated verb in the second position and the infinitive or participle at the very end of the sentence. This creates a bracket that holds the entire thought together (see Figure 3).

Figure 3. The German verb bracket in action — splitting at the line boundary leaves the first subtitle grammatically incomplete.

The LLM resists splitting at the line boundary because a subtitle that ends mid-bracket looks like a syntax error.

2. The Hindi reversal (SOV Order)

Hindi follows a subject-object-verb (SOV) order, which is effectively the reverse of English (see Figure 4).

Figure 4. Hindi’s inverted order means the end of the English thought maps to the middle of the Hindi sentence, breaking timestamp alignment.

The LLM collapses the lines because the end of the English thought is the middle of the Hindi thought.

3. The Japanese compression (information density)

Japanese is often significantly more information-dense than English. Where a speaker might ramble through filler words and fragmented thoughts, a Japanese translation consolidates them into a single, grammatically tight sentence (see Figure 5).

Figure 5. Japanese compression where the LLM condenses four lines of filler into one clean sentence, leaving three empty slots.

The LLM correctly recognizes that the English filler adds no semantic value. It produces one clean sentence — but now the system has four time slots to fill and only enough text for one. The remaining three slots go blank while the speaker keeps talking.

The LLM collapses the lines because it’s optimizing for fluency, not for the subtitle grid.

To solve these issues, we couldn’t just ask the model to translate. We had to architect a system that respects these geometries while strictly adhering to the video’s timeline.

The solution: A split-brain architecture

We realized that asking a single LLM prompt to both translate fluently and keep the exact line count was a losing battle. The creative requirement (fluency) was constantly fighting the structural requirement (timing).

So, we stopped trying to do it all in one pass.

We re-architected the pipeline into a three-phase approach. Phase 1 prepares the context; Phases 2 and 3 form what we call the “split-brain” — effectively separating the creative translator from the strict timekeeper (see Figure 6).

Figure 6. The full pipeline — creative translation and structural sync as separate passes, with correction loops for the approximately 5 percent that fail.

Phase 1: Smart chunking (Giving the model context)

The first mistake many systems make is feeding the transcript line-by-line. If you feed an LLM just one line — arming the partners — it has zero context. It doesn’t know who is being armed or what they are being armed with.

Conversely, if you feed it the whole file, it hallucinates and loses track of where it is.

We built a chunking algorithm that finds the middle ground: scanning for sentence boundaries (periods, question marks, exclamation points) across at least ten punctuation systems and grouping text into logical thought blocks of typically 3–5 lines. The LLM always sees a complete thought before translating.
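
A minimal sketch of that grouping logic, assuming a simplified punctuation set (the production list covers far more scripts) and illustrative names:

    # Sentence-ending punctuation across several writing systems (illustrative subset).
    SENTENCE_ENDINGS = {".", "?", "!", "。", "？", "！", "।", "؟"}

    def chunk_transcript(lines, min_lines=3, max_lines=5):
        """Group subtitle lines into thought blocks that end on a sentence boundary."""
        chunks, current = [], []
        for line in lines:
            current.append(line)
            ends_sentence = line.rstrip()[-1:] in SENTENCE_ENDINGS
            # Close the chunk once it is big enough and a sentence just ended,
            # or once it hits the hard size cap.
            if (len(current) >= min_lines and ends_sentence) or len(current) >= max_lines:
                chunks.append(current)
                current = []
        if current:
            chunks.append(current)
        return chunks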

Phase 2: Translation (The creative pass)

We hand each chunk to the LLM with one instruction: translate for meaning. No line count enforcement. We want it to handle the German verb bracket naturally, reorder Hindi syntax correctly, and consolidate Japanese efficiently. Linguistic quality above all else.

The output is a chunk translation — fluent text that accurately represents the source, even if the structure has changed.
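
As a sketch of what this pass looks like, assuming a hypothetical call_llm helper that sends a prompt and returns text (the prompt wording is illustrative, not our production prompt):

    def translate_chunk(chunk, target_language, call_llm):
        """Phase 2: translate a whole thought block for meaning, with no structural constraints."""
        source_text = "\n".join(chunk)
        prompt = (
            f"Translate the following transcript excerpt into {target_language}.\n"
            "Prioritize natural, fluent phrasing that preserves the meaning.\n"
            "Do not worry about line breaks or line counts.\n\n"
            f"{source_text}"
        )
        return call_llm(prompt)  # one fluent block of translated text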

Phase 3: Line mapping (The synchronization pass)

Here’s where we reconcile fluency with timing. We take that fluent, grammatically correct block of text and feed it into a second LLM call — the line mapper.

The prompt here is purely structural with no concern for meaning, only line count. We essentially say: “Here are the original four English lines with timestamps. Here is the translated block. Break it back into four lines to match the source rhythm.”
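
A sketch of the line-mapping pass under the same assumptions as above; the optional feedback argument is used by the correction loop described later:

    def map_to_lines(source_lines, translated_block, call_llm, feedback=""):
        """Phase 3: split the fluent translation back into exactly len(source_lines) lines."""
        n = len(source_lines)
        numbered = "\n".join(f"{i + 1}. {line}" for i, line in enumerate(source_lines))
        prompt = (
            f"Here are {n} original subtitle lines, in order:\n{numbered}\n\n"
            f"Here is the translated text as one block:\n{translated_block}\n\n"
            f"Break the translation into exactly {n} lines that follow the source rhythm. "
            "Return one line per source line and nothing else."
            + feedback
        )
        return call_llm(prompt).splitlines()  # the caller verifies the line count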

By separating these concerns, we get the best of both worlds:

  • Phase 2 ensures the sentence is grammatically sound (solving the fluency problem).
  • Phase 3 ensures the timing is respected (solving the blank screen problem).

Here, a whopping 95 percent of chunks map perfectly on the first pass, which leads us to our design principle:

We stopped asking, “How do we make the LLM get it right the first time?” and started asking, “What happens when it doesn’t?”

That reframe shaped everything that follows.

The self-healing pipeline

The remaining five percent or so triggers a correction loop. When the line mapper returns a mismatch — one line of German when we asked for two — the system retries with explicit feedback about the error. Often the model finds a valid synonym or slightly less natural phrasing that respects the line count.
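
A sketch of that loop, reusing the hypothetical map_to_lines helper from the previous section:

    def map_with_retries(source_lines, translated_block, call_llm, max_retries=2):
        """Retry the line mapper with explicit feedback about the mismatch."""
        n = len(source_lines)
        feedback = ""
        for _ in range(max_retries + 1):
            mapped = map_to_lines(source_lines, translated_block, call_llm, feedback)
            if len(mapped) == n:
                return mapped
            # Spell out exactly what went wrong for the next attempt.
            feedback = (
                f"\n\nYour previous answer had {len(mapped)} lines, but exactly {n} are required. "
                "A synonym or slightly less natural phrasing is acceptable if it fixes the count."
            )
        return None  # the caller moves on to the simpler fallback strategies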

If that fails, we fall back to progressively simpler splitting strategies.

LLM fallback

We strip away the semantic instructions and give the model a bare-bones task: “Here’s one block of text. Split it into exactly N lines.” No translation quality checks, no concern for clause boundaries. The prompt explicitly states that semantic accuracy is not important — only the line count matters.

Deterministic fallback

If the LLM still can’t produce the right count, we stop asking models entirely. A rule-based algorithm takes over:

  • Empty lines? Fill them with the last valid content seen
  • Too few lines? Duplicate the last valid content to pad the count
  • Too many lines? Truncate from the end

The output in these edge cases is functional rather than perfect — you might see the same phrase repeated across a few subtitle slots. But every time slot gets filled, and the viewer never sees a blank screen.
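
Those three rules translate directly into a small, model-free function. A sketch (names are illustrative):

    def deterministic_fallback(mapped_lines, expected_count):
        """Force the translation into the expected slot count without any model calls."""
        last_valid = ""
        filled = []
        for line in (mapped_lines or []):
            if line.strip():
                last_valid = line
            # Empty lines are replaced with the last valid content seen.
            filled.append(line if line.strip() else last_valid)
        # Too few lines: duplicate the last valid content to pad the count.
        while len(filled) < expected_count:
            filled.append(last_valid)
        # Too many lines: truncate from the end.
        return filled[:expected_count]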

The cost of self-healing

Because there’s always a cost:

  • First-pass success. No additional overhead for approximately 95 percent of chunks.
  • LLM correction loop. Adds approximately 2–3 seconds and ~25 percent more tokens per retry of a chunk (the correction prompt includes the faulty output plus a reasoning field). Most mismatches resolve within one or two attempts.
  • Deterministic fallback. Instant and zero tokens — it’s pure algorithm. This applies to about 3–4 percent of chunks.

The net impact: 4–8 percent increase in total processing time, 6–10 percent increase in token cost. The tradeoff: zero blank-screen bugs and roughly 20 hours of manual QA avoided per 1,000 videos.

The verdict: Did it work?

After running this architecture against thousands of chunks across nine languages, the results confirmed our hypothesis that you can have fluent translations and valid subtitles, provided the pipeline expects failure:

  • The three-phase architecture does the heavy lifting. On the first pass (Phase 3: Line Mapping), nearly 95 percent of chunks map perfectly to the source lines.
  • The negotiator works. Of the 5 percent or so of chunks that fail, our LLM correction pipeline successfully fixes about 32 percent of them by finding valid synonyms or rephrasing.
  • The fallback catches the rest. The remaining failures (mostly in difficult languages like Hindi and German) are handled by the deterministic splitter, ensuring that 100 percent of the chunks reach the user in a valid state.

Interestingly, the data validated our geometry of language theory. Romance languages like Spanish and Italian had a clean pass rate of about 99 percent. Meanwhile, Hindi — with its reversed sentence structure — had a mismatch rate of 28.1 percent, which means that it relied heavily on our safety nets to function.

If you’re building something similar

Specifically, for AI-powered subtitles:

  • Expect structural mismatches in non-Romance languages. German, Hindi, and Japanese broke our sync at three to five times the rate of Spanish or Italian. Budget your error handling accordingly.
  • Separate creative translation from structural mapping. Asking one prompt to translate fluently and preserve line count is a losing battle. Split the tasks.
  • Build correction loops that assume failure. The question isn’t how to prevent mismatches. It’s what to do when they happen. Design the fallback chain first, not last.

Conclusion: The infrastructure tax of intelligence

Building this system taught us something counterintuitive about AI product development: making the model smarter can make the product worse (in some ways).

A dumb translator is easy to sync — it translates word for word, and every input line produces exactly one output line. A smarter translator that is prompted to consider nuances and semantics is harder to sync because it translates thought for thought. It merges, reorders, condenses. The output is better linguistically but breaks the infrastructure that expects predictable structure.

The solution wasn’t to dumb down the model. It was to pay the infrastructure tax — building smarter systems around the model to absorb its creative unpredictability:

  • Separate creative translation from structural mapping.
  • Build negotiation layers that expect failure.
  • Design fallbacks that preserve quality when the math doesn’t add up.
