How we built AI-powered subtitles at Vimeo

In the world of text, translation is largely viewed as a solved problem. If you take a transcript and feed it into a modern LLM, the output is often indistinguishable from human work. It catches nuances, fixes grammar, and flows fairly naturally.

But building a translation product for video adds a layer of complexity that pure text generation ignores: time.

At Vimeo, we leverage LLMs to generate subtitles for our users. And while getting the words right is the first hurdle, the actual product experience relies on something else entirely: Synchronization.

The core technical challenge isn’t just “What does this sentence mean in Japanese?” It is “How do we fit this meaning into the exact same three-second window as the original English speaker?”

Figure 1. Source and translation must stay in sync, line by line.

The conflict: Fluency vs. the timeline

Our subtitle system operates on a strict logic: it expects every line of source text to have a corresponding translated line to ensure proper timing. If the speaker talks from 0:05 to 0:08, the system expects text to fill that specific slot.
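
As a rough illustration of that one-line-per-slot rule, here is a minimal Python sketch. The names `Cue` and `align_translation` are hypothetical and chosen only for this example; the post does not describe Vimeo's actual data model or implementation. The sketch assumes the translation arrives as a plain list of lines, and it rejects any output whose line count does not match the source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle slot (hypothetical structure, for illustration only)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def align_translation(source: list[Cue], translated_lines: list[str]) -> list[Cue]:
    """Pair each translated line with the timing of its source cue.

    Raises ValueError if the translation has a different number of lines
    than the source, or if any slot would be left without text.
    """
    if len(translated_lines) != len(source):
        raise ValueError(
            f"expected {len(source)} translated lines, got {len(translated_lines)}"
        )
    aligned = []
    for cue, line in zip(source, translated_lines):
        if not line.strip():
            raise ValueError(f"empty translation for slot {cue.start}-{cue.end}")
        # Reuse the source timing unchanged; only the text is replaced.
        aligned.append(Cue(cue.start, cue.end, line.strip()))
    return aligned
```

With two source cues covering 0:05 to 0:08 and 0:08 to 0:10.5, passing two translated lines returns two cues with the same timings, while passing a single merged line raises `ValueError` rather than silently shifting the text out of sync.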

The problem is that LLMs are designed to be smart. When they detect a speaker stumbling over their words or using filler like um or you know, they tend to clean it up and merge fragmented thoughts into single, grammatically correct sentences. How aggressively they do this varies by model.

While this improves readability, it breaks the playback mechanics. Imagine a speaker says: “Um, you know, I think that we’re gonna get… [pause] …we’re gonna remove a lot of barriers.”
