MediaFM: Netflix's Multimodal AI Foundation Model for Media Understanding

Avneesh Saluja, Santiago Castro, Bowei Yan, Ashish Rastogi

Introduction

Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.

Many of these media-related tasks require sophisticated long-form video understanding, e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Previous work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help identify clip-level tone, or the point where a new scene starts, more precisely. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure present in long-form entertainment? Consisting of tens of millions of individual shots across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation to train multimodal media understanding models that enable many capabilities across the company, such as ads relevancy, clip popularity prediction, and clip tagging.
