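Company: Netflix
Netflix (/ˈnɛtflɪks/) (official Chinese name 网飞; unofficial Chinese name 奈飞) is an OTT service company that originated in the United States and provides online video on demand around the world. It also operates a flat-rate disc-by-mail rental service in the United States, which uses return-mail envelopes to ship DVD and Blu-ray rental discs to the address the customer specifies. The company was founded by Reed Hastings and Marc Randolph on August 29, 1997, is headquartered in Los Gatos, California, and began offering a subscription service in 1999. By 2009, Netflix offered more than 100,000 film titles on DVD and had more than 10 million subscribers. As of June 2022, Netflix's streaming service had 220 million subscribers worldwide, 73.3 million of them in the United States. Its main competitors include Disney+, Hulu, HBO Max, Amazon Prime Video, YouTube Premium, and Apple TV+.
Netflix has appeared on a number of rankings. On June 6, 2017, it placed 92nd in the 2017 BrandZ Top 100 Most Valuable Global Brands. In October 2018, it ranked eighth on Fortune's Future 50 list. In December 2018, it ranked 88th in the World Brand Lab's 2018 list of the world's top 500 brands. It ranked 261st on Fortune's 2018 list of the 500 largest companies and has climbed year after year. In October 2019, it ranked 46th on the Forbes 2019 list of the top 100 global digital economy companies. Also in October 2019, it ranked 65th in Interbrand's top 100 global brands. On January 22, 2020, it ranked 16th on Fortune's 2020 World's Most Admired Companies list. In February 2022, Netflix was the world's second-largest media and entertainment company by market capitalization. Netflix joined the Motion Picture Association (MPA) in 2019. Some media outlets also count Netflix among the Big Tech companies.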
Investigation of a Workbench UI Latency Issue
At Netflix, the Analytics and Developer Experience organization, part of the Data Platform, offers a product called Workbench. Workbench is a remote development workspace based on Titus that allows data practitioners to work with big data and machine learning use cases at scale. A common use case for Workbench is running JupyterLab Notebooks.
Recently, several users reported that their JupyterLab UI becomes slow and unresponsive when running certain notebooks. This document details the intriguing process of debugging this issue, all the way from the UI down to the Linux kernel.
Introducing Netflix’s TimeSeries Data Abstraction Layer
As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital. In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. The Key-Value Abstraction offers a flexible, scalable solution for storing and accessing structured key-value data, while the Data Gateway Platform provides essential infrastructure for protecting, configuring, and deploying the data tier.
Building on these foundational abstractions, we developed the TimeSeries Abstraction — a versatile and scalable solution designed to efficiently store and query large volumes of temporal event data with low millisecond latencies, all in a cost-effective manner across various use cases.
Introducing Netflix’s Key-Value Data Abstraction Layer
In this post, we dive deep into how Netflix’s KV abstraction works, the architectural principles guiding its design, the challenges we faced in scaling diverse use cases, and the technical innovations that have allowed us to achieve the performance and reliability required by Netflix’s global operations.
Noisy Neighbor Detection with eBPF
The Compute and Performance Engineering teams at Netflix regularly investigate performance issues in our multi-tenant environment. The first step is determining whether the problem originates from the application or the underlying infrastructure. One issue that often complicates this process is the "noisy neighbor" problem. On Titus, our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers. We usually focus on CPU utilization because it is our workloads’ most frequent source of noisy neighbor issues.
Detecting the effects of noisy neighbors is complex. Traditional performance analysis tools such as perf can introduce significant overhead, risking further performance degradation. Additionally, these tools are typically deployed after the fact, which is too late for effective investigation. Another challenge is that debugging noisy neighbor issues requires significant low-level expertise and specialized tooling. In this blog post, we'll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues. You’ll learn how Linux kernel instrumentation can improve your infrastructure observability with deeper insights and enhanced monitoring.
Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future
Pushy is Netflix’s WebSocket server that maintains persistent WebSocket connections with devices running the Netflix application. This allows data to be sent to the device from backend services on demand, without the need for continual polling requests from the device. Over the last few years, Pushy has seen tremendous growth, evolving from a best-effort message delivery service into an integral part of the Netflix ecosystem. This post describes how we’ve grown and scaled Pushy to meet its new and future needs, as it handles hundreds of millions of concurrent WebSocket connections, delivers hundreds of thousands of messages per second, and maintains a steady 99.999% message delivery reliability rate.
Recommending for Long-Term Member Satisfaction at Netflix
Our mission at Netflix is to entertain the world. Our personalization algorithms play a crucial role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement; we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix and makes them more likely to remain members.
Improve Your Next Experiment by Learning Better Proxy Metrics From Past Experiments
By Aurélien Bibaut, Winston Chou, Simon Ejdemyr, and Nathan Kallus
Investigation of a Cross-regional Network Performance Issue
Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming, and gaming services. Built on Amazon Web Services (AWS), our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.
Java 21 Virtual Threads - Dude, Where’s My Lock?
Getting real with virtual threads.
Maestro: Netflix’s Workflow Orchestrator
Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, and task distribution to compute engines. Users can package their business logic in various formats such as Docker images, notebooks, bash scripts, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflows, and conditional branches.
Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding
Applying Quality of Service techniques at the application level.
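As a rough illustration of what prioritized load shedding can look like at the application level, the sketch below rejects lower-priority requests first as utilization rises. The priority names, the thresholds, and the `should_shed` helper are all hypothetical and chosen for illustration; they are not taken from the Netflix post.

```python
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    """Hypothetical request priorities; a lower value means more important."""
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


# Assumed utilization thresholds (0.0 to 1.0) at or above which each
# priority class is shed. Less important classes are shed earlier.
SHED_THRESHOLDS: Dict[Priority, float] = {
    Priority.BULK: 0.60,
    Priority.BEST_EFFORT: 0.70,
    Priority.DEGRADED: 0.80,
    Priority.CRITICAL: 0.95,
}


def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if a request of this priority should be rejected."""
    return utilization >= SHED_THRESHOLDS[priority]


# At 75% utilization only the two least important classes are rejected.
decisions = {p.name: should_shed(p, 0.75) for p in Priority}
print(decisions)
```

The point of the staggered thresholds is that the most important traffic keeps being served until the system is close to saturation, while less important traffic gives way early.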
Video annotator: a framework for efficiently building video classifiers using vision-language models and active learning
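This post introduces Video Annotator (VA), an interactive framework for labeling video data. VA uses the zero-shot capabilities of large vision-language models together with active learning to improve sample efficiency and reduce cost. It offers a distinctive way to annotate, manage, and iterate on video classification datasets, emphasizing the direct involvement of domain experts in a human-in-the-loop system. By letting users make quick decisions during annotation, VA improves the overall efficiency of the system. It also supports a continuous annotation process in which users can rapidly deploy models, monitor quality, and quickly fix problems. This self-service architecture lets domain experts make improvements without involving data scientists or third-party annotators, which builds trust in the system. In experiments across several video understanding tasks, VA achieved an average improvement of 8.3 points in Average Precision over competing baselines. The authors also released a dataset with 153k labels and code to replicate the experiments.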
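To make the active learning idea concrete, here is a minimal sketch of uncertainty sampling, one common active learning strategy. It is an assumption for illustration and not necessarily the selection rule VA uses; the clip ids, scores, and the `select_for_annotation` helper are hypothetical.

```python
from typing import Dict, List


def select_for_annotation(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k clips whose predicted probability is closest to 0.5.

    `scores` maps a clip id to the classifier's predicted probability
    that the clip is a positive example. Clips near 0.5 are the ones the
    model is least sure about, so labeling them is most informative.
    """
    ranked = sorted(scores, key=lambda clip_id: abs(scores[clip_id] - 0.5))
    return ranked[:k]


# Hypothetical model scores for four unlabeled clips.
scores = {"clip_a": 0.97, "clip_b": 0.52, "clip_c": 0.10, "clip_d": 0.41}
picked = select_for_annotation(scores, 2)
print(picked)
```

In a loop, the annotator would label the selected clips, the classifier would be retrained on the enlarged label set, and the selection would run again on the remaining clips.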
Round 2: A Survey of Causal Inference Applications at Netflix
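The article discusses Netflix's experimentation platform and stresses the importance of considering user needs and data presentation in product design. By comparing the strengths and weaknesses of tables, pie charts, stacked bar charts, and bar charts, the author shows how design affects users' understanding of data. The author also covers the role of design in interaction experience and product strategy, and how attention to design helps ensure that tools let teams learn as much as possible from experiments. The article also mentions a talk by Professor Kosuke Imai of Harvard University introducing the "cram method," an approach for learning and evaluating treatment policies.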
The Making of VES: the Cosmos Microservice for Netflix Video Encoding
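Netflix's Cosmos platform is a next-generation media computing platform intended to modernize Netflix's media processing pipelines by improving flexibility, efficiency, and developer productivity. One of its microservices is the Video Encoding Service (VES), which encodes incoming source material into video streams suitable for Netflix streaming or for certain studio/production uses. By supporting multiple codec formats, resolutions, and quality levels, VES meets requirements for multi-device support, low latency, rapid innovation, and cost efficiency. VES is built on the three layers of Cosmos, namely the API layer (Optimus), the workflow layer (Plato), and the serverless compute layer (Stratum), which communicate asynchronously through Timestone, a priority-based messaging system. In building VES, the team learned several lessons about microservice architecture, including defining an appropriate service scope and using continuous releases to support new business needs, improve performance, and strengthen resilience. In the second iteration, the team consolidated encoding for different codec formats into a single service, reducing code duplication while still letting each codec format evolve independently. The team also emphasizes being pragmatic about data modeling and balancing sharing against coupling.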
Reverse Searching Netflix’s Federated Graph
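Netflix's Content Engineering team worked with the Studio Engineering team to build the Reverse Search feature. Given a document, it finds the search criteria that match it, which yields precise results. Search criteria are saved as SavedSearches, converted into Elasticsearch queries, and indexed in a percolator field. Reverse search can also be used to build more responsive UIs: with GraphQL subscriptions, search results update in real time instead of being fetched once. These subscriptions can be associated with a SavedSearch and use reverse search to determine when to update the set of keys the subscription returns. In short, reverse search is a powerful external criteria matcher that applies not only to movie criteria but to any index with reverse search capability.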
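The core idea of reverse search, matching a document against stored queries rather than a query against stored documents, can be sketched in memory as follows. This is a simplified illustration and does not use Elasticsearch or its percolator; the `SavedSearch` class here, its exact-match criteria, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Document = Dict[str, Any]


@dataclass
class SavedSearch:
    """A stored search: an id plus field values a document must have."""
    search_id: str
    criteria: Dict[str, Any]  # field name -> required value (exact match)

    def matches(self, doc: Document) -> bool:
        # A document matches when every criterion is satisfied.
        return all(doc.get(field) == value
                   for field, value in self.criteria.items())


def reverse_search(doc: Document, saved: List[SavedSearch]) -> List[str]:
    """Return the ids of every saved search that the document satisfies."""
    return [s.search_id for s in saved if s.matches(doc)]


saved_searches = [
    SavedSearch("spanish-movies", {"type": "movie", "language": "es"}),
    SavedSearch("all-series", {"type": "series"}),
    SavedSearch("all-movies", {"type": "movie"}),
]

# A changed document is checked against all saved searches at once.
changed_doc = {"type": "movie", "language": "es", "title": "Example"}
matched = reverse_search(changed_doc, saved_searches)
print(matched)
```

In the subscription scenario described above, the ids returned for a changed document would identify which subscriptions need to be notified.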
Sequential A/B Testing Keeps the World Streaming Netflix Part 2: Counting Processes
Michael Lindon, Chris Sanden, Vache Shirikian, Yanjun Liu, Minal Mishra, Martin Tingley