Building a Resilient Data Platform with Write-Ahead Log at Netflix

By Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, John Lu

Introduction

Netflix operates at a massive scale, serving hundreds of millions of users with diverse content and features. Behind the scenes, ensuring data consistency, reliability, and efficient operations across various services presents a continuous challenge. At the heart of many critical functions lies the concept of a Write-Ahead Log (WAL) abstraction. At Netflix scale, every challenge gets amplified. Some of the key challenges we encountered include:

  • Accidental data loss and data corruption in databases
  • System entropy across different datastores (e.g., writing to Cassandra and Elasticsearch)
  • Handling updates to multiple partitions (e.g., building secondary indices on top of a NoSQL database)
  • Data replication (in-region and across regions)
  • Reliable retry mechanisms for real-time data pipelines at scale
  • Bulk deletes to the database causing out-of-memory (OOM) errors on the Key-Value nodes

Each of these challenges resulted in production incidents or outages, consumed significant engineering resources, or led to bespoke solutions and technical debt. During one particular incident, a developer issued an ALTER TABLE command that led to data corruption. Fortunately, the data was fronted by a cache, so being able to quickly extend the cache TTL, combined with the application writing its mutations to Kafka, allowed us to recover. Without these resilience features in the application, the data would have been permanently lost. As the data platform team, we needed to provide resilience and guarantees to protect not just this application, but all the critical applications we have at Netflix.
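
To make the recovery idea concrete, here is a minimal sketch of the general write-ahead pattern: each mutation is appended to a log before it is applied, so a damaged store can be rebuilt by replaying the log. This is an illustration only, not Netflix's implementation. The names `Mutation`, `MutationLog`, `write`, and `rebuild` are hypothetical, and the in-memory list merely stands in for a durable log such as a Kafka topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    """One logged write (hypothetical shape); value=None represents a delete."""
    key: str
    value: Optional[str]


class MutationLog:
    """In-memory stand-in for a durable, append-only log (assumption)."""

    def __init__(self) -> None:
        self._entries: list[Mutation] = []

    def append(self, mutation: Mutation) -> int:
        """Append a mutation and return its offset in the log."""
        self._entries.append(mutation)
        return len(self._entries) - 1

    def read_from(self, offset: int = 0) -> list[Mutation]:
        """Return all mutations at or after the given offset."""
        return self._entries[offset:]


def apply_mutation(store: dict[str, str], mutation: Mutation) -> None:
    """Apply a single mutation to a key-value store."""
    if mutation.value is None:
        store.pop(mutation.key, None)
    else:
        store[mutation.key] = mutation.value


def write(log: MutationLog, store: dict[str, str], mutation: Mutation) -> None:
    """Record the mutation in the log first, then apply it to the store."""
    log.append(mutation)
    apply_mutation(store, mutation)


def rebuild(log: MutationLog) -> dict[str, str]:
    """Reconstruct a store from scratch by replaying the whole log in order."""
    store: dict[str, str] = {}
    for mutation in log.read_from(0):
        apply_mutation(store, mutation)
    return store


if __name__ == "__main__":
    log = MutationLog()
    store: dict[str, str] = {}
    write(log, store, Mutation("title", "draft"))
    write(log, store, Mutation("title", "final"))
    store.clear()  # simulate the store being lost or corrupted
    print(rebuild(log))
```

Because the log is appended to before the store changes, replaying it in order reproduces the store's last known state. A real system would also need durable storage, ordering guarantees, and a retention policy, none of which this sketch attempts.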