How we seamlessly migrated high-volume real-time traffic from one service to another with zero data loss and duplication
At Grab, we continuously enhance our systems to improve scalability, reliability and cost-efficiency. Recently, we undertook a project to split the read and write functionalities of one of our backend services into separate services. This was motivated by the need to independently scale these operations based on their distinct scalability requirements.
In this post, we will dive deep into how we migrated the stream processing (write) functionality to a new service with zero data loss and duplication. This was accomplished while handling a high volume of real-time traffic averaging 20,000 reads per second from 16 source Kafka streams writing to other output streams and several DynamoDB tables.
Migration challenges and strategy
Migrating the stream processing to the new service while ensuring zero data loss and duplication posed some interesting challenges, especially given the high volume of real-time data. We needed a strategy that would enable us to:
- Migrate streams gradually, one at a time.
- Validate the new service’s processing in production before fully switching over.
- Perform the switchover with no downtime or data inconsistencies.
We considered various options for the switchover such as using feature flags via our unified config management and experimental rollout platform. However, these approaches had some limitations:
- Toggling the flags requires a deployment that can take up to a few minutes, and some data could be lost or duplicated during that window.
- There might be data inconsistencies as the flag value could be updated on the services (the existing and the new one) at slightly d...