Enhancing Flink Deployments with Shadow Testing

Introduction

Ensuring the reliability of Apache Flink deployments at Grab is crucial for the availability of our business-critical, real-time applications. Although every application is tested in a staging environment before being promoted to production, one class of issues only surfaces once the application is deployed in production, for example:

  • The new version of the application is unable to cope with the volume or the nature of production traffic.
  • The new version of the application is unable to resume from a production checkpoint or savepoint taken by the previous version of the application.
  • Certain environment-specific dependencies or configurations are malfunctioning or misconfigured.

When an application hits such an issue after being deployed in production, our in-house deployment system automatically rolls it back after 10 minutes of observation, so the application is down for roughly that long.

In this article, we will describe how Grab’s data streaming team (Coban) has enriched the traditional deployment pipeline for Flink applications with a Shadow Testing stage that eliminates this downtime during deployment failures, enhancing the availability of our Flink applications during this critical moment of their lifecycle.

Shadow Testing is a testing technique whereby a new version of an application (Shadow) is deployed in parallel with the current version of the application (Main), but without impacting it. It involves replicating production data to the new version of the application and comparing its behavior with the current version of the application to identify potential issues ...
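
The core idea in the definition above (feed the same records to Main and Shadow, compare their behavior, and never let Shadow affect Main) can be sketched in a few lines of Python. This is only an illustrative toy, not Grab's implementation: the names `shadow_test`, `ShadowReport`, `main_v1`, and `shadow_v2` are hypothetical, and plain functions stand in for the two versions of a Flink application. The source does not describe here how replication or comparison is actually carried out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class ShadowReport:
    """Collects what was observed while Shadow ran next to Main (hypothetical type)."""
    total: int = 0
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)
    shadow_errors: List[Tuple[Any, str]] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        # Shadow is considered healthy only if it never failed and never diverged.
        return not self.mismatches and not self.shadow_errors


def shadow_test(
    records: Iterable[Any],
    main: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
) -> Tuple[List[Any], ShadowReport]:
    """Run every record through Main and Shadow and compare the results.

    Main's outputs are always returned unchanged: a failure in Shadow is
    recorded in the report but never propagates to Main.
    """
    report = ShadowReport()
    main_outputs: List[Any] = []
    for record in records:
        report.total += 1
        expected = main(record)           # Main keeps serving as usual
        main_outputs.append(expected)
        try:
            actual = shadow(record)       # Shadow sees a copy of the same input
        except Exception as exc:          # isolate Shadow failures from Main
            report.shadow_errors.append((record, repr(exc)))
            continue
        if actual != expected:
            report.mismatches.append((record, expected, actual))
    return main_outputs, report


# Hypothetical stand-ins for the current and the new application version.
def main_v1(value: int) -> int:
    return value * 2


def shadow_v2(value: int) -> int:
    if value < 0:
        raise ValueError("negative input not supported")
    return value * 2


outputs, report = shadow_test([1, 2, -3], main_v1, shadow_v2)
print(outputs, report.healthy, report.shadow_errors)
```

In this toy run Main still produces an output for every record, while the report flags that the new version fails on the negative input, which is the kind of signal a deployment pipeline could use to reject the new version before it replaces the current one.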
