Contributing to Debezium: Fixing Logical Replication at Scale

At Zalando, we run hundreds of event streams powered by PostgreSQL logical replication through our Fabric Event Streams platform, a Kubernetes-based approach that lets teams declare event streams sourced directly from their Postgres databases. Each stream declaration provisions a micro application that uses Debezium in embedded mode to publish row-level change events as they occur. At peak traffic, these connectors together process hundreds of thousands of events per second across our 100+ Kubernetes clusters.
This infrastructure has been in operation since late 2018, processing billions of events over the years, but getting here required solving some hard problems with logical replication. This is the story of how we contributed two features to Debezium that we hope will help everyone using logical replication at scale.
The WAL Growth Problem Returns
A couple of years ago, our colleague Declan Murphy wrote about a critical issue with PostgreSQL logical replication where low-activity databases experienced runaway Write-Ahead Log (WAL) growth. The problem was simple: replication slots wouldn't advance without table activity, causing WAL to pile up until disk space ran out. Our single biggest operational issue when rolling out this event infrastructure at scale was uncontrolled WAL growth on low-activity databases, even with heartbeat configured.
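To make the failure mode concrete, here is a minimal sketch of how the backlog behind an idle slot grows. It assumes only the textual LSN format PostgreSQL uses (two hexadecimal numbers separated by a slash, high 32 bits then low 32 bits). The function names, the tick model, and the 16 MiB figure are hypothetical and chosen for illustration; this is not code from our platform.

```python
from typing import List


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL between the slot's position and the server's current position."""
    return parse_lsn(current_lsn) - parse_lsn(slot_lsn)


def idle_slot_backlog(slot_lsn: str, current_lsn: str,
                      wal_bytes_per_tick: int, ticks: int) -> List[int]:
    """Hypothetical model: the server keeps writing WAL while the slot never advances."""
    slot = parse_lsn(slot_lsn)
    current = parse_lsn(current_lsn)
    backlog = []
    for _ in range(ticks):
        current += wal_bytes_per_tick   # WAL keeps being written elsewhere on the server
        backlog.append(current - slot)  # the idle slot stays where it was
    return backlog


if __name__ == "__main__":
    # Assumed figure: 16 MiB of WAL written per tick.
    for tick, size in enumerate(idle_slot_backlog("0/1000", "0/1000", 16 * 1024 * 1024, 3), 1):
        print(f"tick {tick}: {size} bytes retained")
```

In this model the backlog grows linearly and without bound, which matches the behaviour described above: nothing moves the slot forward, so nothing lets the server discard the WAL behind it.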
As detailed in Declan's blog post, we fixed this upstream in the PostgreSQL JDBC driver by having the driver respond to keepalive messages from Postgres, advancing the rep...