Flattening Debezium Events with SQL in Snowflake
As a company grows, its data engineering team has to scale to meet increasing demand from BI reporting, product analytics, and data science. In the beginning, it’s enough to report directly from the application database. Next, you might add some read-only databases to move the analytics workloads off of the primary database. Eventually, you may arrive at an architecture similar to the one in the diagram above: you replicate your application’s database to a separate database that is optimized for analytics. This approach lets a company scale for application and analytics requirements independently.
In the case of Vimeo, we want to replicate MySQL database tables to Snowflake, the cloud-based data warehouse that we use. But as we capture changes from MySQL into our data warehouse, we need to do some transformations.
This article explains how we approach transforming the MySQL changes into Snowflake tables, which is sure to be of interest to you if you stream Debezium events into Snowflake using the Snowflake Connector for Kafka and you want to flatten these events into tables. (What’s Debezium? Read on.)
A little background
Change data capture (CDC) is a common approach for replicating a source system’s state into a data warehouse: changes to data in the source are captured and delivered to consumers, which act on them. Replicating into a central data warehouse this way facilitates analytics and decision-making based on multiple data sources.
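To make the idea concrete, here is a minimal sketch of a CDC consumer that replays change events to rebuild a table's state. It assumes events shaped like Debezium's change event envelope, with an `op` code and `before`/`after` row images. The `apply_changes` function, the `id` key, and the sample rows are hypothetical names chosen for illustration, not part of our pipeline.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Replay change events in order and return the resulting table state."""
    state: Dict[Any, Row] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, or update: the "after" image is the new row
            row = event["after"]
            state[row[key]] = row
        elif op == "d":
            # delete: only the "before" image is populated
            row = event["before"]
            state.pop(row[key], None)
    return state


# Hypothetical sample events for a table keyed by "id".
events = [
    {"op": "c", "before": None, "after": {"id": 1, "title": "draft"}},
    {"op": "c", "before": None, "after": {"id": 2, "title": "intro"}},
    {"op": "u", "before": {"id": 1, "title": "draft"}, "after": {"id": 1, "title": "final"}},
    {"op": "d", "before": {"id": 2, "title": "intro"}, "after": None},
]

replica = apply_changes(events)
```

After the replay, `replica` holds only the latest version of each surviving row, which is the state a replicated warehouse table is meant to reflect.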
We’ve set u...