Upgrading Airbnb's Data Warehouse Infrastructure
This post describes Airbnb’s experience upgrading its Data Warehouse infrastructure to Spark and Iceberg.
By: Ronnie Zhu, Edgar Rodriguez, Jason Xu, Gustavo Torres, Kerim Oktay, Xu Zhang
Introduction
In this post, we explain our motivations for upgrading our Data Warehouse infrastructure to Spark 3 and Iceberg. We briefly describe the current state of Airbnb’s data warehouse infrastructure and its challenges, then share what we learned from upgrading one critical production workload: event data ingestion. Finally, we present the results and the lessons learned.
Context
Airbnb’s Data Warehouse (DW) storage was previously migrated from legacy HDFS clusters to S3 to provide better stability and scalability. While our team has continued to improve the reliability and stability of the workloads that operate on data in S3, certain characteristics of these workloads and the infrastructure they depend on introduce scalability and productivity limitations that our users encounter on a regular basis.
Challenges
Hive Metastore
As the number of partitions has grown, the load on the Hive Metastore’s backend DBMS has become a bottleneck, as has the load from partition operations (e.g., querying thousands of partitions for a month’s worth of data). As a workaround, we usually add a daily aggregation stage and keep two tables for queries at different time granularities (e.g., hourly and daily). To save on storage, we limit intraday Hive tables to a short retention period (three days) and keep daily tables for much longer (several years).
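To make the workaround concrete, here is a minimal Python sketch of the partition arithmetic behind it and of how a query might be routed between the two tables. It is an illustration under stated assumptions, not our production logic: the table names `events_hourly` and `events_daily` are hypothetical, and the count assumes one partition per hour or per day with no additional partition keys (each extra key would multiply the totals).

```python
from datetime import date, timedelta

# Retention for intraday (hourly) tables, taken from the text above.
HOURLY_RETENTION_DAYS = 3


def partitions_scanned(start: date, end: date, granularity: str) -> int:
    """Count partitions touched by a query over the inclusive range [start, end].

    Assumes a single time-based partition key and nothing else.
    """
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be earlier than start")
    if granularity == "daily":
        return days
    if granularity == "hourly":
        return days * 24
    raise ValueError(f"unknown granularity: {granularity}")


def choose_table(start: date, today: date, needs_hourly: bool) -> str:
    """Pick a (hypothetical) table name for a query starting at `start`.

    The hourly table is used only when hourly detail is requested and the
    range starts inside the short retention window; everything else falls
    back to the daily table.
    """
    oldest_hourly = today - timedelta(days=HOURLY_RETENTION_DAYS)
    if needs_hourly and start >= oldest_hourly:
        return "events_hourly"
    return "events_daily"
```

For a 30-day range, `partitions_scanned` gives 720 hourly partitions against 30 daily ones, which is the reduction in partition operations that the daily aggregation stage buys, at the cost of maintaining a second table.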