Fast Copy-On-Write within Apache Parquet for Data Lakehouse ACID Upserts


Source: www.uber.com

Abstract

As the table formats Apache Hudi®, Apache Iceberg®, and Delta Lake™ have matured, more and more companies are building their lakehouses on top of them for use cases such as incremental ingestion. Upsert speed, however, can still become a problem as data volumes grow.

In these table formats, Apache Parquet is the main file format. In this article, we discuss how we built a row-level secondary index and the innovations we introduced in Apache Parquet to speed up upserts inside a Parquet file. We also present benchmark results showing much faster upserts than traditional copy-on-write in Delta Lake and Hudi.
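The abstract does not spell out the mechanism, so the following is only a toy mental model and not the implementation described in the article. It assumes a file can be modeled as a list of row groups held in memory and that a row-level index maps each record key to the row group containing it. All names here (`build_index`, `cow_upsert`, the `id` key) are hypothetical. The sketch shows the general idea of a copy-on-write upsert that rebuilds only the row groups touched by an update and carries the rest over unchanged.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
RowGroup = List[Row]        # stand-in for a Parquet row group (assumption)
DataFile = List[RowGroup]   # stand-in for one Parquet file (assumption)


def build_index(data_file: DataFile, key: str) -> Dict[object, int]:
    """Map each record key to the position of the row group that holds it."""
    return {row[key]: gi for gi, group in enumerate(data_file) for row in group}


def cow_upsert(data_file: DataFile, updates: List[Row], key: str) -> Tuple[DataFile, int]:
    """Return a new file version and the number of row groups rewritten.

    Row groups without updated keys are reused as-is. Row groups with
    updated keys are rebuilt. Keys not seen before go into a new group.
    """
    index = build_index(data_file, key)
    by_group: Dict[int, Dict[object, Row]] = {}
    inserts: List[Row] = []
    for row in updates:
        gi = index.get(row[key])
        if gi is None:
            inserts.append(row)                        # new key: insert
        else:
            by_group.setdefault(gi, {})[row[key]] = row  # existing key: update

    new_file: DataFile = []
    rewritten = 0
    for gi, group in enumerate(data_file):
        if gi in by_group:
            changed = by_group[gi]
            new_file.append([changed.get(r[key], r) for r in group])
            rewritten += 1
        else:
            new_file.append(group)                     # carried over untouched
    if inserts:
        new_file.append(inserts)
    return new_file, rewritten


old = [[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], [{"id": 3, "v": "c"}]]
new, n = cow_upsert(old, [{"id": 3, "v": "C"}, {"id": 4, "v": "d"}], "id")
print(n, new)
```

In this model the old version of the file is never mutated, which is what makes the write copy-on-write. The saving comes from skipping work on row groups the index shows to be unaffected.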


Shared by xiaozi on 2023-07-01
