Cost Efficiency @ Scale in Big Data File Format

Background

Our Apache Hadoop® based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for both interactive and long-running queries, serving the myriad needs of different teams at Uber.

Uber’s growth over the last few years exponentially increased both the volume of data and the associated access loads required to process it. As data volume grows, so do the associated storage and compute costs, resulting in growing hardware purchasing requirements, higher resource usage, and even out-of-memory (OOM) errors or long GC pauses. The main goal of this blog is to address storage cost efficiency, but the side benefits also include reduced CPU, I/O, and network consumption.

We started several initiatives to reduce storage cost, including setting a TTL (Time to Live) on old partitions, moving data from hot/warm to cold storage, and reducing data size at the file format level. In this blog, we will focus on reducing the data size in storage at the file format level, which in practice means Parquet.

Apache Parquet™ at Uber

Uber data is ingested into HDFS and registered as either raw or modeled tables, mainly in the Parquet format and with a small portion in the ORC file format. Our initiatives and the discussion in this blog are around Par...
