Understanding the Parquet file format
This is part of a series of related posts on Apache Arrow. Other posts in the series are:
Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.
Key features of parquet are:
- it’s cross platform
- it’s a recognised file format used by many systems
- it stores data in a column layout
- it stores metadata
The latter two points allow for efficient storage and querying of data.
Column Storage
Suppose we have a simple data frame:
tibble::tibble(id = 1:3, name = c("n1", "n2", "n3"), age = c(20, 35, 62))
#> # A tibble: 3 × 3
#>      id name    age
#>   <int> <chr> <dbl>
#> 1     1 n1       20
#> 2     2 n2       35
#> 3     3 n3       62
If we stored this data set as a CSV file, the file would mirror what we see in the R terminal: one line per record, with all of that record's values side by side. This is row storage. It is efficient for queries that need a whole record, such as,
SELECT * FROM table_name WHERE id = 2
We simply go to the 2nd row and retrieve that data. It’s also very easy to append rows to the data set - we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row is related to age, and extract that value.
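To make that cost concrete, here is a minimal base R sketch of summing age from a row layout. It assumes the CSV text is already in memory as a character vector; the names `csv_lines`, `age_pos` and `age_total` are made up for this illustration, and a real CSV reader would also handle quoting and type detection.

```r
# Row layout: each line of the CSV holds one complete record.
csv_lines <- c(
  "id,name,age",
  "1,n1,20",
  "2,n2,35",
  "3,n3,62"
)

# Find the position of the age field from the header line.
header <- strsplit(csv_lines[1], ",", fixed = TRUE)[[1]]
age_pos <- match("age", header)

# Every data row has to be split into fields just to reach one value.
rows <- strsplit(csv_lines[-1], ",", fixed = TRUE)
ages <- vapply(rows, function(fields) as.numeric(fields[age_pos]), numeric(1))

age_total <- sum(ages)
```

Note that the id and name fields are read and split on every row, even though only age is needed.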
Parquet uses column storage. In column layouts, column data are stored sequentially.
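As a rough in-memory analogy (not Parquet's actual on-disk encoding), the same data can be held as one vector per column. The names `columns` and `row_2` are hypothetical and exist only for this sketch.

```r
# Column layout: all values of one column sit next to each other.
columns <- list(
  id   = c(1L, 2L, 3L),
  name = c("n1", "n2", "n3"),
  age  = c(20, 35, 62)
)

# Summing age touches only the age vector; id and name are never read.
age_total <- sum(columns$age)

# Retrieving one full record now means gathering a value from every column.
row_2 <- lapply(columns, function(column) column[2])
```

The trade-off is visible in the last line: a single-column aggregate becomes cheap, while rebuilding a full record requires a lookup in each column.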
With this layout, q...