Improving data processing efficiency using partial deserialization of Thrift

Bhalchandra Pandit | Software Engineer

At Pinterest we’ve worked to greatly improve data processing efficiency. One quote that resonates with our unique approach is from writer Antoine de Saint-Exupéry: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

We process petabytes of Thrift-encoded data at Pinterest. Most jobs that access this data need only a part of it. To meet our needs, we devised a way to efficiently deserialize only the desired subsets of Thrift structures in each job. Our solution significantly decreased our data processing resource usage: about a 20% reduction in vcore usage, a 27% reduction in memory usage, and a 36% reduction in intermediate data (mapper output).
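
To make the idea concrete, here is a minimal sketch of partial deserialization. It is not Pinterest's implementation. It assumes a simplified record layout modeled on Thrift's binary protocol (a 1-byte type, a 2-byte field id, then the value, with a zero byte ending each struct) and handles only a few types. All function names (`read_partial`, `skip`, `enc_i32`, and so on) are hypothetical helpers introduced for illustration.

```python
import struct

# Type codes modeled on Thrift's binary protocol; only these are handled here.
T_STOP, T_I32, T_I64, T_STRING, T_STRUCT = 0, 8, 10, 11, 12
STOP = b"\x00"


def write_field(ftype, fid, payload):
    """Encode one field: 1-byte type, 2-byte big-endian field id, value bytes."""
    return struct.pack(">bh", ftype, fid) + payload


def enc_i32(fid, value):
    return write_field(T_I32, fid, struct.pack(">i", value))


def enc_str(fid, text):
    raw = text.encode("utf-8")
    return write_field(T_STRING, fid, struct.pack(">i", len(raw)) + raw)


def enc_struct(fid, body):
    """`body` must already end with STOP."""
    return write_field(T_STRUCT, fid, body)


def skip(buf, pos, ftype):
    """Advance past a value without building any object for it."""
    if ftype == T_I32:
        return pos + 4
    if ftype == T_I64:
        return pos + 8
    if ftype == T_STRING:
        (length,) = struct.unpack_from(">i", buf, pos)
        return pos + 4 + length
    if ftype == T_STRUCT:
        while True:
            inner = buf[pos]
            pos += 1
            if inner == T_STOP:
                return pos
            pos = skip(buf, pos + 2, inner)  # +2 jumps over the field id
    raise ValueError(f"unsupported type code {ftype}")


def read_partial(buf, wanted, pos=0):
    """Decode only top-level fields whose id is in `wanted`.

    Returns (decoded fields by id, position just past the struct).
    """
    out = {}
    while True:
        ftype = buf[pos]
        pos += 1
        if ftype == T_STOP:
            return out, pos
        (fid,) = struct.unpack_from(">h", buf, pos)
        pos += 2
        if fid not in wanted:
            pos = skip(buf, pos, ftype)  # unwanted: never materialized
        elif ftype == T_I32:
            out[fid] = struct.unpack_from(">i", buf, pos)[0]
            pos += 4
        elif ftype == T_I64:
            out[fid] = struct.unpack_from(">q", buf, pos)[0]
            pos += 8
        elif ftype == T_STRING:
            (length,) = struct.unpack_from(">i", buf, pos)
            out[fid] = buf[pos + 4 : pos + 4 + length].decode("utf-8")
            pos += 4 + length
        else:
            raise ValueError(f"this sketch cannot decode type code {ftype}")


# A record with four fields; the job below needs only fields 1 and 4.
record = (
    enc_i32(1, 42)
    + enc_str(2, "x" * 1000)
    + enc_struct(3, enc_i32(1, 7) + enc_str(2, "nested") + STOP)
    + enc_str(4, "pin")
    + STOP
)
fields, end = read_partial(record, wanted={1, 4})
print(fields)  # {1: 42, 4: 'pin'}
```

In this sketch the 1000-character string and the nested struct are stepped over by offset arithmetic alone, so no string is decoded and no nested object is allocated for them. That is the kind of work a job avoids when it needs only a subset of a structure.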

Motivation

Pinterest is a data-driven company. We use data for everything that matters to the business, including analytics, machine learning, and experimentation.

We have many large datasets encoded in the Apache Thrift format, totaling tens of petabytes. A large number of offline data processing workflows process those datasets every day. A significant portion of that processing involves deserializing Thrift structures from serialized bytes on disk, which costs us significant time and money. Any optimization that reduces this deserialization cost would yield considerable savings.

No suitable alternatives

Even though many data processing jobs require a subset of the fields of a Thrift str...
