Improving efficiency and reducing runtime using S3 read optimization

Bhalchandra Pandit | Software Engineer

Overview

We describe a novel approach to improving S3 read throughput and how we used it to make our production jobs more efficient. The results have been very encouraging. A standalone benchmark showed a 12x improvement in S3 read throughput (from 21 MB/s to 269 MB/s). The increased throughput allowed our production jobs to finish sooner. As a result, we saw a 22% reduction in vcore-hours, a 23% reduction in memory-hours, and a similar reduction in the run time of a typical production job. Although we are happy with the results, we are exploring additional enhancements, which are briefly described at the end of this blog.

Motivation

We process petabytes of data stored on Amazon S3 every day. When we inspect the relevant metrics of our MapReduce/Cascading/Scalding jobs, one thing stands out: slower-than-expected mapper speed. In most cases, the observed mapper speed is around 5–7 MB/sec. That is more than an order of magnitude slower than the observed throughput of commands such as aws s3 cp, where speeds of 200+ MB/sec are common (observed on a c5.4xlarge instance in EC2). If we can increase the speed at which our jobs read data, they will finish sooner and save us considerable time and money in the process. Because processing is costly, these savings can quickly add up to a substantial amount.
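To make the gap concrete, here is a rough back-of-the-envelope sketch of how long it takes to read the same input at the speeds quoted above. The 1 GiB input size and the `read_seconds` helper are assumptions chosen purely for illustration; they are not taken from our jobs.

```python
def read_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Return the time in seconds to read size_mb at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mb_per_s


# Hypothetical input size (1 GiB), used only for illustration.
INPUT_MB = 1024.0

# Throughput figures quoted in the text above (MB/sec).
MAPPER_LOW = 5.0
MAPPER_HIGH = 7.0
S3_CP = 200.0

slow = read_seconds(INPUT_MB, MAPPER_LOW)
fast = read_seconds(INPUT_MB, MAPPER_HIGH)
cli = read_seconds(INPUT_MB, S3_CP)

print(f"mapper at {MAPPER_LOW} MB/s: {slow:.1f} s")
print(f"mapper at {MAPPER_HIGH} MB/s: {fast:.1f} s")
print(f"aws s3 cp at {S3_CP} MB/s: {cli:.1f} s")
print(f"slowdown vs. aws s3 cp: {fast / cli:.1f}x to {slow / cli:.1f}x")
```

Under these assumptions, the mapper spends roughly 29 to 40 times longer on the read than aws s3 cp would, which is the headroom the optimization tries to reclaim.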

S3 read optimization

If we inspect the implementation of S3AInputStream, it is easy to notice the following potential areas of improv...
