How to Optimize Your Apache Spark Application with Partitions
In Salesforce Einstein, we use Apache Spark to perform parallel computations on large sets of data in a distributed manner. In this article, we will take a deep dive into how you can optimize your Spark application with partitions.
Introduction
Today, we often need to process terabytes of data per day to reach conclusions. To do this in an acceptable timeframe, we need to perform certain computations on that data in parallel.
Parallelizing on one machine will always have limits, no matter how big the machine is. Our true degree of parallelism is determined by the number of cores on the machine (128 is the maximum on a single machine today). So instead, we want to be able to distribute the load across multiple machines. That way, we can use a fleet of commodity hardware and reach an "infinite" parallelism factor (we can always add new machines).
Spark is a distributed processing system that helps us spread the load over multiple machines, without the overhead of syncing them and handling errors for each one. But the thing is, it is not Spark's job to decide how best to distribute that load. It ships with default configurations to help us get started, but those are not enough: as we will see in our example later on, relying on the defaults without tuning our application can leave a 70% performance gap. Our job is to tell Spark exactly how we want to distribute the load on our dataset. To do so, we must learn and understand the concept of partitions.
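To make this concrete, here is a minimal, self-contained Scala sketch showing where those defaults come from and how to override them explicitly. The object name, the local master setting, and the partition count of 64 are illustrative assumptions, not values from this article:

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[*]") // run locally, one worker thread per core
      .getOrCreate()

    // Spark picks a default number of partitions for this dataset.
    val df = spark.range(0L, 1000000L)
    println(s"Default partitions: ${df.rdd.getNumPartitions}")

    // Shuffle operations fall back to spark.sql.shuffle.partitions,
    // which defaults to 200 regardless of the size of our data.
    println(s"Shuffle partitions: ${spark.conf.get("spark.sql.shuffle.partitions")}")

    // Telling Spark explicitly how to distribute the load:
    val repartitioned = df.repartition(64) // 64 is an illustrative choice
    println(s"After repartition: ${repartitioned.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The point of the sketch is simply that the partition counts Spark chooses on its own are generic defaults; tuning them to the dataset at hand is what the rest of this article is about.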
What is a partition?
Let’s start off with an example. Say we hav...