Uber’s Highly Scalable and Distributed Shuffle Service
Uber is a data-driven company that relies heavily on offline and online analytics for decision-making. As Uber’s data grows exponentially every year, it’s crucial to process this data efficiently and at minimum cost. Over the years, Apache Spark™ has become the primary compute engine at Uber to satisfy such data needs. With its unique features, Spark powers many business-critical use cases at Uber, including Uber rides, Uber Eats, autonomous vehicles, ETAs, Maps, and many more. Spark is the primary engine for data warehousing, data science, and AI/ML. In the last few years, Uber’s Spark usage has grown exponentially year over year, and Spark now runs on more than 10,000 nodes in production. Spark jobs account for more than 95% of analytics cluster compute resources and process hundreds of petabytes of data every day.
Although Apache Spark has many benefits that contribute to its popularity at Uber and in the industry, we’ve still experienced several challenges in operating Spark at our scale. In the current architecture, a Spark job consists of multiple stages, and shuffle is the well-known mechanism for transferring data between two stages. As outlined in our Spark + AI Summit 2020 talk, shuffle is currently done locally on each machine, which poses reliability and stability issues, along with other challenges such as hardware stability, compute resource management, and user productivity. This blog post focuses on Spark shuffle scalability challenges. We propose a new Remote Shuffle Service, codenamed RSS, which will move the shuffle from local to remote machines. RSS will redirect all local disk writes to a remote shuffle cluster, allowing us to be more efficient in computing data in base machines and remo...