Presto® on Apache Kafka® At Uber Scale

Uber’s goal is to ignite opportunity by setting the world in motion, and big data is a very important part of that. Presto® and Apache Kafka® play critical roles in Uber’s big data stack. Presto is the de facto standard for query federation and is used for interactive queries, near-real-time data analysis, and large-scale data analysis. Kafka is the backbone for data streaming and supports many use cases, such as pub/sub and stream processing. In this article, we discuss how we connected these two important services to enable lightweight, interactive SQL queries directly over Kafka via Presto at Uber scale.

Figure 1: Big Data Stack At Uber

Presto at Uber

Uber uses open source Presto to query nearly every data source, both in motion and at rest. Presto’s versatility empowers us to make smart, data-driven business decisions. We operate around 15 Presto clusters spanning more than 5,000 nodes. We have around 7,000 weekly active users running approximately 500,000 queries daily, which read around 50 PB from HDFS. Today, Presto is used to query a variety of data sources, such as Apache Hive™, Apache Pinot™, AresDB, MySQL, Elasticsearch, and Apache Kafka, through its extensible data source connectors. You can find more information about Presto in some of our previous blog posts:

Engineering Data Analytics with Presto and Apache Parquet at Uber
Building a Better Big Data Architecture: Meet Uber’s Presto Team
