Preon: Presto Query Analysis for Intelligent and Efficient Analytics
Presto™ is an open source SQL query engine used at large scale at Uber. Uber runs 20+ Presto clusters comprising over 12,000 hosts. We have about 7,000 weekly users and run about half a million queries per day. Presto serves many use cases at Uber, including ad hoc interactive analytics, ETL and batch workloads, dashboarding, data quality checks, report generation, experimentation, and data-driven services. At this scale, there are many opportunities to make the system more efficient. Acting on those opportunities, however, requires insight into the queries the system is processing.
Presto is a query engine and provides an SQL interface for running queries, but in many cases we also need to analyze the queries themselves to get specific, actionable insights. Some examples are:
- To analyze the predicates used to query tables – This can be used to reformat (sort, bucket, partition) the tables on those columns, so that less data is read from the backend during query execution and queries run faster and more efficiently.
- To ascertain the tables and columns being read or written in a query – This can be used to route queries to specific clusters based on the availability of those tables in specific regions of the Uber data lake, or to perform permission checks.
- To determine the type of a query – For example, identifying a query as DDL or DML so that ETL queries can be routed to certain clusters.
- To ascertain what is the most recent modification time of any of the datasets being queried – This can be used to ascertain if results from a previous run of the query are...