Shopify's Path to Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling

Data scientists at Shopify expect fast results when querying large datasets across multiple data sources. We use Trino (a distributed SQL query engine) to provide quick access to our data lake, and we’ve recently invested in speeding up our query execution time.

On top of handling over 500 Gbps of data, we strive to deliver p95 query results in five seconds or less. To achieve this, we’re constantly tuning our infrastructure. But with each change comes a risk to our system. A disruptive change could stall the work of our data scientists and interrupt our engineers on call.
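
As a rough illustration of what a "p95 in five seconds or less" target means, the sketch below computes a 95th-percentile latency from a list of query durations and checks it against the target. The nearest-rank method, the function names, and the sample values are all assumptions made for illustration; the article does not describe how Shopify actually computes this metric.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.

    pct is expected to be an integer in the range 1..100.
    """
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(latencies_s, target_s=5.0, pct=95):
    """Return True when the pct-th percentile latency is within target_s seconds."""
    return percentile(latencies_s, pct) <= target_s


# Hypothetical query durations in seconds, not real measurements.
sample_latencies_s = [0.8, 1.2, 1.5, 2.0, 2.3, 2.9, 3.1, 3.4, 4.2, 12.0]

if __name__ == "__main__":
    print("p95 (s):", percentile(sample_latencies_s, 95))
    print("meets target:", meets_latency_target(sample_latencies_s))
```

With only ten samples, a single slow query is enough to push the p95 over the target, which is one reason a percentile target like this is sensitive to disruptive changes.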

That’s why Shopify’s Data Reliability team built custom verification, benchmarking, and profiling tooling for testing and analyzing Trino. Our tooling is designed to minimize the risk of various changes at scale. 

Below we’ll walk you through how we developed our tooling. We’ll share simple concepts to use in your own Trino deployment or any other complex system involving frequent iterations.

The Problem

A diagram showing the Trino upgrade tasks over time: Merge update to trino, Deploy candidate cluster, Run through Trino upgrade checklist, and Promote candidate to Prod. The steps include two places to Roll back.

Trino Upgrade Tasks Over Time

As Shopify grows, so does our data and our team of data scientists. To handle the increasing volume, we’ve scaled our Trino cluster to hundreds of nodes and tens of thousands of virtual CPUs.

Managing our cluster gives rise to two main concerns:

  1. Optimizations: We typically have several experiments on the go for optimizing some aspect of our configuration and infrastructure.
  2. Software updates: We must keep up to date with new Trino features and security patches.

Both of ...
