Interactive Querying with Apache Spark SQL at Pinterest
Sanchay Javeria | Software Engineer, Big Data Query Platform, Data Engineering
Ashish Singh | Technical Lead, Big Data Query Platform, Data Engineering
To achieve our mission of bringing everyone inspiration through our visual discovery engine, Pinterest relies heavily on data-driven decisions to improve the Pinner experience for over 475 million monthly active users. Reliable, fast, and scalable interactive querying is essential to making those decisions possible. In the past, we described how Presto at Pinterest serves this function. Here, we’ll share how we built a scalable, reliable, and efficient interactive querying platform that processes hundreds of petabytes of data daily with Apache Spark SQL. We walk through the architecture choices we made, the challenges we met along the way, and our solutions to them, to show how we made interactive querying with Spark SQL a success.
Scheduled vs. Interactive Querying
Querying is the most popular way for users to derive understanding from data at Pinterest. The applications of such analysis exist in all business/engineering functions like Machine Learning, Ads, Search, Home Feed Recommendations, Trust & Safety, and so on. There are primarily two ways to submit these queries: scheduled and interactive.
- Scheduled Queries are queries that run on a pre-defined cadence. These queries usually have strict Service Level Objectives (SLOs).
- Interactive Queries are queries that are executed when needed and are usually not repeated on a pre-defined ...