How we built Text-to-SQL at Pinterest
Adam Obeng | Data Scientist, Data Platform Science; J.C. Zhong | Tech Lead, Analytics Platform; Charlie Gu | Sr. Manager, Engineering
Writing queries to solve analytical problems is the core task for Pinterest’s data users. However, finding the right data and translating an analytical problem into correct and efficient SQL code can be challenging tasks in a fast-paced environment with significant amounts of data spread across different domains.
We took the rise in availability of Large Language Models (LLMs) as an opportunity to explore whether we could assist our data users with this task by developing a Text-to-SQL feature which transforms these analytical questions directly into code.
How Text-to-SQL works at Pinterest
Most data analysis at Pinterest happens through Querybook, our in-house open-source big data SQL query tool. This tool is the natural place for us to develop and deploy new features to assist our data users, including Text-to-SQL.
Implementing Text-to-SQL
The first version incorporated a straightforward Text-to-SQL solution utilizing an LLM. Let’s take a closer look at its architecture:
- The user asks an analytical question, choosing the tables to be used.
- The relevant table schemas are retrieved from the table metadata store.
- The question, selected SQL dialect, and table schemas are compiled into a Text-to-SQL prompt.
- The prompt is fed into the LLM.
- A streaming response is generated and displayed to the user.
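The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Querybook's actual implementation: the in-memory `METADATA_STORE`, the `TableSchema` type, the prompt wording, the example tables, and the `fake_llm_stream` stand-in for a real streaming LLM client are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class TableSchema:
    name: str
    columns: Dict[str, str]  # column name -> column type


# Hypothetical in-memory stand-in for the table metadata store.
METADATA_STORE: Dict[str, TableSchema] = {
    "pins": TableSchema("pins", {"pin_id": "BIGINT", "created_at": "TIMESTAMP"}),
    "users": TableSchema("users", {"user_id": "BIGINT", "country": "VARCHAR"}),
}


def retrieve_schemas(table_names: List[str]) -> List[TableSchema]:
    """Step 2: look up the schemas of the tables the user selected."""
    return [METADATA_STORE[name] for name in table_names]


def build_prompt(question: str, dialect: str, schemas: List[TableSchema]) -> str:
    """Step 3: compile question, SQL dialect, and schemas into one prompt."""
    schema_text = "\n".join(
        f"Table {s.name}: " + ", ".join(f"{c} {t}" for c, t in s.columns.items())
        for s in schemas
    )
    return (
        f"Write a {dialect} SQL query that answers the question.\n"
        f"Schemas:\n{schema_text}\n"
        f"Question: {question}\n"
    )


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Placeholder for a streaming LLM client; yields a canned answer in chunks."""
    for chunk in ["SELECT ", "COUNT(*) ", "FROM ", "pins"]:
        yield chunk


def text_to_sql(
    question: str,
    table_names: List[str],
    dialect: str,
    llm: Callable[[str], Iterator[str]] = fake_llm_stream,
) -> Iterator[str]:
    """Steps 1-5: question and tables in, streamed SQL chunks out."""
    schemas = retrieve_schemas(table_names)
    prompt = build_prompt(question, dialect, schemas)
    yield from llm(prompt)  # Steps 4-5: feed the prompt in, stream the response


if __name__ == "__main__":
    # Display each chunk as soon as it arrives, as a UI would.
    for chunk in text_to_sql("How many pins exist?", ["pins"], "Presto"):
        print(chunk, end="", flush=True)
    print()
```

Returning a generator rather than a finished string mirrors the streaming behavior described in the last step: the caller can render partial SQL while the model is still producing the rest.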
Table Schema
The table schema acquired from the metadata store includes: