Practical text-to-SQL for data analytics

Artificial intelligence

Co-authors: Albert Chen, Manas Bundele, Gaurav Ahlawat, Patrick Stetz, Zhitao (James) W., Qiang Fei, Donghoon (Don) Jung, Audrey Chu, Bharadwaj Jayaraman, Ayushi Panth, Yatin Arora, Sourav Jain, Renjith Varma, Alex Ilin, Iuliia Melnychuk, Chelsea C., Joyan Sil, and Xiaofeng Wang

In most tech companies, data experts spend a significant amount of their time helping colleagues find data they need – time that could be spent on complex analysis and strategic initiatives. This bottleneck not only frustrates data teams but also creates delays for business partners waiting for crucial insights.

Generative AI presents an opportunity to improve this workflow. As part of our data democratization efforts at LinkedIn, we've developed SQL Bot, an AI-powered assistant integrated within our DARWIN data science platform. This internal tool transforms natural language questions into SQL: it finds the right tables, writes queries, fixes errors, and enables employees across functions to independently access the data insights they need under the appropriate permissions. Behind the scenes, SQL Bot is a multi-agent system built on top of LangChain and LangGraph.
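
To make the multi-agent structure concrete, here is a minimal LangGraph-style sketch of how such flows could be wired together. The node names, state schema, and stubbed node bodies are illustrative assumptions rather than SQL Bot's actual implementation, and LangGraph API details vary by version.

```python
# Minimal sketch of a multi-agent text-to-SQL graph wired with LangGraph.
# Node names, the state schema, and the stub logic are illustrative only.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class BotState(TypedDict, total=False):
    question: str
    intent: str
    tables: list[str]
    query: str

def classify(state: BotState) -> BotState:
    # Intent classification (stubbed): decide which flow handles the request.
    return {"intent": "write_query"}

def find_tables(state: BotState) -> BotState:
    # Retrieval + re-ranking (stubbed).
    return {"tables": ["dim_member"]}

def write_query(state: BotState) -> BotState:
    # Plan-and-solve query writing (stubbed).
    return {"query": "SELECT COUNT(*) FROM dim_member"}

graph = StateGraph(BotState)
graph.add_node("classify", classify)
graph.add_node("find_tables", find_tables)
graph.add_node("write_query", write_query)
graph.set_entry_point("classify")
graph.add_conditional_edges(
    "classify",
    lambda s: s["intent"],
    {"write_query": "find_tables", "chat": END},
)
graph.add_edge("find_tables", "write_query")
graph.add_edge("write_query", END)

app = graph.compile()
print(app.invoke({"question": "How many members signed up yesterday?"}))
```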

Figure 1: Typical user interaction with SQL Bot

While creating a proof of concept for a Text-to-SQL tool is straightforward, the challenge lies in navigating complex enterprise data warehouses to identify authoritative data sources that accurately answer user questions. In this post, we share key strategies that enabled us to deploy a practical text-to-SQL solution, now utilized by hundreds of employees across LinkedIn’s diverse business verticals.

Strategy #1: Quality table metadata and personalized retrieval

Text-to-SQL is often framed as a Retrieval-Augmented Generation (RAG) application, in which context such as table schemas, example queries, and other domain knowledge is retrieved and passed to a Large Language Model (LLM) to answer the question.

We use Embedding-Based Retrieval (EBR) to retrieve context semantically relevant to the user’s question. One challenge in retrieving tables and fields is the frequent absence or incompleteness of descriptions. To address this, we initiated a dataset certification effort to collect comprehensive descriptions for hundreds of important tables. Domain experts identified key tables within their areas and provided mandatory table descriptions and optional field descriptions. These descriptions were augmented with AI-generated annotations based on existing documentation and Slack discussions, further enhancing our ability to retrieve the right tables and use them properly in queries.
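
For illustration, embedding-based retrieval over table metadata can be as simple as a cosine-similarity search over embedded descriptions. The sketch below assumes a hypothetical embed() call and an in-memory index; the post does not describe SQL Bot's actual embedding model or vector store.

```python
# Minimal sketch of embedding-based retrieval (EBR) over table metadata.
# `embed` is a hypothetical stand-in for whatever embedding model backs the vector store.
from dataclasses import dataclass
import numpy as np

@dataclass
class TableDoc:
    name: str
    description: str        # curated or AI-generated description
    embedding: np.ndarray

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; replace with your embedding model client."""
    raise NotImplementedError

def retrieve_tables(question: str, index: list[TableDoc], k: int = 20) -> list[TableDoc]:
    q = embed(question)
    scores = [
        float(np.dot(q, t.embedding) / (np.linalg.norm(q) * np.linalg.norm(t.embedding)))
        for t in index
    ]
    ranked = sorted(zip(scores, index), key=lambda p: p[0], reverse=True)
    return [t for _, t in ranked[:k]]
```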

Another challenge is the sheer volume of tables—at LinkedIn, it’s in the millions—and the implicit context embedded in user questions. We can quickly narrow down the volume of tables to a few thousand by looking at access popularity. However, addressing the implicit context is more subtle. For instance, the question "What was the average CTR yesterday?" should be answered differently depending on whether the employee is interested in email notifications, ads or search quality. To address this, we infer the default datasets for a user based on the organizational chart. We also apply Independent Component Analysis (ICA) across user-dataset access history to develop components (sets of datasets) that correspond to different business use cases. Results are personalized by using the top components relevant to each user. Users are able to change the default filter values if needed.
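
For illustration, the ICA step could look roughly like the sketch below, using scikit-learn's FastICA on a user-by-dataset access matrix. The matrix construction, component count, and cutoffs shown are assumptions rather than SQL Bot's actual values.

```python
# Sketch: derive "components" (sets of datasets that co-occur in access patterns)
# from a user x dataset access matrix using Independent Component Analysis.
# The matrix construction and thresholds here are illustrative assumptions.
import numpy as np
from sklearn.decomposition import FastICA

# access[u, d] = how often user u queried dataset d (placeholder data, log-scaled counts)
access = np.log1p(np.random.poisson(0.3, size=(500, 2000)))

ica = FastICA(n_components=50, random_state=0)
user_weights = ica.fit_transform(access)   # (users x components)
components = ica.components_               # (components x datasets)

def datasets_for_component(c: int, top_n: int = 25) -> np.ndarray:
    """Datasets with the largest absolute loading in component c."""
    return np.argsort(-np.abs(components[c]))[:top_n]

def default_datasets_for_user(u: int, top_components: int = 3) -> set[int]:
    """Personalized defaults: union of datasets from the user's strongest components."""
    top = np.argsort(-np.abs(user_weights[u]))[:top_components]
    return {int(d) for c in top for d in datasets_for_component(c)}
```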

In practice, tables and fields are deprecated or added over time. It is typical for a table to be used for a few years before it is replaced by another table with improved performance, schema, and/or logic. Thus, the source of truth for answering a question can change. To pick up new tables automatically, we ingest frequently queried tables into our vector store. DataHub, our metadata search and discovery tool, allows users to mark datasets and fields as deprecated – we use this signal to automatically offboard datasets and fields.

Strategy #2: Knowledge graph and LLMs for ranking, writing, self-correction

The output of the first strategy is a candidate list of tables, selected by filtering and EBR. In this section, we outline a few approaches that have helped us generate accurate queries.

Figure 2: Knowledge graph. Users, table clusters, tables, and fields are nodes. The nodes have attributes derived from DataHub, query logs, crowdsourced domain knowledge, etc.

First, we need a deep semantic understanding of concepts and datasets to generate accurate queries. In addition to leveraging tables and fields, we organize additional information into a knowledge graph:

  1. We use DataHub to look up table schemas, field descriptions, the top K values for categorical dimension fields, partition keys, and a classification of fields into metrics, dimensions, and attributes.
  2. We collect domain knowledge from users in SQL Bot’s UI.
  3. We use successful queries from query logs to derive aggregate information, such as table/field popularity and common table joins.
  4. We incorporate example queries from internal wikis and from notebooks in DARWIN. Because the quality of code in DARWIN can vary, we only include notebooks certified by users and those that meet a set of heuristics for recency and reliability – for instance, we prefer recently created notebooks that users have explicitly titled and that have a high number of executions.
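
The sources above could be assembled into a graph structure along the lines of the sketch below; the node and attribute names are illustrative assumptions, not LinkedIn's actual schema.

```python
# Illustrative sketch of the knowledge graph structure (users, table clusters, tables,
# and fields as nodes; attributes from DataHub, query logs, and crowdsourced notes).
import networkx as nx

g = nx.MultiDiGraph()

g.add_node("table:dim_member", kind="table", certified=True,
           description="Member dimension table", monthly_queries=1200)
g.add_node("field:dim_member.country", kind="field", field_type="dimension",
           top_values=["us", "in", "br"])
g.add_node("user:analyst_1", kind="user", org="ads")

g.add_edge("table:dim_member", "field:dim_member.country", relation="has_field")
g.add_edge("user:analyst_1", "table:dim_member", relation="queried", count=37)
# Common joins observed in query logs (the joined table node is created implicitly):
g.add_edge("table:dim_member", "table:fact_page_views",
           relation="commonly_joined_with", join_keys=["member_id"])
```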

Then, we use LLMs to filter and sort the results from EBR using the knowledge graph. After retrieving the top 20 tables via EBR, we employ an LLM re-ranker to select the top 7 tables for query writing. The inputs to table selection include table descriptions, example queries, domain knowledge, and explanations of internal jargon detected in the user’s question. We use another LLM re-ranker to select fields from the selected tables. The input to field selection includes the information used for table selection, plus the full table schemas with field descriptions, top K values, and other field attributes. Fields are ordered by access frequency over a recent time window.
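
A table re-ranking step might look like the following sketch; the prompt wording and the llm_complete helper are hypothetical stand-ins for the actual prompts and LLM client.

```python
# Sketch of LLM re-ranking: ask the model to pick the most relevant tables from the
# EBR candidates, given descriptions, example queries, and domain knowledge.
import json

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call returning the model's text response."""
    raise NotImplementedError

def rerank_tables(question: str, candidates: list[dict], k: int = 7) -> list[str]:
    context = "\n\n".join(
        f"- {c['name']}: {c['description']}\n  example query: {c.get('example_query', 'n/a')}"
        for c in candidates
    )
    prompt = (
        "You select warehouse tables for a SQL question.\n"
        f"Question: {question}\n\n"
        f"Candidate tables:\n{context}\n\n"
        f"Return a JSON list of the {k} most relevant table names, best first."
    )
    return json.loads(llm_complete(prompt))[:k]
```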

Query writing is then iterative: SQL Bot generates a plan and solves each step of the plan incrementally to build up the final query. Solutions to previous tasks are stored in the chatbot’s internal state and provided to the next step. While this method is effective for complex questions, we found it can result in overly complicated queries for simple questions, so we instruct the query planner to minimize the number of steps it creates. This condenses queries for simple questions while maintaining performance on complex ones.
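
In spirit, the plan-and-solve loop resembles the sketch below, where each step's solution is appended to state and carried into the next step; the prompts and the llm_complete stub are hypothetical.

```python
# Sketch of iterative query writing: plan, then solve each step, carrying prior
# solutions forward in state. Prompts and the LLM helper are illustrative.
def llm_complete(prompt: str) -> str:  # hypothetical LLM call (see earlier sketch)
    raise NotImplementedError

def write_query(question: str, tables: list[str]) -> str:
    plan = llm_complete(
        "Break this question into the minimum number of SQL-writing steps.\n"
        f"Question: {question}\nTables: {tables}\nReturn one step per line."
    ).splitlines()

    state: list[str] = []   # solutions to previous steps (e.g., CTE fragments)
    for step in plan:
        solution = llm_complete(
            f"Step: {step}\nPrevious partial solutions:\n" + "\n".join(state)
            + f"\nAvailable tables: {tables}\nWrite SQL for this step, reusing prior work."
        )
        state.append(solution)
    return state[-1]        # the final step builds the complete query
```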

Finally, we run a set of validators on the output followed by a self-correction agent to fix errors. Validators work best when they access new information not available to the query writer. We verify the existence of tables and fields, and execute the EXPLAIN statement on the query to detect syntax and other errors. These errors are fed into a self-correction agent, which is equipped with tools to retrieve additional tables or fields if needed before updating the query.
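
A simplified version of the validate-then-fix loop is sketched below; the extract_table_names parser, the run_explain callback, and the prompts are assumptions about how such a loop could be wired, not SQL Bot's actual code.

```python
# Sketch of validation plus self-correction. Helpers marked "hypothetical" stand in
# for a SQL parser, the query engine client, and the LLM client.
def llm_complete(prompt: str) -> str:  # hypothetical LLM call
    raise NotImplementedError

def extract_table_names(query: str) -> set[str]:  # hypothetical SQL parser helper
    raise NotImplementedError

def validate(query: str, known_tables: set[str], run_explain) -> list[str]:
    errors = []
    for table in extract_table_names(query):
        if table not in known_tables:
            errors.append(f"Unknown table: {table}")
    try:
        run_explain(f"EXPLAIN {query}")   # the engine surfaces syntax and semantic errors
    except Exception as e:
        errors.append(f"EXPLAIN failed: {e}")
    return errors

def self_correct(query: str, validate_fn, max_rounds: int = 2) -> str:
    for _ in range(max_rounds):
        errors = validate_fn(query)
        if not errors:
            return query
        query = llm_complete(
            "Fix this SQL query so it passes validation.\n"
            f"Query:\n{query}\n\nErrors:\n" + "\n".join(errors)
        )
    return query
```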

Figure 3: Modeling architecture. The user’s question is classified and delegated to the appropriate flow. Open-ended follow-up chats are handled by an agent.

Strategy #3: User experience

User experience is central to gaining adoption. Users prioritize ease of use and fast responses.

We integrated SQL Bot directly into DARWIN, allowing users to access it within the same browser window as they write and execute their queries. This integration increased adoption by 5-10x compared to our prototype chatbot application launched as a standalone app. To aid discovery, DARWIN has entry points for SQL Bot in the sidebar, as well as a “Fix with AI” button that appears whenever a query execution fails. Chat history is saved so users can continue previous conversations. They can also submit in-product feedback or add custom instructions to personalize the bot’s behavior.

Our initial prototype answered every question with a SQL query, but users actually wanted to find tables, ask questions about the datasets, see reference queries, or ask general questions about query syntax. We now use intent classification to classify the question and decide how to respond.
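
A minimal intent classifier for this routing could look like the sketch below; the label set mirrors the use cases above, while the prompt and the llm_complete helper are hypothetical.

```python
# Sketch of intent classification. Labels mirror the use cases described above;
# the prompt and LLM helper are illustrative.
INTENTS = ["write_query", "find_tables", "explain_dataset", "reference_queries", "syntax_help"]

def llm_complete(prompt: str) -> str:  # hypothetical LLM call
    raise NotImplementedError

def classify_intent(question: str) -> str:
    label = llm_complete(
        "Classify the request into exactly one of: " + ", ".join(INTENTS)
        + f"\nRequest: {question}\nAnswer with the label only."
    ).strip()
    return label if label in INTENTS else "write_query"  # fall back to query writing
```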

It’s essential for a chatbot to be conversational so that users can ask follow-up questions. We provide “quick replies” such as “update query,” “update table selections,” and “explain these tables” to guide users on the types of follow-ups they could try. Additionally, users have the option to enable a guided experience, where SQL Bot walks them through each step of the query writing process—finding tables, and solving each step in the query. The user can interact at each step to provide feedback on the table or queries. This helps users understand the query and gives them more control over the query-writing process.

Figure 4: Rich display elements in the UI help users understand recommended tables.

To help users understand the bot’s output, we have incorporated rich display elements for tables and queries. The table element shows the retrieved tables, their descriptions, tags indicating whether the dataset is “certified” or “popular”, average monthly access frequency, commonly joined tables, and a link to DataHub for more information. In the guided experience, users may use these checkboxes to select the tables they want the bot to use. The query element displays the formatted query, explanation, and validation checks on whether the tables exist, fields exist, and the syntax is correct. This helps users understand the output and identify any issues that need fixing. They can ask the bot to make updates.

Figure 5: Query output includes validation checks, explanation, and tables.

Each dataset at LinkedIn has its own access control list which permits dataset read access to only specific users or groups. To prevent issues where a user runs a query and is denied access, we check if the user is a member of a group with the appropriate access and if so, we automatically provide the code necessary to leverage the group's credentials. This reduces frustration for the user, especially for those new to SQL at LinkedIn.
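
A simplified version of that check is sketched below; the ACL representation and the emitted hint are assumptions, since the exact mechanism for using group credentials depends on the query engine setup.

```python
# Sketch of the access check: if the user belongs to a group on the dataset's ACL,
# surface that group so the query can run with its credentials.
# The ACL lookup and the emitted snippet are illustrative assumptions.
def group_for_access(user_groups: set[str], dataset_acl: set[str]) -> str | None:
    allowed = user_groups & dataset_acl
    return sorted(allowed)[0] if allowed else None

user_groups = {"ads-analytics", "all-employees"}   # placeholder membership
dataset_acl = {"ads-analytics", "ads-eng"}         # groups permitted to read the dataset

group = group_for_access(user_groups, dataset_acl)
if group:
    # Illustrative only: the exact mechanism (session role, headers, etc.) is engine-specific.
    print(f"-- run this query using the '{group}' group's credentials")
```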

Strategy #4: Options for user customization

We want users to have the ability to improve SQL Bot’s performance without making requests to the platform team. To this end, we provide three levers that allow users to customize the experience for their product area:

  1. Dataset customization: Users can define the datasets for a product area by providing email group(s) or by explicitly specifying the users and datasets to use. The product area’s datasets are those commonly used by the group of users in that area, with the option to include or exclude additional datasets as specified. Users can select product areas to use through the UI.
  2. Custom instructions: Users can provide custom textual instructions to SQL Bot directly in DARWIN. The instructions can either enrich the overall domain knowledge of SQL Bot or provide guidelines for SQL Bot to behave in a specific manner to match user preferences. The user-supplied instructions are used when selecting tables and fields, and when writing and fixing queries.
  3. Example queries: Users can create example queries to be indexed into our vector store. These can be added directly in DARWIN by creating a notebook and tagging it as “certified.”
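
For illustration, the three levers above might be captured in a configuration like the sketch below; the field names and values are hypothetical, not SQL Bot's actual schema.

```python
# Illustrative example of the three customization levers as they might be configured.
product_area_config = {
    "name": "ads-analytics",
    "member_groups": ["ads-analytics@company.example"],   # datasets inferred from this group's usage
    "include_datasets": ["u_ads.daily_ad_metrics"],        # hypothetical dataset names
    "exclude_datasets": ["u_ads.legacy_clicks"],
    "custom_instructions": (
        "Prefer certified tables. Report dates in UTC. "
        "CTR means clicks divided by impressions."
    ),
    "example_query_notebooks": ["DARWIN notebook: 'Certified: weekly CTR by campaign'"],
}
```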

Strategy #5: Ongoing benchmarking

There are many hyperparameters for the bot, such as what text to embed, what context to pass to each LLM, how to represent the context and meta prompts, how to manage agent memory, how many items to retrieve, how many times to run self-correction, and which steps to include in the model architecture. Therefore, it is crucial to develop a benchmark set to assess both quality and performance.
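
Gathered together, those knobs amount to something like the configuration sketch below; the defaults shown are placeholders, apart from the 20-candidate and 7-table retrieval numbers mentioned earlier.

```python
# Illustrative config object for the tunable knobs described above (values are placeholders).
from dataclasses import dataclass

@dataclass
class SqlBotConfig:
    tables_to_retrieve: int = 20          # EBR candidates
    tables_after_rerank: int = 7          # passed to the query writer
    fields_per_table: int = 30
    max_self_correction_rounds: int = 2
    include_example_queries: bool = True
    include_domain_knowledge: bool = True
    planner_minimizes_steps: bool = True
```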

A benchmark should preferably be tailored to the specific application, as text-to-SQL requirements can vary widely depending on factors like the target user, the number of datasets, the clarity of table and column names, the complexity of the desired queries, the SQL dialect, target response time, and the degree of specialized domain knowledge required. We collaborated with domain experts across 10 product areas to define a set of over 130 benchmark questions. Each benchmark item pairs a well-formulated question with ground-truth answers.

Our evaluation metrics include recall of tables and fields compared to the ground truth, table/field hallucination rate, syntax correctness, and response latency. These metrics are easy to compute, and we focused on them during the first phase of development while we worked on finding the right tables and fields and avoiding obvious query issues.
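
The first two of those metrics are straightforward to compute; a sketch (with hypothetical table names) follows.

```python
# Sketch of the retrieval metrics: table recall against ground truth and
# table hallucination rate (tables referenced that don't exist in the warehouse).
def table_recall(predicted: set[str], ground_truth: set[str]) -> float:
    return len(predicted & ground_truth) / len(ground_truth) if ground_truth else 1.0

def hallucination_rate(referenced: set[str], known_tables: set[str]) -> float:
    if not referenced:
        return 0.0
    return len(referenced - known_tables) / len(referenced)

# Example: 2 of 3 ground-truth tables retrieved -> recall ~0.67
print(table_recall({"dim_member", "fact_page_views"},
                   {"dim_member", "fact_page_views", "dim_date"}))
```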

For example, this chart shows the increase in table recall from adding re-rankers, descriptions, and example queries:

Figure 6: Re-rankers, descriptions, and example queries help SQL Bot identify the correct tables.

However, those metrics are not sufficient to assess query accuracy. For that, we use a combination of human evaluation and LLM-as-a-judge to evaluate responses given the question, the table schemas, and the ground truths. The rubric includes overall score and dimensions on correctness in terms of tables, columns, joins, filters, aggregations, etc. as well as the quality of the answer in terms of efficiency and complexity. This approach was more practical for us than running SQL queries and comparing outputs because it does not require data access, allows us to assess how close the query is to being correct, and gives deeper insights on how the model can be improved.
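
An LLM-as-a-judge call along those lines is sketched below; the score scale, prompt, and llm_complete helper are assumptions rather than our actual rubric implementation.

```python
# Sketch of LLM-as-a-judge scoring. The rubric dimensions follow the ones listed above;
# the prompt, score scale, and LLM helper are illustrative.
import json

def llm_complete(prompt: str) -> str:  # hypothetical LLM call
    raise NotImplementedError

def judge(question: str, schemas: str, ground_truth_sql: str, candidate_sql: str) -> dict:
    prompt = (
        "Score the candidate SQL against the ground truth on a 1-5 scale for: "
        "overall, tables, columns, joins, filters, aggregations, efficiency.\n"
        f"Question: {question}\nTable schemas:\n{schemas}\n"
        f"Ground truth:\n{ground_truth_sql}\nCandidate:\n{candidate_sql}\n"
        "Return a JSON object with one numeric score per dimension and a short rationale."
    )
    return json.loads(llm_complete(prompt))
```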

We discovered early on that there can be multiple ways to answer a question. About 60% of our benchmark questions now have multiple accepted answers. Without these additional answers, we would have underreported recall by 10-15%. Every three months, we use expert human review to add accepted answers to our benchmark. LLM-as-a-judge facilitates this process: we’ve found that it returns a score within 1 point of the human score 75% of the time, and larger disagreements often indicate that a correct answer is missing from our ground-truth set. We ask experts to review these cases and update the benchmark if needed.

Conclusion

We have been building SQL Bot for over a year with a virtual team that has domain expertise in our priority product areas. Our early pilots gathered a lot of interest from users, and we’ve seen sustained adoption in the months following integration into DARWIN. In a recent survey, ~95% of respondents rated SQL Bot’s query accuracy “Passes” or above, and ~40% rated it “Very Good” or “Excellent”.

Looking ahead, there are opportunities to improve the user experience, for example through faster response time, in-line query revisions, exposing the context that SQL Bot used to answer the question, and learning from user interactions over time. Additionally, improving semantic accuracy could be facilitated by identifying champions to lead self-serve context curation efforts within their respective areas.

One takeaway is that the “Fix with AI” feature was easy to develop but has very high usage—accounting for 80% of our sessions—frequently saving users time in debugging their queries. Identifying high-ROI pain points like this is a good place to start the text-to-SQL journey.

Acknowledgements

Thanks to all contributors to the project across Engineering / Product, including Michael Cheng, Clarisse Rahbar, Sparsh Agarwal, Manohar Mohan Rao, Paul Lee, Vishal Chandawarkar.

Thanks to Trino experts for brainstorming ideas, writing a query plan parser, and reviewing this blog post: Erik Krogen, Slim Bouguerra.

Thanks to Data Science experts for curating the benchmark dataset and evaluating query accuracy: Kavi Tan, Andrew Jabara, Noora Wu, Ruoyun Guo, Steve Na, Ashish Tripathy, Michael Kosk, Lingjun Chen, Cole Silva, Feiran Ji, Janet Luo, Franklin Marsh, Mengyao Yang, Tao Lin, Huanqi Zhu, Paul Matsiras, Andrew Kirk.

Thanks to Engineering partners for providing APIs for knowledge graph construction: Shailesh Jannu, Na Zhang, Leo Sun, Alex Bachuk, Steve Calvert.

Thanks to Leadership for supporting this project: Ya Xu, Zheng Li, Jia Ding, Kuo-Ning Huang, Harikumar Velayutham, Shishir Sathe, Justin Dyer.
