ChatBI: Towards Natural Language to Complex Business Intelligence SQL
Jinqing Lian★, Xinyi Liu§, Yingxia Shao★∗, Yang Dong§, Ming Wang§, Wei Zhang§, Tianqi Wan§, Ming Dong§∗, Hailin Yan§
★Beijing University of Posts and Telecommunications
§Baidu Inc.
{jinqinglian,shaoyx}@bupt.edu.cn, {liuxinyi02,dongyang06,wangming15,zhangwei143,wantianqi,dongming03,yanhailin}@baidu.com
∗Ming Dong and Yingxia Shao are the corresponding authors.
ABSTRACT
The Natural Language to SQL (NL2SQL) technology provides non-
expert users who are unfamiliar with databases the opportunity to
use SQL for data analysis. Converting Natural Language to Business
Intelligence (NL2BI) is a popular practical scenario for NL2SQL in
actual production systems. Compared to NL2SQL, NL2BI introduces
more challenges.
In this paper, we propose ChatBI, a comprehensive and efficient
technology for solving the NL2BI task. First, we analyze the in-
teraction mode, an important module where NL2SQL and NL2BI
differ in use, and design a smaller and cheaper model to match this
interaction mode. In BI scenarios, tables contain a huge number of
columns, making it impossible for existing NL2SQL methods that
rely on Large Language Models (LLMs) for schema linking to pro-
ceed due to token limitations. The higher proportion of ambiguous
columns in BI scenarios also makes schema linking difficult. ChatBI
combines existing view technology in the database community to
first decompose the schema linking problem into a Single View
Selection problem and then uses a smaller and cheaper machine
learning model to select the single view with a significantly re-
duced number of columns. The columns of this single view are then
passed as the required columns for schema linking into the LLM.
Finally, ChatBI proposes a phased process flow different from exist-
ing process flows, which allows ChatBI to generate SQL containing
complex semantics and comparison relations more accurately.
We have deployed ChatBI on Baidu’s data platform and inte-
grated it into multiple product lines for large-scale production
task evaluation. The obtained results highlight its superiority in
practicality, versatility, and efficiency. At the same time, compared
with the current mainstream NL2SQL technology under our real BI
scenario data tables and queries, it also achieved the best results.

1 INTRODUCTION
The rapid development of LLMs has attracted the attention of re-
searchers in both academia and industry, with researchers from
the Natural Language Processing (NLP) and database communities
hoping to use LLMs to solve the NL2SQL task. The NL2SQL task
involves converting user Natural Language into SQL that can be ex-
ecuted by a database, with the SQL semantics being consistent with
the user’s intent. Today, thousands of organizations, such as Google,
Microsoft, Amazon, Meta, Oracle, Snowflake, Databricks, Baidu,
and Alibaba, use Business Intelligence (BI) for decision support.
Consequently, researchers representing the industry have quickly
focused on the most attention-grabbing scenario of the NL2SQL
task in actual production systems: the NL2BI task, which involves
converting Natural Language into BI queries. Through
NL2BI technology, many non-expert users who do not understand
databases, such as product managers or operations personnel, can
perform data analysis, thereby aiding decision-making.
Figure 1 reflects the difference between the NL2BI and NL2SQL
tasks. In actual production systems, it is humans who call the NL2BI
tool, and the most significant characteristic of humans using the
tool is interactivity, i.e., Multi-Round Dialogue (MRD) scenarios.
Consider, for example, the three-round dialogue scenario shown in Figure 1.
Query1 indicates “the user's query for the column of short video
playback volume over the past seven days”. Both NL2SQL and NL2BI
technologies can recognize the user’s intent and generate the cor-
responding executable correct SQL. In Query2, the user asks
a common question in BI scenarios, “the Week-on-Week com-
parison”, which corresponds to complex semantics and comparison
relationships. The presence of complex semantics and comparison
relationships makes the problem of generating SQL difficult, and
existing NL2SQL technology cannot generate the SQL required for
BI well.
Existing LLMs-based NL2SQL technology only matches Single-
Round Dialogue (SRD) queries, and when there is no prompt to
show that the next question is a MRD, the LLM cannot recognize
the intent of the MRD well. Therefore, for Query3, which changes the
queried column, the existing NL2SQL method cannot recognize
that the user’s intent is to query “the playback duration of short
videos over the past seven days and the Week-on-Week comparison”.
Therefore, the existing NL2SQL method still cannot generate the
SQL corresponding to the intent of Query3. In addition to the
different interaction modes brought by actual scenarios, data tables
under BI scenarios often contain hundreds of columns, most of
[Figure 1, condensed. Left: the NL2SQL setting, where video.t6628 comes from tables (event_day "Date", sv_vv "Short Video Views", playtime "Play time (seconds)", and 20+ further columns). NL2SQL answers Query1 with:

SELECT DATE(event_day) AS `xc_1`, sum(sv_vv) AS `mt_1`
FROM video.t6628
WHERE event_day >= '2024-04-04'
  AND event_day <= '2024-04-10'
GROUP BY DATE(event_day)
ORDER BY `xc_1` ASC, `mt_1` DESC
LIMIT 10000;

but yields ERROR SQL for Query2 (complex comparisons) and Query3 (Multi-Round Dialogue is not supported). Right: the NL2BI setting, where video.t6628 comes from views (200+ columns). NL2BI handles all three rounds; for Query3 it generates:

SELECT table_origin.`xc_1` AS `xc_1`, table_origin.`mt_1` AS `mt_1`,
  table_origin.`mt_2` AS `mt_2`,
  (table_origin.`mt_1` - table_1Week.`mt_1`) / table_1Week.`mt_1` AS `mt_3`,
  (table_origin.`mt_2` - table_1Week.`mt_2`) / table_1Week.`mt_2` AS `mt_4`
FROM
  (SELECT DATE(event_day) AS `xc_1`, sum(sv_vv) AS `mt_1`,
          sum(playtime)/60 AS `mt_2`, 1 AS a
   FROM video.t6628
   WHERE event_day >= '2024-04-04' AND event_day <= '2024-04-10'
   GROUP BY DATE(event_day)) AS table_origin
LEFT OUTER JOIN
  (SELECT DATE(DATE_ADD(DATE(event_day), INTERVAL 1 WEEK)) AS `xc_1`,
          sum(sv_vv) AS `mt_1`, sum(playtime)/60 AS `mt_2`
   FROM video.t6628
   WHERE event_day >= '2024-03-28' AND event_day <= '2024-04-03'
   GROUP BY DATE(DATE_ADD(DATE(event_day), INTERVAL 1 WEEK)))
  AS `table_1Week`
ON table_origin.`xc_1` = table_1Week.`xc_1`
ORDER BY `xc_1` ASC, `mt_1` DESC, `mt_2` DESC
LIMIT 10000; ]
Figure 1: The difference between NL2SQL and NL2BI problems. The question consists of multiple rounds of dialogue, Query1:
“The short video playback volume for the past seven days”, Query2: “What about the week-on-week comparison?”, Query3:
“What about the playback duration?”. The NL2BI problem requires the ability to handle complex semantics, comparisons and
calculation relationships, as well as multi-round dialogue. And the data sources for the two problems are also different.
which exist in views, while tables in NL2SQL usually contain dozens
of columns, most of which come from real physical tables [13].
Researchers have not put much effort into NL2BI, but they have
invested considerable effort in NL2SQL. Existing methods can be di-
vided into three categories. The first category is pre-trained and Su-
pervised Fine-Tuning (SFT) methods [9–11, 15, 17–19, 27, 29, 31, 32],
which fine-tune an “encoder-decoder” model and use a sequence-to-
sequence pattern to convert Natural Language into SQL. This was
the mainstream approach before the advent of LLMs. The second
category is prompt engineering based LLMs. Inspired by chain-
of-thought [30] and least-to-most [34], more and more research
focuses on designing effective prompt engineering to make LLMs
perform better in various subtasks. For example, in the NL2SQL sce-
nario, DIN-SQL [24], C3 [12], and SQL-PaLM [25] provide prompt
engineering to GPT-4 [6], GPT-3.5 [6], and PaLM [23], respectively,
to greatly enhance the accuracy of generating SQL from Natu-
ral Language. The third category is LLMs specifically trained for
NL2SQL, which use domain knowledge in NL2SQL to train special-
ized LLMs. However, when directly applying these methods to real
analysis tasks in the BI scenario, we encountered some problems
from the following perspectives:
C1: Limited interaction modes. The MRD interaction scenario
shown in Figure 1 considers the interactivity of users using the
NL2BI method, while the vast majority of existing NL2SQL methods,
due to the evaluation of the open-source leaderboards Spider [33]
Table 1: Examples of ambiguous columns in the BIRD dataset.

db_id              column_name   description
ice_hockey_draft   sum_7yr_GP    sum 7-year game plays
ice_hockey_draft   sum_7yr_TOI   sum 7-year time on ice
hockey             hOTL          Home overtime losses
hockey             PPG           Power play goals
language_corpus    w1st          word id of the first word
language_corpus    w2nd          word id of the second word
and BIRD [22], conduct queries in a SRD mode, neglecting MRD
scenarios and limiting user interaction modes. For example, for
the prompt engineering based LLMs such as DIN-SQL [24], MAC-
SQL [28], and DAIL-SQL [16], the prompt engineering is organized
in a SRD mode. Simply adding MRD prompts to the prompt en-
gineering does not match MRD scenarios. The most challenging
aspect is that the LLM needs to first determine whether a given
sentence is part of a MRD or a SRD. For example, Query1 can be
considered the SRD mode because it can generate SQL that cor-
responds to the intended meaning. However, Query2 cannot be
considered the SRD mode, and it requires combination with Query1
to generate SQL that meets the intended meaning. Therefore, there
is a need for an accurate MRD matching method that can satisfy
the user interactivity.
C2: Large number of columns and ambiguous columns. In
the NL2BI scenario, SQL mainly consists of Online Analytical Pro-
cessing (OLAP) queries. From a timeliness perspective, this leads
to tables in BI scenarios often undergoing Extract-Transform-Load
(ETL) processing first, incorporating a large number of columns
into the same table to avoid join queries. From the perspective of
actual storage, this new table containing a large number of columns
often exists in the form of views. For example, one of Microsoft’s
internal financial data warehouses consists of 632 tables contain-
ing more than 4000 columns and 200 views with more than 7400
columns. Similarly, another commercial dataset tracking product us-
age consists of 2281 tables containing a total of 65000 columns and
1600 views with almost 50000 columns [13]. The large number of
columns poses a direct challenge to methods that rely on LLMs for
schema linking, such as DIN-SQL [24], MAC-SQL [28], and DAIL-
SQL [16]. When columns are input as tokens, it causes the number
of tokens to exceed the limit, and simply truncating columns leads
to a sharp drop in the accuracy of generated SQL [13]. At the same
time, the proportion of ambiguous columns in BI scenarios has
also expanded. Table 1 shows the columns ambiguity in the BIRD
dataset, which accounts for a small proportion. However, in BI
scenarios, column ambiguity accounts for a larger proportion. For
example, the “uv” can be represented by the “User view”, “Unique
visitor numbers”, and “Distinct number of visits”. The presence of
column ambiguity will increase the likelihood of hallucinations in
LLMs.
C3: Insufficiency of the existing process flows. The existing
process flows use LLMs to directly generate SQL. For example,
MAC-SQL [28] shows accuracies of 65.73%, 52.69%, and 40.28%
for Simple, Moderate, and Challenging tasks, respectively, on
the BIRD dataset. DAIL-SQL [16] also has similar results. How-
ever, SQL tasks in BI scenarios often contain complex semantics,
comparisons 1 , and calculation relationships 2 , making the tasks
usually Challenging+. Advanced LLMs demonstrate superior per-
formance on NL2SQL tasks [13], albeit with higher costs (GPT-4-32K
is about 60X more expensive than GPT-3.5-turbo-instruct [7] at the
time of this writing). Most current process flows employ advanced
LLMs such as GPT-4 [6]. However, GPT-4 has not yet opened an
entry for SFT, limiting existing approaches to handling complex
semantics, computations, and comparison relationships through
prompts. Longer prompts tend to yield poorer results [13], signifi-
cantly constraining the understanding of the aforementioned chal-
lenges within the existing process flows. This means that existing
LLMs-based NL2SQL methods cannot effectively handle complex
semantics, calculations, and comparison relationships in BI SQL.
This reflects the insufficiency of the existing process flows, where
all existing methods use LLMs to understand challenging complex
semantics, comparisons, and calculation relationships. In BI scenar-
ios, calculation relationships are easy to list (such as Week-on-Week
and Year-on-Year comparisons) but difficult for LLMs to directly
perceive. Therefore, there is an opportunity to find a new process
flow that can better handle complex semantics, comparisons, and
calculation relationships.
1 Week-on-week comparisons in Figure 1.
2 (table_origin.`mt_1` - table_1Week.`mt_1`) / table_1Week.`mt_1` AS `mt_2` in Figure 1.
Our approach. To address the challenges described above, we
introduce ChatBI for solving the NL2BI task within actual produc-
tion systems. For C1, unlike previous methods that focus solely
on SRD scenarios, we treat the matching of MRD scenarios as a
crucial module and identify corresponding features, columns, and
dimensions. We employ two smaller and cheaper Bert-like mod-
els to first conduct text classification and then text prediction for
MRD matching. Compared to the method of inputting all dialogues
as a sequence for prediction, ChatBI introduces a recent matching
mode, which shows improved performance. For C2, we initially
integrate mature view technology from the database community,
converting the schema linking problem into a Single View Selec-
tion problem. A smaller and cheaper single view selection model
is used to select a single view, and all columns within this view
are treated as the schema. The ability of views to retain a single
column for ambiguous columns effectively addresses the problem
of column ambiguity. Lastly, for C3, since the existing process flows
predominantly utilize GPT-4, which cannot perform SFT and must
rely solely on prompts, leading to a failure to understand complex
semantics, comparisons, and computational relationships, we have
designed a phased process flow and introduced virtual columns. Us-
ing this innovative process flow and virtual columns, we decouple
LLMs from the understanding of complex semantics, comparisons,
and computational relationships, and SQL language itself, thereby
increasing their accuracy by reducing the task difficulty faced by
the LLMs. In the new process flow, we position the SQL generation
step after the LLM using existing rule-based methods, thus avoiding
most errors caused by the hallucinations of LLMs.
Specifically, we make the following contributions:
• Considering the MRD scenario that objectively exists in the
real-world use of NL2BI technology, we propose and solve
the MRD problem in the NL2BI scenario.
• Unlike previous methods that directly use LLMs for schema
linking, we transform schema linking into a single view
selection problem using view technology from the database
community. We then use the columns corresponding to the
single view as the selected columns input to the LLM.
• Unlike previous methods that directly use LLMs to gener-
ate SQL, we propose a phased process flow. This process
outputs structured intermediate results, decoupling some
content that the LLM needs to perceive from the original
process. Complex semantics, calculations, and comparison
relationships are handled through structured intermediate
results using mapping and other methods. The structured
nature of SQL makes the new process more effective.
• We have deployed the proposed method in a production
environment, integrated it into multiple product lines, and
compared it with mainstream NL2SQL methods. ChatBI has
achieved optimal results in terms of accuracy and efficiency.
In the following sections, we will provide details of ChatBI. In
section 2, we present necessary backgrounds. Section 3 introduces
the overview of ChatBI. We then introduce the Multi-Round Di-
alogues Matching and Single View Selection in Section 4 and
the Phased Process Flow in Section 5. We evaluate our method in
Section 6. Finally, we conclude this paper in Section 7.
[Figure 2, condensed. NL2SQL input: the question "The short video playback volume for the past seven days?" plus the Video database, producing the Query1 SQL shown in Figure 1. NL2BI input: the Multi-Round questions Q1 ("The short video playback volume for the past seven days?") and Q2 ("What about the week-on-week comparison?"), a schema of 4 views (t6628: 74 columns, including event_day, sv_vv, playtime; t6847: 55 columns, including uid, dau; t6867: 35 columns, including vid, click_pv; t7479: 62 columns, including cuid, uv), and metadata with types and descriptions, e.g.:

{
  "t6628.event_day": {"type": "DATE", "description": "Date"},
  "t6628.sv_vv": {"type": "DECIMAL", "description": "Short video distribution volume"},
  "t6628.playtime": {"type": "DECIMAL", "description": "Total playback duration (seconds)"},
  ...
  "t6847.uid": {"type": "VARCHAR", "description": "Logged-in user ID"},
  "t6847.dau": {"type": "Virtual Metric", "description": "count(distinct if (uid is not null, cuid, null))"},
  ...
}

producing the week-on-week comparison SQL shown in Figure 1.]
Figure 2: An example of metadata in NL2SQL and NL2BI.
“dau” and “uv” are Virtual Columns.
2 PRELIMINARY
NL2SQL Task. The primary goal of the NL2SQL task is to construct
a SQL query S from a given Natural Language question Q, a database
D and a corresponding database table T. The S should be executable
and accurately address the informational intent encapsulated in Q,
thereby producing a valid response within the context provided by
D and T :
S = Parser(Q, D, T),    (1)
where Parser() analyzes Q leveraging D and T. D contains the
database schema, while T contains the database metadata, i.e.,
column types and descriptions.
An example of the metadata is presented in Figure 2.
NL2BI Task. The primary goal of the NL2BI task is to construct a
SQL query S from given Multi-Round Natural Language question
MRQ, a database D and a corresponding database view V :
S = Parser(MRQ, D, V).    (2)
Virtual Column. Virtual columns are columns that are typically
calculated in BI scenarios, stored together with their corresponding
computation rules; they do not physically exist in tables or views. As shown in Figure 2,
within table or view “t6847”, there is a column named “uid”, which
represents the logged-in user ID. In the BI scenario, DAU is a crucial
column, standing for Daily Active Users. It signifies the number of
active users per day. In “t6847”, “dau” exists as an virtual column.
The description associated with “t6847.dau” is “count( distinct if(uid
is not null, uid, null) )”, which specifies the computation syntax for
obtaining DAU.
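To make this concrete, the following is a minimal Python sketch (ours for illustration, not ChatBI's actual code) of the metadata structure from Figure 2 and a lookup that expands virtual columns; the helper resolve_column and its behavior for physical columns are assumptions.

# Metadata as in Figure 2: each column maps to a type and a description;
# a virtual column such as "dau" stores its computation rule as the value.
metadata = {
    "t6847.uid": {"type": "VARCHAR", "description": "Logged-in user ID"},
    "t6847.dau": {
        "type": "Virtual Metric",
        "description": "count(distinct if(uid is not null, uid, null))",
    },
}

def resolve_column(qualified_name: str) -> str:
    # Virtual columns expand to their stored computation rule;
    # physical columns resolve to the bare column name.
    entry = metadata[qualified_name]
    if entry["type"] == "Virtual Metric":
        return entry["description"]
    return qualified_name.split(".", 1)[1]

print(resolve_column("t6847.dau"))
# -> count(distinct if(uid is not null, uid, null))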
3 SYSTEM OVERVIEW
Figure 3 presents an overview of ChatBI, which comprises three
modules: Multi-Round Dialogues Matching, Single View Selec-
tion, and a Phased Process Flow. Specifically, after a user poses a
Natural Language question, the Multi-Round Dialogues Matching
module determines whether the question is a follow-up in a Multi-
Round dialogue or a standalone query in a Single-Round dialogue
to ensure the question encapsulates the user’s complete intent. The
Single View Selection module identifies the corresponding view
that matches the user’s intent based on the question containing the
full intent, and organizes this corresponding view and the question
into prompts that are passed to a LLM. Unlike previous process
flows that directly generate the target SQL from these prompts, the
phased process flow first requires the LLM to output a JSON interme-
diary. This intermediary serves as the input for a rule-based SQL
generation method, producing the target SQL. Existing rule-based
methods struggle with complex semantics, computations, and com-
parison relationships. Therefore, ChatBI introduces a method of
Virtual Column in the phased process flow. This method avoids the
need for the LLM to learn complex semantics, computations, and
comparison relationships, reducing the difficulty of the problem-
solving process for the LLM in the phased process flow. Finally,
with the help of Apache SuperSet [1], ChatBI enables the output of
custom results in various formats (e.g., charts, line graphs). After
each query is executed, users evaluate whether the generated SQL
resolves the query. If resolved, the hit is counted in the View
Advisor. If not, the query undergoes analysis; for cases where
the failure stems from the absence of a suitable view, which causes
the Single View Selection module to fail, the DBA assesses whether
a new view should be established. This ensures the accuracy of
the Single View Selection module. Based on the counts, the DBA
can also dynamically delete seldom-used views, achieving dynamic
management of views.
4 BALANCES ECONOMY AND ACCURACY
In this section, we first discuss Multi-Round Dialogue matching
in Section 4.1, which addresses the interactivity of user queries.
Subsequently, in section 4.2, we introduce the Single View Selection
method designed to meet the challenges posed by the large number
of columns and column ambiguities in BI scenarios.
4.1 Multi-Round Dialogue Matching
In BI scenarios, dimensions and columns are the two most crucial
types of features. Dimensions typically correspond to statistical
times or granularity, such as “the past three days” representing
a time dimension, while columns correspond to specific columns,
such as the “User visitors” column representing the number of
user visits. Typically, when a query contains both dimensions and
columns, such as the query “User visits in the Beijing city yester-
day”, it is possible to generate the corresponding SQL, and such a
query can be considered as the current query. However, when a
query lacks either a dimension or a column, for example, “What
about Tianjin?”, it cannot generate the corresponding SQL, thus
it is necessary to treat it as part of a MRD query, using historical
queries as sequence input to predict the current query. As shown in
Figure 4, ChatBI splits the MRD matching problem into two tasks:
a classification task and a prediction task.
Classification Task. From the perspective of the timeliness of calls
in actual production systems, ChatBI uses a Pre-training model
rather than a LLM to perform this text classification task. In text
classification tasks, the State-of-the-art (SOTA) Pre-training method
is Bert [31]. In our production system, where the user input is in
Chinese, we use the ERNIE [26] model, which is a Bert-like model
that performs better on Chinese NLP tasks. The effectiveness of
this classification model is demonstrated by the following formula
[Figure 3, condensed: user questions enter the Multi-Round Dialogues Matching module (an SFT ERNIE (Tiny) classifier over label sequences plus a predictor that appends predicted questions); the resulting question drives Single View Selection over tables and views (another SFT ERNIE (Tiny) model, supported by a View Advisor and the DBA); the Phased Process Flow then has LLMs emit a JnM (Json nested Map, e.g. {"Key1": [List1], ..., "KeyN": "ValueN"}), maps Virtual Column keys to metrics, and hands the result to Apache SuperSet, which returns results to users.]
Figure 3: The Overview of ChatBI.
[Figure 4, condensed. Classification Task: each query is labeled 1 if it contains both dimension and metric, and 0 if it lacks either (e.g., "Yesterday Beijing's User visitors." → 1; "What about Tianjin?" → 0; "What was the DAU yesterday?" → 1). Prediction Task: under the recent matching mode, the queries from the most recent label-1 query onward are fed to an SFT ERNIE (Tiny) model, which predicts a complete query containing both dimension and metric.]
Figure 4: The method used in MRD Matching.
I(Q):
I(Q) = 1, if Q contains both dimensions and columns,
       0, if Q lacks either dimensions or columns.    (3)
Next, we detail the specifics of the Pre-training dataset. We em-
ploy a straightforward method to construct the Pre-training corpus.
Firstly, for Q labeled as 1, which includes queries containing both di-
mensions and columns, the distinction between whether a column
in a database table or view is a dimension or column is quite clear.
For instance, “event_day” typically represents a time dimension,
and “app_id” often represents different business lines, also a dimen-
sion. Columns, such as “sv_vv” which represents the distribution
volume of short videos, and “dau” which represents daily active
users, are clearly defined as such. Simply concatenating dimension
and column can generate Q labeled as 1, such as “event_day + sv_vv”
could produce the training data: “What is the volume of short video
plays over the past three days?”. For Q labeled as 0, it is natural to
create data containing only one of dimensions or columns by re-
moving either dimension or column from data labeled as 1. For data
that contains neither dimensions nor columns, we utilize GPT-4 for
generation. The training model employs a 4-layer transformer
ERNIE (Tiny) model, which first classifies each user query.
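As an illustration of how such a classifier might be served, the sketch below uses a Hugging Face-style sequence classification head; "org/ernie-tiny-4l" is a placeholder checkpoint name, since the fine-tuned 4-layer ERNIE (Tiny) model itself is not released with this text.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "org/ernie-tiny-4l"  # hypothetical checkpoint, not the paper's artifact
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def I(query: str) -> int:
    # Returns 1 if the query contains both dimensions and columns, else 0 (Eq. 3).
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())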
Prediction Task. Starting from the perspective of timeliness, ChatBI
does not directly use LLMs for prediction. Queries labeled as 1 by the
classification model can be directly treated as current queries, while
those labeled as 0 require processing to become current queries.
An initial idea is to use all of a user’s historical queries as sequence
input to predict queries that meet the criteria of containing both
dimensions and columns. This approach has been found to be inef-
fective (details discussed in Section 6). We observe that user queries
mostly have logical coherence and often follow up on previous
queries, so the most recent queries labeled as 0 are typically supple-
ments or follow-ups to the most recent queries labeled as 1. Based
on this observation, we have designed a recent matching mode: for a
query, we identify the timestamp of the most recent query classified
as 1 in the user’s historical inputs, and from this timestamp to the
current moment, we take all queries posed by the user as a sequence
to a 4-layer transformer ERNIE (Tiny) model for prediction, treating
the predicted query as the current query.
In summary, Algorithm 1 illustrates the main procedures of the
MRDs matching algorithm. For the user’s MRDs, when the latest
dialogue contains both dimensions and columns, it can be directly
used as the current query input (Line 10). Otherwise, we select the
Algorithm 1 MRDs Matching Algorithm
Input: MRDs queries {Q_1, Q_2, ..., Q_N} (Q_N is the latest query)
Output: A query Q that is labeled as one
 1: if I(Q_N) is Zero then
 2:    ID ← -1
 3:    for i ← N, N-1, ..., 2, 1 do
 4:       if I(Q_i) is One then
 5:          ID ← i
 6:          break
 7:       end if
 8:    end for
 9: else
10:    Return Q_N
11: end if
12: if ID ≠ -1 then
13:    Use {Q_ID, Q_ID+1, ..., Q_N} as the inputs, and use the SFT ERNIE Model to predict Q_{N+1}
14:    Append Q_{N+1} to MRDs queries
15: else
16:    Return None
17: end if
18: Return Q_{N+1}
nearest dialogue with label one as the starting point and the latest
dialogue as the endpoint (Line 3 - Line 8), input all dialogues in the
interval into the ERNIE model to predict dialogues with label one
(Line 13), and directly add them to the MRDs (Line 14). The user’s
current query is used as the Natural Language input required by
the single view selection.
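For concreteness, a direct Python transcription of Algorithm 1 is sketched below; classify stands for the I(Q) classifier and predict for the SFT ERNIE prediction model, both of which are stubs we assume rather than ChatBI's internals.

from typing import Callable, List, Optional

def mrd_match(queries: List[str],
              classify: Callable[[str], int],
              predict: Callable[[List[str]], str]) -> Optional[str]:
    # Latest query already contains both dimensions and columns (Line 10).
    if classify(queries[-1]) == 1:
        return queries[-1]
    # Scan backwards for the most recent label-1 query (Lines 3-8).
    start = -1
    for i in range(len(queries) - 1, -1, -1):
        if classify(queries[i]) == 1:
            start = i
            break
    if start == -1:
        return None  # no label-1 query exists in the history (Line 16)
    window = queries[start:]          # recent matching mode window
    completed = predict(window)       # predict Q_{N+1} (Line 13)
    queries.append(completed)         # append to the MRDs (Line 14)
    return completed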
4.2 Single View Selection
Schema linking is the process of selecting the target columns re-
quired by a Natural Language from all the columns in the data
tables. This task presents a heightened challenge in BI scenarios
due to the large number of columns, in contrast to ordinary sce-
narios. In previous work [13], LLMs were limited by the number
of tokens, preventing them from processing tables with a large num-
ber of columns, and the longer the prompt, the more severe the
degradation of the model’s capabilities. Additionally, this work [13]
mentions that the optimal method for schema linking is to provide
only the optimal tables and columns (29.7%), followed by providing
only the optimal tables (24.3%), and then the LLM’s own selec-
tion (21.1%). Advanced LLMs demonstrate superior performance
on NL2SQL tasks [13], albeit with higher costs (GPT-4-32K is about
60X more expensive than GPT-3.5-turbo-instruct [7] at the time of
this writing). For DIN-SQL [24], generating a single SQL query costs
$0.5, which is unacceptable in actual production systems. Therefore,
both from the perspective of reducing the number of tokens and
solving the schema linking problem, using a smaller and cheaper
model to provide the optimal tables and columns has become the
optimal choice.
Meanwhile, it is noteworthy that Views, despite their mature
development in data warehouses, data mining, data visualiza-
tion, decision support systems, and BI systems, are absent in
[Figure 5 content — GPT-4 Instruction:]
"As a data analyst, you currently have access to a data table with dimensions: [Date, Opening Method, Search Type, Baidu App Matrix Product, Operating System], and statistical indicators: [Search DAU, Total Search Traffic, Total Distribution Volume from First Search, Number of New Users from Search]. The dimension [Operating System] includes the enumeration values: [Android, iOS, Other]; [Opening Method] includes: [push, proactive opening, other, overall, launch]; [Search Type] includes: [new search, traditional search]; [Baidu App Product] includes: [Baidu App main version, Baidu App Lite version, Baidu App large font version, Honor white-label, overall]. The indicator field [Search DAU] is also known as [Daily Active Users for Search, Daily Active Search], and [Total Search Traffic] is also known as [Search Traffic, Search Count, Number of Searches].
Based on the dimensions and statistical indicators of this table, generate 20 data query questions. Your questions must be limited to the table's available dimensions and indicators, must include the dimensions [Opening Method, Date], and be as concise as possible. You may use the dimension's enumeration values and the aliases of the indicator fields."

[Example generated questions:]
Q1: How many Daily Active Users for Search were there on [specific date] when the app was proactively opened?
Q2: What was the Total Search Traffic on [specific date] for proactive openings?
...
Q19: What was the total number of searches on [specific date] for new searches launched by other methods?
Q20: How many searches were conducted on [specific date] using proactive openings in the Honor white-label version?

Figure 5: The prompt for SFT training data generation using GPT-4.
the Spider [33] and BIRD [22] datasets. These Views are estab-
lished upon business access and exist objectively within real data
tables. The presence of Views substantially mitigates the need for
JOIN operations between tables, enabling the execution of queries
that would otherwise be dispersed across multiple tables on a single
table. And if there exists an optimal single view that contains the
optimal columns, the view is the answer. Otherwise, there may be
two or more views that require a join operation to be resolved or
other errors. We report all such join scenarios to the DBA in View
Advisor, who evaluates the proportion of these types of queries and
decides whether to establish views for such queries to avoid join
operations. Consequently, in the context of NL2BI tasks, an optimal
strategy for decomposing the problem is to convert the schema
linking issue into a Single View Selection problem.
Given multiple Views Vs, each containing numerous columns
that genuinely exist in the data tables without duplication, and each
column potentially appearing in zero or more Vs, the objective is to
select a View V that encompasses all the target columns Cols corre-
sponding to the user’s Natural Language query while minimizing
the total number of columns in the V. The Single View Selection
problem is defined as follows:
V = Select(Vs, Cols),    (4)
where the Select() is designed to find a V that contains Cols in Vs.
The abundance of Cols in BI scenarios poses a challenge when
inputting all schemas as tokens into LLMs, leading to the issue of
token number exceeding the limit. Furthermore, the same column
across different tables in the same business may be expressed in
multiple ways, resulting in column ambiguity. Addressing the Sin-
gle View Selection problem can significantly compress the token
count of the schema linking results, enabling the LLMs to accom-
modate schemas. The logical matching relationships established
during V creation effectively resolve column ambiguity.
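Read literally, Eq. 4 admits a simple deterministic interpretation: among the views whose column sets cover every target column, pick the one with the fewest columns. The sketch below shows only this reading; ChatBI itself learns the selection with an SFT ERNIE (Tiny) model, as described next.

from typing import Dict, Optional, Set

def select_view(views: Dict[str, Set[str]], cols: Set[str]) -> Optional[str]:
    # Candidate views are those covering all target columns.
    candidates = [name for name, vcols in views.items() if cols <= vcols]
    if not candidates:
        return None  # no single view covers the intent; report to the View Advisor
    # Minimize the total number of columns in the selected view.
    return min(candidates, key=lambda name: len(views[name]))

views = {
    "t6628": {"event_day", "sv_vv", "playtime"},
    "t6847": {"event_day", "uid", "dau"},
}
print(select_view(views, {"event_day", "dau"}))  # -> t6847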
Considering online invocation and economic benefits, we de-
cide to first adopt smaller and cheaper models and assess whether
they can effectively complete view selection. In line with the initial
approach of not employing a LLM to tackle MRD scenarios, we
designed and compared three lightweight models for practical appli-
cation: a Convolutional Neural Network (CNN) model (1 layer and
Table 2: The accuracy of each view selection model.

Baseline                    ACC
SFT ERNIE (4 layers)        95.7%
Bert (12 layers)            95.2%
CNN (1 layer, 3-channel)    83.9%
3-channel), a Bert model (12 layers) and an SFT ERNIE model (4 lay-
ers) for view selection data. The training data, generated as shown
in Figure 5, specifically comprises real Views and their included
Cols, a capability inherent to BI scenarios. We use this method to
generate over 3,000 entries as training data and 500 entries as test
data. We then validate three models with the results presented in
Table 2. Based on the comparison of experimental results, the SFT
ERNIE Tiny model was ultimately selected.
ChatBI also maintains a View Advisor module within its Single
View Selection. After ChatBI finishes running, all queries that fail
the single view selection are collected into the View Advisor and
handed over to the business line’s DBA for evaluation. The DBA
assesses the proportion of such queries in user scenarios and decides
whether to establish views to support such queries and avoid join
operations. Meanwhile, all queries that successfully select a single
view increment the hit count for the corresponding view by one.
In practice, through the View Advisor, the DBA can easily add new
views and delete views that have not been used for a long time.
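A minimal sketch of this bookkeeping, as we understand it from the description above, follows; the class and method names are illustrative rather than ChatBI's schema.

from collections import Counter
from typing import List

class ViewAdvisor:
    def __init__(self) -> None:
        self.hits: Counter = Counter()       # view name -> hit count
        self.failed_queries: List[str] = []  # queries awaiting DBA review

    def record_hit(self, view_name: str) -> None:
        self.hits[view_name] += 1

    def record_failure(self, query: str) -> None:
        self.failed_queries.append(query)

    def seldom_used(self, threshold: int = 1) -> List[str]:
        # Views whose hit counts fall below the threshold; candidates
        # the DBA may dynamically delete.
        return [v for v, n in self.hits.items() if n < threshold]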
5 DECOMPOSE PROBLEMS BEFORE SOLVING THEM
In this section, we first discuss a phased process that differs from
previous methods in Section 5.1. Subsequently, in Section 5.2, we
explore how ChatBI employs Virtual Columns within this process.
5.1 Phased Process Flow
In BI scenarios, complex semantics, computational relationships,
and comparison relationships make it difficult for existing meth-
ods to handle NL2BI tasks effectively. Current process flows rely
solely on SFT data or prompts, and it is challenging to enumerate
all relationships and provide their corresponding golden SQL. Even
if these relationships and their golden SQL are obtained, incorpo-
rating them into prompts is not feasible, as it would lead to token
overflow and, concurrently, the trimming could impair the LLM’s
understanding of other knowledge. Recent studies have demon-
strated that the performance of LLMs on complex tasks can be
enhanced through a decomposed prompting technique [20]. This
method entails segmenting a task into multiple steps and employ-
ing the intermediate results to synthesize a final answer. Unlike
algebraic expressions, which consist of clearly delineated steps or
operations, the deconstruction of a complex SQL query poses a
significantly greater challenge due to the declarative nature of SQL
and the intricate interrelations among various query clauses [24].
In response to these challenges, ChatBI introduces a phased pro-
cess flow designed to decompose the NL2BI problem, specifically
aiming to effectively handle complex semantics, computational
relationships, and comparison relationships within BI scenarios.
[Figure 6, condensed. The NL2SQL technology's process flow is single-stage: database metadata plus the question (query) go to NL2SQL LLMs, which emit SQL results directly. ChatBI's process flow is two-stage: Stage 1 feeds database metadata plus the question to ChatBI, which emits a JnM (Json nested Map, e.g. {"Key1": [List1], ..., "KeyN": "ValueN"}); Stage 2 maps Virtual Column keys to metrics and passes the JSON results to Apache SuperSet, which produces the SQL results and multiple forms of output (table, chart, text).]
Figure 6: The difference between ChatBI and other NL2SQL
technology’s process flow.
As shown in Figure 6, ChatBI introduces a phased process flow
that initially utilizes a LLM to generate JSON-formatted intermedi-
ate outputs. These outputs are then fed into BI middleware such
as Apache SuperSet, which supports various output formats, to
display results. The key difference in the new process flow com-
pared to the traditional reliance on LLMs for SQL generation is its
reliance solely on generating JSON. This approach does not require,
as DAIL-SQL [16] does, providing extensive Natural Language and
its corresponding golden SQL in the prompt, which significantly
reduces the number of tokens.
The new process flow does not require the LLM to understand
the complex semantics associated with various SQL join operations,
nor does it need to grasp specific comparison relationships like DtD
(Day-to-Day) comparisons typical in BI scenarios, or the complex
computational relationships among columns. By decoupling the
complexity of SQL problems from the LLM, the difficulty of gener-
ating JSON in the new process is merely Simple (according to the
task grading by DIN-SQL and MAC-SQL based on the complexity
of SQL generation tasks, since ChatBI does not need to understand
SQL, all are classified as Simple tasks). This significantly enhances
the accuracy of the process flow. The step of generating SQL is exe-
cuted within Apache SuperSet, where extensive research from the
database community on generating SQL based on dimensions and
columns already exists, such as CatSQL [14], which combines deep
learning models with rule-based models. ChatBI implements multi-
ple universal templates in Apache SuperSet, and the JSON output
from the LLM acts like filling in the blanks of the corresponding
query template, eventually outputting SQL.
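To illustrate the "fill in the blanks" idea, here is a toy sketch in which a JnM-style JSON intermediary is rendered through one universal template. The JSON keys and the template text are our assumptions for illustration; ChatBI's actual templates live inside Apache SuperSet.

import json

TEMPLATE = (
    "SELECT {dims}, {metrics}\nFROM {view}\n"
    "WHERE event_day >= '{start}' AND event_day <= '{end}'\n"
    "GROUP BY {dims}\nORDER BY {dims} ASC\nLIMIT 10000;"
)

def render_sql(jnm: str) -> str:
    # The LLM's JSON output fills the blanks of the query template.
    spec = json.loads(jnm)
    return TEMPLATE.format(
        dims=", ".join(spec["dimensions"]),
        metrics=", ".join(f"sum({m}) AS `{m}`" for m in spec["metrics"]),
        view=spec["view"], start=spec["start"], end=spec["end"],
    )

jnm = ('{"view": "video.t6628", "dimensions": ["event_day"], '
       '"metrics": ["sv_vv"], "start": "2024-04-04", "end": "2024-04-10"}')
print(render_sql(jnm))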
5.2 Virtual Column
Although the phased process flow, which first breaks down NL2SQL
into NL2JSON, reduces the task’s complexity, the lack of SQL un-
derstanding cannot be simply compensated by templates in Apache
SuperSet. Particularly for columns like “DAU”, which require com-
putational relationships derived from other columns, the complexity
of these relationships significantly reduces the accuracy of the tem-
plates. Therefore, ChatBI introduces the concept of Virtual Columns,
as illustrated in Figure 2. For the “dau” column, it does not phys-
ically exist in the data table or view but merely stores the SQL
computation relationships needed to derive it. Virtual columns like
these can be accessed through their corresponding key (column
name) to find the value (computation rule). We represent virtual
columns using a map structure nested into the JSON, referred to
as JnM, ensuring that all queries hitting Virtual Columns directly
access the stored computation rules.
Although the computational rules for columns like “DAU” in
the BI domain are relatively standard, the NL2BI task possesses
its uniqueness. However, we believe that when applying LLMs to
specific tasks, they can be broken down into NL2JSON like ChatBI,
utilizing a map stored in the JSON to manage complex rules. This
approach not only reduces the difficulty of generating tasks for
LLMs, enhancing their accuracy, but also cleverly uses JnM to per-
ceive complex semantics and relationships. Relationships similar
to those in virtual columns can be generated by LLMs, thereby
facilitating caching to speed up computations in future calls.
In addition to “DAU”, here are two examples of Virtual Columns
in ChatBI. “staytime” represents the video playback duration in
seconds. A frequently asked column is the total playback dura-
tion in minutes, whose Virtual Column, stay_time_min, is stored in
the view’s schema with the calculation rule “sum(staytime) / 60”.
Another commonly queried column, the average playback dura-
tion per user, stay_time_per, is also stored in the view’s schema,
with the calculation rule “sum(staytime)/60/count(uid)”, where “uid”
represents the user ID.
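A sketch of how the JnM map could resolve these virtual columns at SQL-generation time follows; the map layout and the expand_metric helper are illustrative, while the two rules are taken verbatim from the examples above.

VIRTUAL_COLUMNS = {
    "stay_time_min": "sum(staytime) / 60",
    "stay_time_per": "sum(staytime) / 60 / count(uid)",
}

def expand_metric(metric: str) -> str:
    # Metrics that hit the map expand to their stored computation rules,
    # so the LLM never has to learn them; others default to a plain sum.
    rule = VIRTUAL_COLUMNS.get(metric)
    return f"{rule} AS `{metric}`" if rule else f"sum({metric}) AS `{metric}`"

print(expand_metric("stay_time_min"))  # -> sum(staytime) / 60 AS `stay_time_min`
print(expand_metric("sv_vv"))          # -> sum(sv_vv) AS `sv_vv`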
6 EXPERIMENTAL EVALUATION
6.1 Experimental Setup
Datasets. The Spider [33] and BIRD [22] datasets are commonly
utilized to evaluate the efficacy of NL2SQL parsers across diverse
databases. When used in the NL2BI evaluation context, however,
the two datasets adopt a SRD mode and overlook the significant
number of existing views. The presence of views can optimize most
cross-table queries within a single view. Therefore, we have devel-
oped two datasets from real BI scenario data, SRD dataset and
MRD dataset without golden SQLs, each provided with 4 views,
as shown in Figure 7. Each dataset contains 226 columns and 4
views and encompasses 153 real BI Chinese queries. To enhance the
difficulty of queries in the MRD dataset, the 153 queries consist of 51
three-round dialogues. Among these dialogues, 19 involve queries
that switch views, meaning the tables accessed in the three rounds
of dialogue differ. We consider these 19 dialogues to represent chal-
lenging tasks. BIRD and Spider provide a golden SQL for each query,
a process that is quite challenging to construct. Therefore, from the
perspective of actual production systems, both SRD dataset and
MRD dataset do not have golden SQLs.
Evaluation Metrics. Existing NL2SQL technologies typically em-
ploy exact string match and execution match metrics. The former
only compares whether results share the same string expression,
which may lead to many correct results being considered incor-
rect. Relying solely on execution results depends on the dataset’s
predefined golden SQL, which could also lead to misjudgments.
For instance, in a BI context, for the query: “Yesterday’s interaction
metrics for users of app1”, the golden SQL is as follows:
SELECT SUM(click_pv) AS total_click_pv,
       SUM(disp_pv) AS total_disp_pv,
       SUM(play_pv) AS total_play_pv,
       SUM(like_pv) AS total_like_pv,
       SUM(comment_pv) AS total_comment_pv,
       SUM(collection_pv) AS total_collection_pv,
       SUM(shareto_pv) AS total_shareto_pv,
       SUM(follow_pv) AS total_follow_pv
FROM table1
WHERE event_day = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
  AND appid = 'app1';

Table 3: LLM leaderboard [5] 2024-03. Avg means the Average performance, Lg means the Language, and Knowl means the Knowledge.

LLM              O/C Source      Avg    Lg     Knowl   Agent
GPT-4-Turbo      Close Source    62     54.9   66.3    82
Claude3-Opus     Close Source    60.5   44.6   68.9    81.1
GLM-4            Close Source    57.8   57.7   71      73.4
Qwen-MAX-0107    Close Source    55.8   59.4   55.4    60
Qwen-MAX-0403    Close Source    55.6   58.8   70.4    73.7
Qwen-72B         Open Source     54.5   59.6   67.8    54.6
These columns click_pv, disp_pv, play_pv, like_pv, comment_pv,
collection_pv, shareto_pv, and follow_pv are interaction metrics. Yet
any non-empty set containing one or several of click_pv, disp_pv,
play_pv, like_pv, comment_pv, collection_pv, shareto_pv, or fol-
low_pv can be considered correct. The SQL below is also deemed
correct:
SELECT SUM(click_pv) AS total_click_pv,
       SUM(disp_pv) AS total_disp_pv
FROM table1
WHERE event_day = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
  AND appid = 'app1';
Following previous work [13], we consider using the metric useful-
ness rather than correctness. So we use the metric useful execution
accuracy (UEX) which compares the execution output of the pre-
dicted SQL query and the intent of this query. SQL queries with
column errors and syntax errors are invariably considered incorrect.
The results of all executable queries are assessed by experienced
DBAs to determine whether they satisfy the user’s intent for that
specific query. We also provide the number of tokens used in the
prompts and responses for each method to assess the eco-
nomic cost. We use prompt tokens (PT) to represent the number of
tokens used in the prompts, and response tokens (RT) to represent
the number of tokens used in the responses.
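As an aside, PT and RT can be tallied with any tokenizer matched to the target LLM; the sketch below uses tiktoken for GPT-4 as an assumption, since the text does not name the tokenizer actually used.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    # Number of tokens the model would consume for this text.
    return len(enc.encode(text))

prompt_tokens = count_tokens("/* full prompt text here */")
response_tokens = count_tokens('{"view": "t6628", "metrics": ["sv_vv"]}')
print(prompt_tokens, response_tokens)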
Baselines
• DIN-SQL [24]. DIN-SQL deconstructs the NL2SQL task into
discrete subtasks, formulating distinct prompts for each
subtask to guide GPT-4 through the completion of each
component, thereby deriving the final SQL.
• MAC-SQL [28]. MAC-SQL utilizes multi-agent collabora-
tion for NL2SQL tasks and reaches an accuracy of 59.6% on BIRD.
• ChatBI. ChatBI employs all the modules introduced in this
paper, matching Multi-Round Dialogues and performing
Single View Selection through a phased processing flow.
Initially, a LLM outputs JSON, which is subsequently used
to generate the SQL corresponding to the query.
• ChatBI w/o PF + zero-shot. The primary difference between ChatBI w/o PF + zero-shot and ChatBI is that it does not employ a phased process flow. Similar to other NL2SQL methods, it directly generates the corresponding SQL using a LLM.

[Figure 7, condensed:]

SRD Dataset:
Q1: 近三天主版DAU及日环比和周同比波动情况? (What is the DAU of the main version in the last three days, including daily and weekly fluctuations?)
Q2: 近三天极速版DAU及日环比和周同比波动情况? (What is the DAU of the lite version in the last three days, including daily and weekly fluctuations?)
Q3: 查询昨天安卓新增用户有多少? (How many new Android users were there yesterday?)
...
Q153: 近七天男女活跃人数对比? (What is the gender comparison of active users over the past seven days?)

MRD Dataset:
Q1-1: 近三天主版DAU及日环比和周同比波动情况? (What is the DAU of the main version in the last three days, including daily and weekly fluctuations?)
Q1-2: 极速版呢? (What about the lite version?)
Q1-3: 新增用户的情况呢? (What about the new user additions?)
...
Q51-3: 近三个月的呢? (What about the last three months?)

Figure 7: Introduction to the SRD dataset and MRD dataset. The main version and the lite version correspond to different apps. DAU stands for Daily Active Users, and new users refer to users who are registering for the first time.

Table 4: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on SRD dataset.

Baseline                     UEX       PT        RT
DIN-SQL                      20.92%    1472242   53332
MAC-SQL                      52.29%    1272661   36952
ChatBI                       74.43%    841550    29336
ChatBI w/o PF + zero-shot    58.17%    586640    28886
6.2 Evaluations on SRD Dataset
It is important to note that all experiments utilize the “2023-03-15-preview” version of GPT-4 [6] according to Table 3, and that ChatBI uses Erniebot-4.0 [3] (also developed by Baidu Inc.) in actual production systems due to data privacy concerns. In Table 4, we report the performance of our method and baseline methods on the SRD dataset. It is evident that our method surpasses the other baseline methods while using fewer tokens in both prompt and response.
Table 5: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on SRD dataset of ChatBI w/o PF + zero-shot and ChatBI w/o PF + few-shot.

Baseline                     UEX       PT        RT
ChatBI w/o PF + zero-shot    58.17%    586640    28886
ChatBI w/o PF + few-shot     49.67%    735144    24126

ChatBI achieves higher UEX with fewer tokens on SRD dataset.
Compared to MAC-SQL, ChatBI improves the UEX metric by 42.34%
((74.43−52.29)/52.29), using 33.87% ((1272661−841550)/1272661) fewer tokens in the
prompt stage and 20.61% ((36952−29336)/36952) fewer tokens in the re-
sponse stage. Relative to DIN-SQL, ChatBI increases the UEX metric
by 255.78% ((74.43−20.92)/20.92), utilizing 42.84% ((1472242−841550)/1472242) fewer
tokens in the prompt and 44.99% ((53332−29336)/53332) fewer tokens in
the response. ChatBI and MAC-SQL both provide column values in
their prompts, which offers significant adaptability for BI scenarios
that require explicit column specifications, and it largely prevents
SQL execution errors caused by the hallucinations of LLMs. In
contrast, DIN-SQL does not provide column values in its prompts,
resulting in a higher likelihood of errors in the filter conditions of
the generated SQL. Furthermore, during the execution of DIN-SQL,
it has been observed that it fails to analyze queries in Chinese. This
issue arises because its prompts consist entirely of English without
any Chinese elements. In cases where DIN-SQL produces unana-
lyzable SQL for Chinese queries, we re-execute it until it outputs
SQL. Therefore, we believe that ChatBI achieves higher accuracy
with fewer tokens used in the NL2BI SRD task.
Table 6: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on MRD dataset.

Baseline                     UEX       PT        RT
DIN-SQL                      20.26%    1526638   65850
MAC-SQL                      35.29%    1685080   30095
ChatBI                       67.32%    799442    20766
ChatBI w/o PF + zero-shot    52.94%    592074    18723
The phased process flow assists LLMs in understanding com-
plex semantics, computations, and comparison relationships.
Although ChatBI uses 43.45% ((841550−586640)/586640) more tokens in
prompts compared to ChatBI w/o PF + zero-shot, it achieves a
27.95% ((74.43−58.17)/58.17) improvement in UEX. We find that this is
primarily because the phased process flow enables the LLM to by-
pass the need to comprehend complex semantics, computations,
and comparison relationships directly within SQL. Instead, it only
needs to understand how to map these complex relationships to
the outputs required in JSON. By reducing the complexity of the
task, the LLM can provide more accurate results, leading to an
enhancement in UEX. Although the phased processing flow does
not require the LLM to perceive SQL directly, it still necessitates
an understanding of the JSON data type and the various parameter
forms needed for subsequent SQL generation, hence more tokens
are used in the prompts.
Longer prompts do not necessarily lead to better results.
ChatBI w/o PF + zero-shot, which merely changes its output format
from JSON to SQL compared to ChatBI, prompted us to explore
whether increasing the number of shots could enable direct SQL
generation to achieve or surpass the performance of JSON gen-
eration. In Table 5, we examine the UEX, PT, and RT for ChatBI
w/o PF + zero-shot and ChatBI w/o PF + few-shot on the SRD
dataset. The table shows that with more shots provided, ChatBI
w/o PF + few-shot actually experienced a 14.61% ((58.17−49.67)/58.17) de-
crease in UEX compared to ChatBI w/o PF + zero-shot, which uses
25.31% ((735144−586640)/586640) more tokens in prompts. Similarly, previ-
ous studies [13] have also mentioned that longer prompts might
cause degradation in large language models. Furthermore, shots
that yield positive outcomes often require meticulous design [16].
Therefore, we do not extensively discuss the impact of shots in this
paper.
6.3 Evaluations on MRD Dataset
Since DIN-SQL [24] and MAC-SQL [28] are not designed for Multi-
Round Dialogues, we have made the following modifications in
the prompts. The MRD Dataset consists of 51 sets of three-round
queries, namely Query1, Query2, and Query3. The prompt for
Query1 is as follows:
/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider the problem step by step.
The current question is {Query1}.
*/

The prompt for Query2 is shown below:

/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider these problems step by step.
The last question is {Query1}, and the current question is {Query2}.
*/
And the prompt for Query3 is as follows:

/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider these problems step by step.
The last two questions are {Query1} and {Query2}, and the current question is {Query3}.
*/
In actual scenarios, for a better interactive experience, LLMs are
not told whether a query belongs to a MRD or a SRD, unless users
are explicitly required to inform the LLM that the question should
be considered in conjunction with the previous ones. In the
experiments, ChatBI is not informed about which turn
of a MRD the dialogue belongs to. And this is entirely determined
by the Multi-Round Dialogue Matching module.
ChatBI achieves higher UEX with fewer tokens on MRD
dataset. Compared to MAC-SQL, ChatBI improves the UEX metric
by 90.76% ((67.32−35.29)/35.29), using 52.56% ((1685080−799442)/1685080) fewer to-
kens in the prompt stage and 31.00% ((30095−20766)/30095) fewer tokens in
the response stage. Relative to DIN-SQL, ChatBI increases the UEX
metric by 232.28% ((67.32−20.26)/20.26), utilizing 47.63% ((1526638−799442)/1526638)
fewer tokens in the prompt and 74.09% ((65850−20766)/65850) fewer to-
kens in the response. As demonstrated in the evaluation of the
SRD dataset, the presence of field hallucinations still results in DIN-
SQL’s SQL prediction accuracy being inferior to that of MAC-SQL
and ChatBI. And ChatBI continues to achieve a higher UEX with
fewer tokens in the MRD dataset.
The Multi-Round Dialogue Matching module is better equipped
to handle MRD scenarios. The most significant change in the
MRD dataset compared to the SRD dataset is that the questions are
part of MRD, which introduces follow-up queries. For ChatBI, its
(67.32−35.29)
UEX metric in the MRD dataset decreased by 9.55% (
)
35.29
compared to the SRD dataset, while MAC-SQL decreased by 32.51%
(52.29−35.29)
(
). This indicates that, compared to using LLMs to di-
52.29
rectly handle Multi-Round Dialogues, ChatBI’s proposed Multi-
Round Dialogue Matching module is more effective in managing
MRD scenarios.
The Phased Process Flow and Virtual Columns are more adept at handling complex semantics, computations, and comparative relationships. In the SRD dataset, ChatBI shows a 27.95% ((74.43 − 58.17)/58.17) improvement in UEX compared to ChatBI w/o PF + zero-shot. Similarly, in the MRD dataset, ChatBI shows a 27.16% ((67.32 − 52.94)/52.94) increase in UEX compared to ChatBI w/o PF + zero-shot. We find that this is primarily because ChatBI achieves higher accuracy on SQL tasks involving complex semantics, computations, and comparisons in both datasets. Therefore, the Phased Process Flow and Virtual Columns prove more effective in managing these challenging relationships.
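To make the role of Virtual Columns concrete, the sketch below maps business metric names to reusable SQL expressions so that a generated query only needs to select a metric rather than re-derive its formula. The mapping table, placeholder syntax, and helper function are hypothetical illustrations in the spirit of Virtual Columns, not the paper's actual implementation.

# Hypothetical illustration of Virtual Columns: metrics are pre-mapped
# to SQL expressions, so generated SQL references the metric name
# instead of re-deriving the computation each time.
VIRTUAL_COLUMNS = {
    # "new users" resolves to a distinct count over the new-user flag
    "new_users": "COUNT(DISTINCT CASE WHEN is_video_new = 1 THEN uid END)",
    "dau": "COUNT(DISTINCT uid)",
}

def expand_virtual_columns(sql):
    """Replace placeholders like {new_users} with their SQL expressions."""
    for name, expr in VIRTUAL_COLUMNS.items():
        sql = sql.replace("{" + name + "}", expr)
    return sql

template = "SELECT event_day, {new_users} FROM t6628 GROUP BY event_day"
print(expand_virtual_columns(template))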
6.4 Discussion
The limitations of the datasets primarily stem from the evaluation phase. As introduced, both the SRD dataset and the MRD dataset contain only 153 queries, composed by Database Administrators (DBAs) for Business Intelligence scenarios over four views. Compared to Spider [33]'s 10181 questions and BIRD [22]'s 12751 questions, the SRD and MRD datasets contain very few queries. This is mainly because, in actual NL2BI tasks, most questions do not have what is referred to as a "golden SQL". For example, a user might ask, "How many new users were added in the last three days?" (a real user question at Baidu Inc.). The SQL below returns the number of users added on each day over the past three days:
SELECT event_day, COUNT(DISTINCT uid)
FROM t6628
WHERE is_video_new = 1
  AND event_day BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 3 DAY)
                    AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY event_day
ORDER BY event_day ASC;
The following SQL instead returns the total number of users added over the same period:
SELECT COUNT(DISTINCT uid)
FROM t6628
WHERE is_video_new = 1
  AND event_day BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 3 DAY)
                    AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY);
Here, is_video_new indicates whether a user is new, with is_video_new = 1 denoting a new user and is_video_new = 0 denoting an existing user; uid is the user ID, and event_day is the date. In the BI scenario, it can be assumed that today's data is not yet prepared. Thus, similar to previous work [13], we use useful execution accuracy (UEX) rather than execution accuracy (EX) for evaluation, as both SQLs are considered correct in our context, whereas in other datasets only one result would be deemed correct. Expanding the dataset is therefore hampered by the difficulty of accurately assessing correctness, which remains a significant challenge.
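A minimal sketch of a UEX-style check is shown below, under the assumption that each question is paired with a set of acceptable golden SQLs rather than a single one; the in-memory database and function names are illustrative.

# Illustrative UEX-style check: a prediction counts as correct if its
# result matches ANY acceptable interpretation, unlike plain EX, which
# compares against a single golden SQL. sqlite3 is used only to keep
# the demo self-contained; all names here are hypothetical.
import sqlite3

def run(conn, sql):
    return sorted(map(tuple, conn.execute(sql).fetchall()))

def uex_correct(conn, predicted_sql, golden_sqls):
    predicted = run(conn, predicted_sql)
    return any(predicted == run(conn, g) for g in golden_sqls)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t6628 (event_day TEXT, uid INT, is_video_new INT)")
conn.executemany("INSERT INTO t6628 VALUES (?, ?, ?)",
                 [("2024-05-01", 1, 1), ("2024-05-02", 2, 1),
                  ("2024-05-02", 3, 0)])
golden_sqls = [
    "SELECT COUNT(DISTINCT uid) FROM t6628 WHERE is_video_new = 1",
    "SELECT event_day, COUNT(DISTINCT uid) FROM t6628 "
    "WHERE is_video_new = 1 GROUP BY event_day",
]
pred = "SELECT COUNT(DISTINCT uid) FROM t6628 WHERE is_video_new = 1"
print(uex_correct(conn, pred, golden_sqls))  # True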
7 RELATED WORK
Prompting-Based NL2SQL. LLMs such as GPT-3 [6], LLaMA [4], Codex [8], PaLM [23], and Claude 3 [2] are breaking traditional practices in natural language processing (NLP) communities, providing new possibilities for solving NL2SQL tasks. Inspired by chain-of-thought [30], an increasing number of studies focus on designing effective prompting techniques to better utilize LLMs for various subtasks. For example, in the NL2SQL scenario, DIN-SQL [24], C3 [12], and SQL-PaLM [25] provide prompting techniques for GPT-4, GPT-3.5, and PaLM respectively, greatly enhancing the accuracy of LLMs on NL2SQL tasks. The GPT paper mentions that different data structure forms in prompts, such as using the character # versus not using it, can make a significant difference. It is therefore necessary to focus on effective and novel data structures for prompting techniques.
Pre-trained and Supervised Fine-Tuning-Based NL2SQL. In the domain of NLP, prior to the emergence of LLMs, the predominant methodology [9–11, 15, 17–19, 27, 29, 31, 32] for addressing the NL2SQL task involved fine-tuning an "encoder-decoder" model. Leveraging the sequence-to-sequence paradigm, this approach models NL2SQL as a transformation from natural language to the corresponding SQL output. Furthermore, the advent of pre-training techniques enabled supervised fine-tuning on specific datasets, thereby augmenting the encoder's capabilities and yielding a more effective resolution of the NL2SQL task. Typically, this process requires a considerable volume of data pertaining to databases and SQL. Unlike methodologies built on large language models and prompting strategies, the pre-trained models in this approach cannot be guided by chains of thought at inference time, rendering them somewhat constrained.
LLM Designed Specifically for NL2SQL. The recent study CodeS [21] has developed an LLM tailored specifically for the NL2SQL task, with parameter sizes ranging from 1B to 15B, yielding commendable outcomes. Nonetheless, it is imperative to take into account the costs associated with both training and data acquisition. In comparison to Supervised Fine-Tuning (SFT), this approach necessitates a more extensive dataset. Such methodologies exhibit a propensity towards specialization, diverging from the inherent versatility of LLMs (i.e., their adaptability to a multitude of subtasks). CodeS still relies on high-quality training data, which is often difficult to obtain. In the absence of sufficient training data for the NL2BI task, the data augmentation and few-shot inference approaches proposed by CodeS have not been validated in this scenario. Consequently, this particular type of LLM falls outside our scope of consideration.
8 CONCLUSION
In this paper, we describe and evaluate ChatBI, a technology that converts Natural Language into complex SQL for Business Intelligence (BI) scenarios. First, ChatBI employs smaller and cheaper models than LLMs to match the Multi-Round Dialogue scenarios encountered by users in practice. Second, ChatBI leverages view technology from the database community to transform the challenging schema linking and metric ambiguity issues in BI scenarios into a Single View Selection problem, again using smaller and cheaper models to address it. Finally, ChatBI introduces a phased process flow that decomposes the NL2BI task, and proposes Virtual Column mapping to handle complex semantics, computations, and comparative relationships. Based on this design, ChatBI effectively understands these complex relationships within the NL2BI task. In actual production systems, ChatBI has enabled many non-expert users to perform expert-level BI analysis.
ACKNOWLEDGMENTS
The authors are grateful to the Baidu MEG’s MEDD team, and in
particular, to Jiawei Liu, Chaoxian Gui, Yuchao Jiang, and many
fellow team members.
REFERENCES
[1] [n.d.]. Apache Superset. https://superset.apache.org/.
[2] [n.d.]. Claude3. https://www.anthropic.com/claude.
[3] [n.d.]. Erinebot. https://yiyan.baidu.com/.
[4] [n.d.]. LLama. https://llama.meta.com/.
[5] [n.d.]. LLM Ranks. https://rank.opencompass.org.cn/leaderboard-llm.
[6] [n.d.]. OpenAI GPT. https://openai.com/pricing.
[7] [n.d.]. OpenAI pricing. https://openai.com/pricing.
[8] [n.d.]. OpenAICodex. https://openai.com/blog/openai-codex.
[9] Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. 2021. Sadga: Structure-aware dual graph aggregation network for text-to-sql. Advances in Neural Information Processing Systems 34 (2021), 7664–7676.
[10] Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. arXiv preprint arXiv:2106.01093 (2021).
[11] DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2021. Ryansql: Recursively applying sketch-based slot fillings for complex text-to-sql in cross-domain databases. Computational Linguistics 47, 2 (2021), 309–332.
[12] Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. 2023. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306 (2023).
[13] Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, S Deep, G Hagleither, W Tan, J Cahoon, R Alotaibi, J Henkel, A Singla, et al. 2024. Nl2sql is a solved problem... not!. In 14th Annual Conference on Innovative Data Systems Research (CIDR'24).
[14] Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, and Jianling Sun. 2023. Catsql: Towards real world natural language to sql applications. Proceedings of the VLDB Endowment 16, 6 (2023), 1534–1547.
[15] Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver, John R Woodward, John Drake, and Qiaofu Zhang. 2021. Natural SQL: Making SQL easier to infer from natural language specifications. arXiv preprint arXiv:2109.05153 (2021).
[16] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023).
[17] Alex Graves. 2012. Long short-term memory. Supervised sequence labelling with recurrent neural networks (2012), 37–45.
[18] Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349 (2020).
[19] Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on wikisql with table-aware word contextualization. arXiv preprint arXiv:1902.01069 (2019).
[20] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
[21] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL. arXiv preprint arXiv:2402.16347 (2024).
[22] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2024).
[23] Sharan Narang and Aakanksha Chowdhery. 2022. Pathways language model (palm): Scaling to 540 billion parameters for breakthrough performance. Google AI Blog (2022).
[24] Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems 36 (2024).
[25] Ruoxi Sun, Sercan O Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. Sql-palm: Improved large language model adaptation for text-to-sql. arXiv preprint arXiv:2306.00739 (2023).
[26] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019).
[27] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
[28] Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023. Mac-sql: Multi-agent collaboration for text-to-sql. arXiv preprint arXiv:2312.11242 (2023).
[29] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942 (2019).
[30] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
[31] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314 (2020).
[32] Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Grappa: Grammar-augmented pre-training for table semantic parsing. arXiv preprint arXiv:2009.13845 (2020).
[33] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887 (2018).
[34] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022).