ChatBI: Towards Natural Language to Complex Business Intelligence SQL
Jinqing Lian★, Xinyi Liu§, Yingxia Shao★∗, Yang Dong§, Ming Wang§, Wei Zhang§, Tianqi Wan§, Ming Dong§∗, Hailin Yan§
★Beijing University of Posts and Telecommunications
§Baidu Inc.
{jinqinglian,shaoyx}@bupt.edu.cn, {liuxinyi02,dongyang06,wangming15,zhangwei143,wantianqi,dongming03,yanhailin}@baidu.com
∗Ming Dong and Yingxia Shao are the corresponding authors.
ABSTRACT
The Natural Language to SQL (NL2SQL) technology provides non-
expert users who are unfamiliar with databases the opportunity to
use SQL for data analysis. Converting Natural Language to Business
Intelligence (NL2BI) is a popular practical scenario for NL2SQL in
actual production systems. Compared to NL2SQL, NL2BI introduces
more challenges.
In this paper, we propose ChatBI, a comprehensive and efficient
technology for solving the NL2BI task. First, we analyze the in-
teraction mode, an important module where NL2SQL and NL2BI
differ in use, and design a smaller and cheaper model to match this
interaction mode. In BI scenarios, tables contain a huge number of
columns, making it impossible for existing NL2SQL methods that
rely on Large Language Models (LLMs) for schema linking to pro-
ceed due to token limitations. The higher proportion of ambiguous
columns in BI scenarios also makes schema linking difficult. ChatBI
combines existing view technology in the database community to
first decompose the schema linking problem into a Single View
Selection problem and then uses a smaller and cheaper machine
learning model to select the single view with a significantly re-
duced number of columns. The columns of this single view are then
passed as the required columns for schema linking into the LLM.
Finally, ChatBI proposes a phased process flow different from exist-
ing process flows, which allows ChatBI to generate SQL containing
complex semantics and comparison relations more accurately.
We have deployed ChatBI on Baidu’s data platform and inte-
grated it into multiple product lines for large-scale production
task evaluation. The obtained results highlight its superiority in
practicality, versatility, and efficiency. At the same time, compared
with the current mainstream NL2SQL technology under our real BI
scenario data tables and queries, it also achieved the best results.

1 INTRODUCTION
The rapid development of LLMs has attracted the attention of re-
searchers in both academia and industry, with researchers from
the Natural Language Processing (NLP) and database communities
hoping to use LLMs to solve the NL2SQL task. The NL2SQL task
involves converting user Natural Language into SQL that can be ex-
ecuted by a database, with the SQL semantics being consistent with
the user’s intent. Today, thousands of organizations, such as Google,
Microsoft, Amazon, Meta, Oracle, Snowflake, Databricks, Baidu,
and Alibaba, use Business Intelligence (BI) for decision support.
Consequently, researchers representing the industry have quickly
focused on the most attention-grabbing scenario of the NL2SQL
task in actual production systems: the NL2BI task, which involves
converting Natural Language into BI queries. Through
NL2BI technology, many non-expert users who do not understand
databases, such as product managers or operations personnel, can
perform data analysis, thereby aiding decision-making.
Figure 1 reflects the difference between the NL2BI and NL2SQL
tasks. In actual production systems, it is humans who call the NL2BI
tool, and the most significant characteristic of humans using the
tool is interactivity, i.e., Multi-Round Dialogue (MRD) scenarios.
Consider, for example, the three-round dialogue scenario shown in Figure 1.
Query1 indicates “the user's query for the column of short video
playback volume over the past seven days”. Both NL2SQL and NL2BI
technologies can recognize the user’s intent and generate the cor-
responding executable correct SQL. In Query2, the user asks
a common question in BI scenarios, “the Week-on-Week com-
parison”, which corresponds to complex semantics and comparison
relationships. The presence of complex semantics and comparison
relationships makes the problem of generating SQL difficult, and
existing NL2SQL technology cannot generate the SQL required for
BI well.
Existing LLMs-based NL2SQL technology only matches Single-
Round Dialogue (SRD) queries, and when there is no prompt to
show that the next question is a MRD, the LLM cannot recognize
the intent of the MRD well. Therefore, for Query3, which changes the
queried column, the existing NL2SQL method cannot recognize
that the user’s intent is to query “the playback duration of short
videos over the past seven days and the Week-on-Week comparison”.
Therefore, the existing NL2SQL method still cannot generate the
SQL corresponding to the intent of Query3. In addition to the
different interaction modes brought by actual scenarios, data tables
under BI scenarios often contain hundreds of columns, most of
[Figure 1, condensed. Left: the NL2SQL setting, where video.t6628 comes from tables (event_day "Date", sv_vv "Short Video Views", playtime "Play time (seconds)", and 20+ further columns). NL2SQL answers Query1 with:

SELECT DATE(event_day) AS `xc_1`, sum(sv_vv) AS `mt_1`
FROM video.t6628
WHERE event_day >= '2024-04-04'
  AND event_day <= '2024-04-10'
GROUP BY DATE(event_day)
ORDER BY `xc_1` ASC, `mt_1` DESC
LIMIT 10000;

but yields ERROR SQL for Query2 (complex comparisons) and Query3 (Multi-Round Dialogue is not supported). Right: the NL2BI setting, where video.t6628 comes from views (200+ columns). NL2BI handles all three rounds; for Query3 it generates:

SELECT table_origin.`xc_1` AS `xc_1`, table_origin.`mt_1` AS `mt_1`,
  table_origin.`mt_2` AS `mt_2`,
  (table_origin.`mt_1` - table_1Week.`mt_1`) / table_1Week.`mt_1` AS `mt_3`,
  (table_origin.`mt_2` - table_1Week.`mt_2`) / table_1Week.`mt_2` AS `mt_4`
FROM
  (SELECT DATE(event_day) AS `xc_1`, sum(sv_vv) AS `mt_1`,
          sum(playtime)/60 AS `mt_2`, 1 AS a
   FROM video.t6628
   WHERE event_day >= '2024-04-04' AND event_day <= '2024-04-10'
   GROUP BY DATE(event_day)) AS table_origin
LEFT OUTER JOIN
  (SELECT DATE(DATE_ADD(DATE(event_day), INTERVAL 1 WEEK)) AS `xc_1`,
          sum(sv_vv) AS `mt_1`, sum(playtime)/60 AS `mt_2`
   FROM video.t6628
   WHERE event_day >= '2024-03-28' AND event_day <= '2024-04-03'
   GROUP BY DATE(DATE_ADD(DATE(event_day), INTERVAL 1 WEEK)))
  AS `table_1Week`
ON table_origin.`xc_1` = table_1Week.`xc_1`
ORDER BY `xc_1` ASC, `mt_1` DESC, `mt_2` DESC
LIMIT 10000; ]
Figure 1: The difference between NL2SQL and NL2BI problems. The question consists of multiple rounds of dialogue, Query1:
“The short video playback volume for the past seven days”, Query2: “What about the week-on-week comparison?”, Query3:
“What about the playback duration?”. The NL2BI problem requires the ability to handle complex semantics, comparisons and
calculation relationships, as well as multi-round dialogue. And the data sources for the two problems are also different.
which exist in views, while tables in NL2SQL usually contain dozens
of columns, most of which come from real physical tables [13].
Researchers have not put much effort into NL2BI, but they have
invested considerable effort in NL2SQL. Existing methods can be di-
vided into three categories. The first category is pre-trained and Su-
pervised Fine-Tuning (SFT) methods [9–11, 15, 17–19, 27, 29, 31, 32],
which fine-tune an “encoder-decoder” model and use a sequence-to-
sequence pattern to convert Natural Language into SQL. This was
the mainstream approach before the advent of LLMs. The second
category is prompt engineering based LLMs. Inspired by chain-
of-thought [30] and least-to-most [34], more and more research
focuses on designing effective prompt engineering to make LLMs
perform better in various subtasks. For example, in the NL2SQL sce-
nario, DIN-SQL [24], C3 [12], and SQL-PaLM [25] provide prompt
engineering to GPT-4 [6], GPT-3.5 [6], and PaLM [23], respectively,
to greatly enhance the accuracy of generating SQL from Natu-
ral Language. The third category is LLMs specifically trained for
NL2SQL, which use domain knowledge in NL2SQL to train special-
ized LLMs. However, when directly applying these methods to real
analysis tasks in the BI scenario, we encountered some problems
from the following perspectives:
C1: Limited interaction modes. The MRD interaction scenario
shown in Figure 1 considers the interactivity of users using the
NL2BI method, while the vast majority of existing NL2SQL methods,
due to the evaluation of the open-source leaderboards Spider [33]
Table 1: Examples of ambiguous columns in the BIRD dataset.

db_id              column_name   description
ice_hockey_draft   sum_7yr_GP    sum 7-year game plays
ice_hockey_draft   sum_7yr_TOI   sum 7-year time on ice
hockey             hOTL          Home overtime losses
hockey             PPG           Power play goals
language_corpus    w1st          word id of the first word
language_corpus    w2nd          word id of the second word
and BIRD [22], conduct queries in a SRD mode, neglecting MRD
scenarios and limiting user interaction modes. For example, for
the prompt engineering based LLMs such as DIN-SQL [24], MAC-
SQL [28], and DAIL-SQL [16], the prompt engineering is organized
in a SRD mode. Simply adding MRD prompts to the prompt en-
gineering does not match MRD scenarios. The most challenging
aspect is that the LLM needs to first determine whether a given
sentence is part of a MRD or a SRD. For example, Query1 can be
considered the SRD mode because it can generate SQL that cor-
responds to the intended meaning. However, Query2 cannot be
considered the SRD mode, and it requires combination with Query1
to generate SQL that meets the intended meaning. Therefore, there
is a need for an accurate MRD matching method that can satisfy
the user interactivity.
C2: Large number of columns and ambiguous columns. In
the NL2BI scenario, SQL mainly consists of Online Analytical Pro-
cessing (OLAP) queries. From a timeliness perspective, this leads
to tables in BI scenarios often undergoing Extract-Transform-Load
(ETL) processing first, incorporating a large number of columns
into the same table to avoid join queries. From the perspective of
actual storage, this new table containing a large number of columns
often exists in the form of views. For example, one of Microsoft’s
internal financial data warehouses consists of 632 tables contain-
ing more than 4000 columns and 200 views with more than 7400
columns. Similarly, another commercial dataset tracking product us-
age consists of 2281 tables containing a total of 65000 columns and
1600 views with almost 50000 columns [13]. The large number of
columns poses a direct challenge to methods that rely on LLMs for
schema linking, such as DIN-SQL [24], MAC-SQL [28], and DAIL-
SQL [16]. When columns are input as tokens, it causes the number
of tokens to exceed the limit, and simply truncating columns leads
to a sharp drop in the accuracy of generated SQL [13]. At the same
time, the proportion of ambiguous columns in BI scenarios has
also expanded. Table 1 shows the columns ambiguity in the BIRD
dataset, which accounts for a small proportion. However, in BI
scenarios, column ambiguity accounts for a larger proportion. For
example, the “uv” can be represented by the “User view”, “Unique
visitor numbers”, and “Distinct number of visits”. The presence of
column ambiguity will increase the likelihood of hallucinations in
LLMs.
C3: Insufficiency of the existing process flows. The existing
process flows use LLMs to directly generate SQL. For example,
MAC-SQL [28] shows accuracies of 65.73%, 52.69%, and 40.28%
for Simple, Moderate, and Challenging tasks, respectively, on
the BIRD dataset. DAIL-SQL [16] also has similar results. How-
ever, SQL tasks in BI scenarios often contain complex semantics,
comparisons 1 , and calculation relationships 2 , making the tasks
usually Challenging+. Advanced LLMs demonstrate superior per-
formance on NL2SQL tasks [13], albeit with higher costs (GPT-4-32K
is about 60X more expensive than GPT-3.5-turbo-instruct [7] at the
time of this writing). Most current process flows employ advanced
LLMs such as GPT-4 [6]. However, GPT-4 has not yet opened an
entry for SFT, limiting existing approaches to handling complex
semantics, computations, and comparison relationships through
prompts. Longer prompts tend to yield poorer results [13], signifi-
cantly constraining the understanding of the aforementioned chal-
lenges within the existing process flows. This means that existing
LLMs-based NL2SQL methods cannot effectively handle complex
semantics, calculations, and comparison relationships in BI SQL.
This reflects the insufficiency of the existing process flows, where
all existing methods use LLMs to understand challenging complex
semantics, comparisons, and calculation relationships. In BI scenar-
ios, calculation relationships are easy to list (such as Week-on-Week
and Year-on-Year comparisons) but difficult for LLMs to directly
perceive. Therefore, there is an opportunity to find a new process
flow that can better handle complex semantics, comparisons, and
calculation relationships.
1 Week-on-week comparisons in Figure 1.
2 (table_origin.`mt_1` - table_1Week.`mt_1`) / table_1Week.`mt_1` AS `mt_2` in Figure 1.
Our approach. To address the challenges described above, we
introduce ChatBI for solving the NL2BI task within actual produc-
tion systems. For C1, unlike previous methods that focus solely
on SRD scenarios, we treat the matching of MRD scenarios as a
crucial module and identify corresponding features, columns, and
dimensions. We employ two smaller and cheaper Bert-like mod-
els to first conduct text classification and then text prediction for
MRD matching. Compared to the method of inputting all dialogues
as a sequence for prediction, ChatBI introduces a recent matching
mode, which shows improved performance. For C2, we initially
integrate mature view technology from the database community,
converting the schema linking problem into a Single View Selec-
tion problem. A smaller and cheaper single view selection model
is used to select a single view, and all columns within this view
are treated as the schema. The ability of views to retain a single
column for ambiguous columns effectively addresses the problem
of column ambiguity. Lastly, for C3, since the existing process flows
predominantly utilize GPT-4, which cannot perform SFT and must
rely solely on prompts, leading to a failure to understand complex
semantics, comparisons, and computational relationships, we have
designed a phased process flow and introduced virtual columns. Us-
ing this innovative process flow and virtual columns, we decouple
LLMs from the understanding of complex semantics, comparisons,
and computational relationships, and SQL language itself, thereby
increasing their accuracy by reducing the task difficulty faced by
the LLMs. In the new process flow, we position the SQL generation
step after the LLM using existing rule-based methods, thus avoiding
most errors caused by the hallucinations of LLMs.
Specifically, we make the following contributions:
• Considering the MRD scenario that objectively exists in the
real-world use of NL2BI technology, we propose and solve
the MRD problem in the NL2BI scenario.
• Unlike previous methods that directly use LLMs for schema
linking, we transform schema linking into a single view
selection problem using view technology from the database
community. We then use the columns corresponding to the
single view as the selected columns input to the LLM.
• Unlike previous methods that directly use LLMs to gener-
ate SQL, we propose a phased process flow. This process
outputs structured intermediate results, decoupling some
content that the LLM needs to perceive from the original
process. Complex semantics, calculations, and comparison
relationships are handled through structured intermediate
results using mapping and other methods. The structured
nature of SQL makes the new process more effective.
• We have deployed the proposed method in a production
environment, integrated it into multiple product lines, and
compared it with mainstream NL2SQL methods. ChatBI has
achieved optimal results in terms of accuracy and efficiency.
In the following sections, we will provide details of ChatBI. In
section 2, we present necessary backgrounds. Section 3 introduces
the overview of ChatBI. We then introduce the Multi-Round Di-
alogues Matching and Single View Selection in Section 4 and
the Phased Process Flow in Section 5. We evaluate our method in
Section 6. Finally, we conclude this paper in Section 7.
[Figure 2, condensed. NL2SQL input: the question "The short video playback volume for the past seven days?" plus the Video database, producing the Query1 SQL shown in Figure 1. NL2BI input: the Multi-Round questions Q1 ("The short video playback volume for the past seven days?") and Q2 ("What about the week-on-week comparison?"), a schema of 4 views (t6628: 74 columns, including event_day, sv_vv, playtime; t6847: 55 columns, including uid, dau; t6867: 35 columns, including vid, click_pv; t7479: 62 columns, including cuid, uv), and metadata with types and descriptions, e.g.:

{
  "t6628.event_day": {"type": "DATE", "description": "Date"},
  "t6628.sv_vv": {"type": "DECIMAL", "description": "Short video distribution volume"},
  "t6628.playtime": {"type": "DECIMAL", "description": "Total playback duration (seconds)"},
  ...
  "t6847.uid": {"type": "VARCHAR", "description": "Logged-in user ID"},
  "t6847.dau": {"type": "Virtual Metric", "description": "count(distinct if (uid is not null, cuid, null))"},
  ...
}

producing the week-on-week comparison SQL shown in Figure 1.]
Figure 2: An example of metadata in NL2SQL and NL2BI.
“dau” and “uv” are Virtual Columns.
2 PRELIMINARY
NL2SQL Task. The primary goal of the NL2SQL task is to construct
a SQL query S from a given Natural Language question Q, a database
D and a corresponding database table T. The S should be executable
and accurately address the informational intent encapsulated in Q,
thereby producing a valid response within the context provided by
D and T :
S = Parser(Q, D, T),    (1)
where Parser() analyzes Q leveraging D and T. D contains the
database schema, while T contains the database metadata, i.e.,
column types and descriptions.
An example of the metadata is presented in Figure 2.
NL2BI Task. The primary goal of the NL2BI task is to construct a
SQL query S from given Multi-Round Natural Language question
MRQ, a database D and a corresponding database view V :
S = Parser(MRQ, D, V).    (2)
Virtual Column. Virtual columns are columns that are typically
calculated in BI scenarios, stored together with their corresponding
computation rules; they do not physically exist in tables or views. As shown in Figure 2,
within table or view “t6847”, there is a column named “uid”, which
represents the logged-in user ID. In the BI scenario, DAU is a crucial
column, standing for Daily Active Users. It signifies the number of
active users per day. In “t6847”, “dau” exists as an virtual column.
The description associated with “t6847.dau” is “count( distinct if(uid
is not null, uid, null) )”, which specifies the computation syntax for
obtaining DAU.
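To make this concrete, the following is a minimal Python sketch (ours for illustration, not ChatBI's actual code) of the metadata structure from Figure 2 and a lookup that expands virtual columns; the helper resolve_column and its behavior for physical columns are assumptions.

# Metadata as in Figure 2: each column maps to a type and a description;
# a virtual column such as "dau" stores its computation rule as the value.
metadata = {
    "t6847.uid": {"type": "VARCHAR", "description": "Logged-in user ID"},
    "t6847.dau": {
        "type": "Virtual Metric",
        "description": "count(distinct if(uid is not null, uid, null))",
    },
}

def resolve_column(qualified_name: str) -> str:
    # Virtual columns expand to their stored computation rule;
    # physical columns resolve to the bare column name.
    entry = metadata[qualified_name]
    if entry["type"] == "Virtual Metric":
        return entry["description"]
    return qualified_name.split(".", 1)[1]

print(resolve_column("t6847.dau"))
# -> count(distinct if(uid is not null, uid, null))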
3 SYSTEM OVERVIEW
Figure 3 presents an overview of ChatBI, which comprises three
modules: Multi-Round Dialogues Matching, Single View Selec-
tion, and a Phased Process Flow. Specifically, after a user poses a
Natural Language question, the Multi-Round Dialogues Matching
module determines whether the question is a follow-up in a Multi-
Round dialogue or a standalone query in a Single-Round dialogue
to ensure the question encapsulates the user’s complete intent. The
Single View Selection module identifies the corresponding view
that matches the user’s intent based on the question containing the
full intent, and organizes this corresponding view and the question
into prompts that are passed to a LLM. Unlike previous process
flows that directly generate the target SQL from these prompts, the
phased process flow first requires the LLM to output a JSON interme-
diary. This intermediary serves as the input for a rule-based SQL
generation method, producing the target SQL. Existing rule-based
methods struggle with complex semantics, computations, and com-
parison relationships. Therefore, ChatBI introduces a method of
Virtual Column in the phased process flow. This method avoids the
need for the LLM to learn complex semantics, computations, and
comparison relationships, reducing the difficulty of the problem-
solving process for the LLM in the phased process flow. Finally,
with the help of Apache SuperSet [1], ChatBI enables the output of
custom results in various formats (e.g., charts, line graphs). After
each query is executed, users evaluate whether the generated SQL
resolves the query. If resolved, the hit is counted in the View
Advisor. If not, the query undergoes analysis; for cases where
the failure stems from the absence of a suitable view, which causes
the Single View Selection module to fail, the DBA assesses whether
a new view should be established. This ensures the accuracy of
the Single View Selection module. Based on the counts, the DBA
can also dynamically delete seldom-used views, achieving dynamic
management of views.
4 BALANCES ECONOMY AND ACCURACY
In this section, we first discuss Multi-Round Dialogue matching
in Section 4.1, which addresses the interactivity of user queries.
Subsequently, in section 4.2, we introduce the Single View Selection
method designed to meet the challenges posed by the large number
of columns and column ambiguities in BI scenarios.
4.1 Multi-Round Dialogue Matching
In BI scenarios, dimensions and columns are the two most crucial
types of features. Dimensions typically correspond to statistical
times or granularity, such as “the past three days” representing
a time dimension, while columns correspond to specific columns,
such as the “User visitors” column representing the number of
user visits. Typically, when a query contains both dimensions and
columns, such as the query “User visits in the Beijing city yester-
day”, it is possible to generate the corresponding SQL, and such a
query can be considered as the current query. However, when a
query lacks either a dimension or a column, for example, “What
about Tianjin?”, it cannot generate the corresponding SQL, thus
it is necessary to treat it as part of a MRD query, using historical
queries as sequence input to predict the current query. As shown in
Figure 4, ChatBI splits the MRD matching problem into two tasks:
a classification task and a prediction task.
Classification Task. From the perspective of the timeliness of calls
in actual production systems, ChatBI uses a Pre-training model
rather than a LLM to perform this text classification task. In text
classification tasks, the State-of-the-art (SOTA) Pre-training method
is Bert [31]. In our production system, where the user input is in
Chinese, we use the ERNIE [26] model, which is a Bert-like model
that performs better on Chinese NLP tasks. The effectiveness of
this classification model is demonstrated by the following formula
[Figure 3, condensed: user questions enter the Multi-Round Dialogues Matching module (an SFT ERNIE (Tiny) classifier over label sequences plus a predictor that appends predicted questions); the resulting question drives Single View Selection over tables and views (another SFT ERNIE (Tiny) model, supported by a View Advisor and the DBA); the Phased Process Flow then has LLMs emit a JnM (Json nested Map, e.g. {"Key1": [List1], ..., "KeyN": "ValueN"}), maps Virtual Column keys to metrics, and hands the result to Apache SuperSet, which returns results to users.]
Figure 3: The Overview of ChatBI.
[Figure 4, condensed. Classification Task: each query is labeled 1 if it contains both dimension and metric, and 0 if it lacks either (e.g., "Yesterday Beijing's User visitors." → 1; "What about Tianjin?" → 0; "What was the DAU yesterday?" → 1). Prediction Task: under the recent matching mode, the queries from the most recent label-1 query onward are fed to an SFT ERNIE (Tiny) model, which predicts a complete query containing both dimension and metric.]
Figure 4: The method used in MRD Matching.
I(Q):
I(Q) = 1, if Q contains both dimensions and columns,
       0, if Q lacks either dimensions or columns.    (3)
Next, we detail the specifics of the Pre-training dataset. We em-
ploy a straightforward method to construct the Pre-training corpus.
Firstly, for Q labeled as 1, which includes queries containing both di-
mensions and columns, the distinction between whether a column
in a database table or view is a dimension or column is quite clear.
For instance, “event_day” typically represents a time dimension,
and “app_id” often represents different business lines, also a dimen-
sion. Columns, such as “sv_vv” which represents the distribution
volume of short videos, and “dau” which represents daily active
users, are clearly defined as such. Simply concatenating dimension
and column can generate Q labeled as 1, such as “event_day + sv_vv”
could produce the training data: “What is the volume of short video
plays over the past three days?”. For Q labeled as 0, it is natural to
create data containing only one of dimensions or columns by re-
moving either dimension or column from data labeled as 1. For data
that contains neither dimensions nor columns, we utilize GPT-4 for
generation. The training model employs a 4-layer transformer
ERNIE (Tiny) model, which first classifies each user query.
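As an illustration of how such a classifier might be served, the sketch below uses a Hugging Face-style sequence classification head; "org/ernie-tiny-4l" is a placeholder checkpoint name, since the fine-tuned 4-layer ERNIE (Tiny) model itself is not released with this text.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "org/ernie-tiny-4l"  # hypothetical checkpoint, not the paper's artifact
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def I(query: str) -> int:
    # Returns 1 if the query contains both dimensions and columns, else 0 (Eq. 3).
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())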
Prediction Task. Starting from the perspective of timeliness, ChatBI
does not directly use LLMs for prediction. Queries labeled as 1 by the
classification model can be directly treated as current queries, while
those labeled as 0 require processing to become current queries.
An initial idea is to use all of a user’s historical queries as sequence
input to predict queries that meet the criteria of containing both
dimensions and columns. This approach has been found to be inef-
fective (details discussed in Section 6). We observe that user queries
mostly have logical coherence and often follow up on previous
queries, so the most recent queries labeled as 0 are typically supple-
ments or follow-ups to the most recent queries labeled as 1. Based
on this observation, we have designed a recent matching mode: for a
query, we identify the timestamp of the most recent query classified
as 1 in the user’s historical inputs, and from this timestamp to the
current moment, we take all queries posed by the user as a sequence
to a 4-layer transformer ERNIE (Tiny) model for prediction, treating
the predicted query as the current query.
In summary, Algorithm 1 illustrates the main procedures of the
MRDs matching algorithm. For the user’s MRDs, when the latest
dialogue contains both dimensions and columns, it can be directly
used as the current query input (Line 10). Otherwise, we select the
Algorithm 1 MRDs Matching Algorithm
Input: MRDs queries {Q_1, Q_2, ..., Q_N} (Q_N is the latest query)
Output: A query Q that is labeled as one
 1: if I(Q_N) is Zero then
 2:    ID ← -1
 3:    for i ← N, N-1, ..., 2, 1 do
 4:       if I(Q_i) is One then
 5:          ID ← i
 6:          break
 7:       end if
 8:    end for
 9: else
10:    Return Q_N
11: end if
12: if ID ≠ -1 then
13:    Use {Q_ID, Q_ID+1, ..., Q_N} as the inputs, and use the SFT ERNIE Model to predict Q_{N+1}
14:    Append Q_{N+1} to MRDs queries
15: else
16:    Return None
17: end if
18: Return Q_{N+1}
nearest dialogue with label one as the starting point and the latest
dialogue as the endpoint (Line 3 - Line 8), input all dialogues in the
interval into the ERNIE model to predict dialogues with label one
(Line 13), and directly add them to the MRDs (Line 14). The user’s
current query is used as the Natural Language input required by
the single view selection.
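For concreteness, a direct Python transcription of Algorithm 1 is sketched below; classify stands for the I(Q) classifier and predict for the SFT ERNIE prediction model, both of which are stubs we assume rather than ChatBI's internals.

from typing import Callable, List, Optional

def mrd_match(queries: List[str],
              classify: Callable[[str], int],
              predict: Callable[[List[str]], str]) -> Optional[str]:
    # Latest query already contains both dimensions and columns (Line 10).
    if classify(queries[-1]) == 1:
        return queries[-1]
    # Scan backwards for the most recent label-1 query (Lines 3-8).
    start = -1
    for i in range(len(queries) - 1, -1, -1):
        if classify(queries[i]) == 1:
            start = i
            break
    if start == -1:
        return None  # no label-1 query exists in the history (Line 16)
    window = queries[start:]          # recent matching mode window
    completed = predict(window)       # predict Q_{N+1} (Line 13)
    queries.append(completed)         # append to the MRDs (Line 14)
    return completed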
4.2 Single View Selection
Schema linking is the process of selecting the target columns re-
quired by a Natural Language from all the columns in the data
tables. This task presents a heightened challenge in BI scenarios
due to the large number of columns, in contrast to ordinary sce-
narios. In previous work [13], LLMs were limited by the number
of tokens, preventing them from processing tables with a large num-
ber of columns, and the longer the prompt, the more severe the
degradation of the model’s capabilities. Additionally, this work [13]
mentions that the optimal method for schema linking is to provide
only the optimal tables and columns (29.7%), followed by providing
only the optimal tables (24.3%), and then the LLM’s own selec-
tion (21.1%). Advanced LLMs demonstrate superior performance
on NL2SQL tasks [13], albeit with higher costs (GPT-4-32K is about
60X more expensive than GPT-3.5-turbo-instruct [7] at the time of
this writing). For DIN-SQL [24], generating a single SQL query costs
$0.5, which is unacceptable in actual production systems. Therefore,
both from the perspective of reducing the number of tokens and
solving the schema linking problem, using a smaller and cheaper
model to provide the optimal tables and columns has become the
optimal choice.
Meanwhile, it is noteworthy that Views, despite their mature
development in data warehouses, data mining, data visualiza-
tion, decision support systems, and BI systems, are absent in
[Figure 5 content — GPT-4 Instruction:]
"As a data analyst, you currently have access to a data table with dimensions: [Date, Opening Method, Search Type, Baidu App Matrix Product, Operating System], and statistical indicators: [Search DAU, Total Search Traffic, Total Distribution Volume from First Search, Number of New Users from Search]. The dimension [Operating System] includes the enumeration values: [Android, iOS, Other]; [Opening Method] includes: [push, proactive opening, other, overall, launch]; [Search Type] includes: [new search, traditional search]; [Baidu App Product] includes: [Baidu App main version, Baidu App Lite version, Baidu App large font version, Honor white-label, overall]. The indicator field [Search DAU] is also known as [Daily Active Users for Search, Daily Active Search], and [Total Search Traffic] is also known as [Search Traffic, Search Count, Number of Searches].
Based on the dimensions and statistical indicators of this table, generate 20 data query questions. Your questions must be limited to the table's available dimensions and indicators, must include the dimensions [Opening Method, Date], and be as concise as possible. You may use the dimension's enumeration values and the aliases of the indicator fields."

[Example generated questions:]
Q1: How many Daily Active Users for Search were there on [specific date] when the app was proactively opened?
Q2: What was the Total Search Traffic on [specific date] for proactive openings?
...
Q19: What was the total number of searches on [specific date] for new searches launched by other methods?
Q20: How many searches were conducted on [specific date] using proactive openings in the Honor white-label version?

Figure 5: The prompt for SFT training data generation using GPT-4.
the Spider [33] and BIRD [22] datasets. These Views are estab-
lished upon business access and exist objectively within real data
tables. The presence of Views substantially mitigates the need for
JOIN operations between tables, enabling the execution of queries
that would otherwise be dispersed across multiple tables on a single
table. And if there exists an optimal single view that contains the
optimal columns, the view is the answer. Otherwise, there may be
two or more views that require a join operation to be resolved or
other errors. We report all such join scenarios to the DBA in View
Advisor, who evaluates the proportion of these types of queries and
decides whether to establish views for such queries to avoid join
operations. Consequently, in the context of NL2BI tasks, an optimal
strategy for decomposing the problem is to convert the schema
linking issue into a Single View Selection problem.
Given multiple Views Vs, each containing numerous columns
that genuinely exist in the data tables without duplication, and each
column potentially appearing in zero or more Vs, the objective is to
select a View V that encompasses all the target columns Cols corre-
sponding to the user’s Natural Language query while minimizing
the total number of columns in the V. The Single View Selection
problem is defined as follows:
V = Select(Vs, Cols),    (4)
where the Select() is designed to find a V that contains Cols in Vs.
The abundance of Cols in BI scenarios poses a challenge when
inputting all schemas as tokens into LLMs, leading to the issue of
token number exceeding the limit. Furthermore, the same column
across different tables in the same business may be expressed in
multiple ways, resulting in column ambiguity. Addressing the Sin-
gle View Selection problem can significantly compress the token
count of the schema linking results, enabling the LLMs to accom-
modate schemas. The logical matching relationships established
during V creation effectively resolve column ambiguity.
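Read literally, Eq. 4 admits a simple deterministic interpretation: among the views whose column sets cover every target column, pick the one with the fewest columns. The sketch below shows only this reading; ChatBI itself learns the selection with an SFT ERNIE (Tiny) model, as described next.

from typing import Dict, Optional, Set

def select_view(views: Dict[str, Set[str]], cols: Set[str]) -> Optional[str]:
    # Candidate views are those covering all target columns.
    candidates = [name for name, vcols in views.items() if cols <= vcols]
    if not candidates:
        return None  # no single view covers the intent; report to the View Advisor
    # Minimize the total number of columns in the selected view.
    return min(candidates, key=lambda name: len(views[name]))

views = {
    "t6628": {"event_day", "sv_vv", "playtime"},
    "t6847": {"event_day", "uid", "dau"},
}
print(select_view(views, {"event_day", "dau"}))  # -> t6847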
Considering online invocation and economic benefits, we de-
cide to first adopt smaller and cheaper models and assess whether
they can effectively complete view selection. In line with the initial
approach of not employing a LLM to tackle MRD scenarios, we
designed and compared three lightweight models for practical appli-
cation: a Convolutional Neural Network (CNN) model (1 layer and
Table 2: The accuracy of each view selection model.

Baseline                    ACC
SFT ERNIE (4 layers)        95.7%
Bert (12 layers)            95.2%
CNN (1 layer, 3-channel)    83.9%
3-channel), a Bert model (12 layers) and an SFT ERNIE model (4 lay-
ers) for view selection data. The training data, generated as shown
in Figure 5, specifically comprises real Views and their included
Cols, a capability inherent to BI scenarios. We use this method to
generate over 3,000 entries as training data and 500 entries as test
data. We then validate three models with the results presented in
Table 2. Based on the comparison of experimental results, the SFT
ERNIE Tiny model was ultimately selected.
ChatBI also maintains a View Advisor module within its Single
View Selection. After ChatBI finishes running, all queries that fail
the single view selection are collected into the View Advisor and
handed over to the business line’s DBA for evaluation. The DBA
assesses the proportion of such queries in user scenarios and decides
whether to establish views to support such queries and avoid join
operations. Meanwhile, all queries that successfully select a single
view increment the hit count for the corresponding view by one.
In practice, through the View Advisor, the DBA can easily add new
views and delete views that have not been used for a long time.
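A minimal sketch of this bookkeeping, as we understand it from the description above, follows; the class and method names are illustrative rather than ChatBI's schema.

from collections import Counter
from typing import List

class ViewAdvisor:
    def __init__(self) -> None:
        self.hits: Counter = Counter()       # view name -> hit count
        self.failed_queries: List[str] = []  # queries awaiting DBA review

    def record_hit(self, view_name: str) -> None:
        self.hits[view_name] += 1

    def record_failure(self, query: str) -> None:
        self.failed_queries.append(query)

    def seldom_used(self, threshold: int = 1) -> List[str]:
        # Views whose hit counts fall below the threshold; candidates
        # the DBA may dynamically delete.
        return [v for v, n in self.hits.items() if n < threshold]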
5 DECOMPOSE PROBLEMS BEFORE SOLVING THEM
In this section, we first discuss a phased process that differs from
previous methods in Section 5.1. Subsequently, in Section 5.2, we
explore how ChatBI employs Virtual Columns within this process.
5.1 Phased Process Flow
In BI scenarios, complex semantics, computational relationships,
and comparison relationships make it difficult for existing meth-
ods to handle NL2BI tasks effectively. Current process flows rely
solely on SFT data or prompts, and it is challenging to enumerate
all relationships and provide their corresponding golden SQL. Even
if these relationships and their golden SQL are obtained, incorpo-
rating them into prompts is not feasible, as it would lead to token
overflow and, concurrently, the trimming could impair the LLM’s
understanding of other knowledge. Recent studies have demon-
strated that the performance of LLMs on complex tasks can be
enhanced through a decomposed prompting technique [20]. This
method entails segmenting a task into multiple steps and employ-
ing the intermediate results to synthesize a final answer. Unlike
algebraic expressions, which consist of clearly delineated steps or
operations, the deconstruction of a complex SQL query poses a
significantly greater challenge due to the declarative nature of SQL
and the intricate interrelations among various query clauses [24].
In response to these challenges, ChatBI introduces a phased pro-
cess flow designed to decompose the NL2BI problem, specifically
aiming to effectively handle complex semantics, computational
relationships, and comparison relationships within BI scenarios.
[Figure 6, condensed. The NL2SQL technology's process flow is single-stage: database metadata plus the question (query) go to NL2SQL LLMs, which emit SQL results directly. ChatBI's process flow is two-stage: Stage 1 feeds database metadata plus the question to ChatBI, which emits a JnM (Json nested Map, e.g. {"Key1": [List1], ..., "KeyN": "ValueN"}); Stage 2 maps Virtual Column keys to metrics and passes the JSON results to Apache SuperSet, which produces the SQL results and multiple forms of output (table, chart, text).]
Figure 6: The difference between ChatBI and other NL2SQL
technology’s process flow.
As shown in Figure 6, ChatBI introduces a phased process flow
that initially utilizes a LLM to generate JSON-formatted intermedi-
ate outputs. These outputs are then fed into BI middleware such
as Apache SuperSet, which supports various output formats, to
display results. The key difference in the new process flow com-
pared to the traditional reliance on LLMs for SQL generation is its
reliance solely on generating JSON. This approach does not require,
as DAIL-SQL [16] does, providing extensive Natural Language and
its corresponding golden SQL in the prompt, which significantly
reduces the number of tokens.
The new process flow does not require the LLM to understand
the complex semantics associated with various SQL join operations,
nor does it need to grasp specific comparison relationships like DtD
(Day-to-Day) comparisons typical in BI scenarios, or the complex
computational relationships among columns. By decoupling the
complexity of SQL problems from the LLM, the difficulty of gener-
ating JSON in the new process is merely Simple (according to the
task grading by DIN-SQL and MAC-SQL based on the complexity
of SQL generation tasks, since ChatBI does not need to understand
SQL, all are classified as Simple tasks). This significantly enhances
the accuracy of the process flow. The step of generating SQL is exe-
cuted within Apache SuperSet, where extensive research from the
database community on generating SQL based on dimensions and
columns already exists, such as CatSQL [14], which combines deep
learning models with rule-based models. ChatBI implements multi-
ple universal templates in Apache SuperSet, and the JSON output
from the LLM acts like filling in the blanks of the corresponding
query template, eventually outputting SQL.
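To illustrate the "fill in the blanks" idea, here is a toy sketch in which a JnM-style JSON intermediary is rendered through one universal template. The JSON keys and the template text are our assumptions for illustration; ChatBI's actual templates live inside Apache SuperSet.

import json

TEMPLATE = (
    "SELECT {dims}, {metrics}\nFROM {view}\n"
    "WHERE event_day >= '{start}' AND event_day <= '{end}'\n"
    "GROUP BY {dims}\nORDER BY {dims} ASC\nLIMIT 10000;"
)

def render_sql(jnm: str) -> str:
    # The LLM's JSON output fills the blanks of the query template.
    spec = json.loads(jnm)
    return TEMPLATE.format(
        dims=", ".join(spec["dimensions"]),
        metrics=", ".join(f"sum({m}) AS `{m}`" for m in spec["metrics"]),
        view=spec["view"], start=spec["start"], end=spec["end"],
    )

jnm = ('{"view": "video.t6628", "dimensions": ["event_day"], '
       '"metrics": ["sv_vv"], "start": "2024-04-04", "end": "2024-04-10"}')
print(render_sql(jnm))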
5.2 Virtual Column
Although the phased process flow, which first breaks down NL2SQL
into NL2JSON, reduces the task’s complexity, the lack of SQL un-
derstanding cannot be simply compensated by templates in Apache
SuperSet. Particularly for columns like “DAU”, which require com-
putational relationships derived from other columns, the complexity
of these relationships significantly reduces the accuracy of the tem-
plates. Therefore, ChatBI introduces the concept of Virtual Columns,
as illustrated in Figure 2. For the “dau” column, it does not phys-
ically exist in the data table or view but merely stores the SQL
computation relationships needed to derive it. Virtual columns like
these can be accessed through their corresponding key (column
name) to find the value (computation rule). We represent virtual
columns using a map structure nested into the JSON, referred to
as JnM, ensuring that all queries hitting Virtual Columns directly
access the stored computation rules.
Although the computational rules for columns like “DAU” in
the BI domain are relatively standard, the NL2BI task possesses
its uniqueness. However, we believe that when applying LLMs to
specific tasks, they can be broken down into NL2JSON like ChatBI,
utilizing a map stored in the JSON to manage complex rules. This
approach not only reduces the difficulty of generating tasks for
LLMs, enhancing their accuracy, but also cleverly uses JnM to per-
ceive complex semantics and relationships. Relationships similar
to those in virtual columns can be generated by LLMs, thereby
facilitating caching to speed up computations in future calls.
In addition to “DAU”, here are two examples of Virtual Columns
in ChatBI. “staytime” represents the video playback duration in
seconds. A frequently asked column is the total playback dura-
tion in minutes, whose Virtual Column, stay_time_min, is stored in
the view’s schema with the calculation rule “sum(staytime) / 60”.
Another commonly queried column, the average playback dura-
tion per user, stay_time_per, is also stored in the view’s schema,
with the calculation rule “sum(staytime)/60/count(uid)”, where “uid”
represents the user ID.
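A sketch of how the JnM map could resolve these virtual columns at SQL-generation time follows; the map layout and the expand_metric helper are illustrative, while the two rules are taken verbatim from the examples above.

VIRTUAL_COLUMNS = {
    "stay_time_min": "sum(staytime) / 60",
    "stay_time_per": "sum(staytime) / 60 / count(uid)",
}

def expand_metric(metric: str) -> str:
    # Metrics that hit the map expand to their stored computation rules,
    # so the LLM never has to learn them; others default to a plain sum.
    rule = VIRTUAL_COLUMNS.get(metric)
    return f"{rule} AS `{metric}`" if rule else f"sum({metric}) AS `{metric}`"

print(expand_metric("stay_time_min"))  # -> sum(staytime) / 60 AS `stay_time_min`
print(expand_metric("sv_vv"))          # -> sum(sv_vv) AS `sv_vv`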
6 EXPERIMENTAL EVALUATION
6.1 Experimental Setup
Datasets. The Spider [33] and BIRD [22] datasets are commonly
utilized to evaluate the efficacy of NL2SQL parsers across diverse
databases. When used in the NL2BI evaluation context, however,
the two datasets adopt a SRD mode and overlook the significant
number of existing views. The presence of views can optimize most
cross-table queries within a single view. Therefore, we have devel-
oped two datasets from real BI scenario data, SRD dataset and
MRD dataset without golden SQLs, each provided with 4 views,
as shown in Figure 7. Each dataset contains 226 columns and 4
views and encompasses 153 real BI Chinese queries. To enhance the
difficulty of queries in the MRD dataset, the 153 queries consist of 51
three-round dialogues. Among these dialogues, 19 involve queries
that switch views, meaning the tables accessed in the three rounds
of dialogue differ. We consider these 19 dialogues to represent chal-
lenging tasks. BIRD and Spider provide a golden SQL for each query,
a process that is quite challenging to construct. Therefore, from the
perspective of actual production systems, both SRD dataset and
MRD dataset do not have golden SQLs.
Evaluation Metrics. Existing NL2SQL technologies typically em-
ploy exact string match and execution match metrics. The former
only compares whether results share the same string expression,
which may lead to many correct results being considered incor-
rect. Relying solely on execution results depends on the dataset’s
predefined golden SQL, which could also lead to misjudgments.
For instance, in a BI context, for the query: “Yesterday’s interaction
metrics for users of app1”, the golden SQL is as follows:
SELECT SUM(click_pv) AS total_click_pv,
       SUM(disp_pv) AS total_disp_pv,
       SUM(play_pv) AS total_play_pv,
       SUM(like_pv) AS total_like_pv,
       SUM(comment_pv) AS total_comment_pv,
       SUM(collection_pv) AS total_collection_pv,
       SUM(shareto_pv) AS total_shareto_pv,
       SUM(follow_pv) AS total_follow_pv
FROM table1
WHERE event_day = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
  AND appid = 'app1';

Table 3: LLM leaderboard [5] 2024-03. Avg means the Average performance, Lg means the Language, and Knowl means the Knowledge.

LLM              O/C Source      Avg    Lg     Knowl   Agent
GPT-4-Turbo      Close Source    62     54.9   66.3    82
Claude3-Opus     Close Source    60.5   44.6   68.9    81.1
GLM-4            Close Source    57.8   57.7   71      73.4
Qwen-MAX-0107    Close Source    55.8   59.4   55.4    60
Qwen-MAX-0403    Close Source    55.6   58.8   70.4    73.7
Qwen-72B         Open Source     54.5   59.6   67.8    54.6
These columns click_pv, disp_pv, play_pv, like_pv, comment_pv,
collection_pv, shareto_pv, and follow_pv are interaction metrics. Yet
any non-empty set containing one or several of click_pv, disp_pv,
play_pv, like_pv, comment_pv, collection_pv, shareto_pv, or fol-
low_pv can be considered correct. The SQL below is also deemed
correct:
SELECT SUM(click_pv) AS total_click_pv,
       SUM(disp_pv) AS total_disp_pv
FROM table1
WHERE event_day = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
  AND appid = 'app1';
Following previous work [13], we consider using the metric useful-
ness rather than correctness. So we use the metric useful execution
accuracy (UEX) which compares the execution output of the pre-
dicted SQL query and the intent of this query. SQL queries with
column errors and syntax errors are invariably considered incorrect.
The results of all executable queries are assessed by experienced
DBAs to determine whether they satisfy the user’s intent for that
specific query. We also provide the number of tokens used in the
prompts and responses for each method to assess the eco-
nomic cost. We use prompt tokens (PT) to represent the number of
tokens used in the prompts, and response tokens (RT) to represent
the number of tokens used in the responses.
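As an aside, PT and RT can be tallied with any tokenizer matched to the target LLM; the sketch below uses tiktoken for GPT-4 as an assumption, since the text does not name the tokenizer actually used.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    # Number of tokens the model would consume for this text.
    return len(enc.encode(text))

prompt_tokens = count_tokens("/* full prompt text here */")
response_tokens = count_tokens('{"view": "t6628", "metrics": ["sv_vv"]}')
print(prompt_tokens, response_tokens)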
Baselines
• DIN-SQL [24]. DIN-SQL deconstructs the NL2SQL task into
discrete subtasks, formulating distinct prompts for each
subtask to guide GPT-4 through the completion of each
component, thereby deriving the final SQL.
• MAC-SQL [28]. MAC-SQL utilizes multi-agent collabora-
tion for NL2SQL tasks and reaches an accuracy of 59.6% on BIRD.
• ChatBI. ChatBI employs all the modules introduced in this
paper, matching Multi-Round Dialogues and performing
Single View Selection through a phased processing flow.
Initially, a LLM outputs JSON, which is subsequently used
to generate the SQL corresponding to the query.
• ChatBI w/o PF + zero-shot. The primary difference between ChatBI w/o PF + zero-shot and ChatBI is that it does not employ a phased process flow. Similar to other NL2SQL methods, it directly generates the corresponding SQL using a LLM.

[Figure 7, condensed:]

SRD Dataset:
Q1: 近三天主版DAU及日环比和周同比波动情况? (What is the DAU of the main version in the last three days, including daily and weekly fluctuations?)
Q2: 近三天极速版DAU及日环比和周同比波动情况? (What is the DAU of the lite version in the last three days, including daily and weekly fluctuations?)
Q3: 查询昨天安卓新增用户有多少? (How many new Android users were there yesterday?)
...
Q153: 近七天男女活跃人数对比? (What is the gender comparison of active users over the past seven days?)

MRD Dataset:
Q1-1: 近三天主版DAU及日环比和周同比波动情况? (What is the DAU of the main version in the last three days, including daily and weekly fluctuations?)
Q1-2: 极速版呢? (What about the lite version?)
Q1-3: 新增用户的情况呢? (What about the new user additions?)
...
Q51-3: 近三个月的呢? (What about the last three months?)

Figure 7: Introduction to the SRD dataset and MRD dataset. The main version and the lite version correspond to different apps. DAU stands for Daily Active Users, and new users refer to users who are registering for the first time.

Table 4: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on SRD dataset.

Baseline                     UEX       PT        RT
DIN-SQL                      20.92%    1472242   53332
MAC-SQL                      52.29%    1272661   36952
ChatBI                       74.43%    841550    29336
ChatBI w/o PF + zero-shot    58.17%    586640    28886
6.2 Evaluations on SRD Dataset
It is important to note that all experiments utilize the “2023-03-15-preview” version of GPT-4 [6] according to Table 3, and that ChatBI uses Erniebot-4.0 [3] (also developed by Baidu Inc.) in actual production systems due to data privacy concerns. In Table 4, we report the performance of our method and baseline methods on the SRD dataset. It is evident that our method surpasses the other baseline methods while using fewer tokens in both prompt and response.
Table 5: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on SRD dataset of ChatBI w/o PF + zero-shot and ChatBI w/o PF + few-shot.

Baseline                     UEX       PT        RT
ChatBI w/o PF + zero-shot    58.17%    586640    28886
ChatBI w/o PF + few-shot     49.67%    735144    24126

ChatBI achieves higher UEX with fewer tokens on SRD dataset.
Compared to MAC-SQL, ChatBI improves the UEX metric by 42.34%
((74.43−52.29)/52.29), using 33.87% ((1272661−841550)/1272661) fewer tokens in the
prompt stage and 20.61% ((36952−29336)/36952) fewer tokens in the re-
sponse stage. Relative to DIN-SQL, ChatBI increases the UEX metric
by 255.78% ((74.43−20.92)/20.92), utilizing 42.84% ((1472242−841550)/1472242) fewer
tokens in the prompt and 44.99% ((53332−29336)/53332) fewer tokens in
the response. ChatBI and MAC-SQL both provide column values in
their prompts, which offers significant adaptability for BI scenarios
that require explicit column specifications, and it largely prevents
SQL execution errors caused by the hallucinations of LLMs. In
contrast, DIN-SQL does not provide column values in its prompts,
resulting in a higher likelihood of errors in the filter conditions of
the generated SQL. Furthermore, during the execution of DIN-SQL,
it has been observed that it fails to analyze queries in Chinese. This
issue arises because its prompts consist entirely of English without
any Chinese elements. In cases where DIN-SQL produces unana-
lyzable SQL for Chinese queries, we re-execute it until it outputs
SQL. Therefore, we believe that ChatBI achieves higher accuracy
with fewer tokens used in the NL2BI SRD task.
Table 6: Useful execution accuracy (UEX), prompt tokens (PT) and response tokens (RT) on MRD dataset.

Baseline                     UEX       PT        RT
DIN-SQL                      20.26%    1526638   65850
MAC-SQL                      35.29%    1685080   30095
ChatBI                       67.32%    799442    20766
ChatBI w/o PF + zero-shot    52.94%    592074    18723
The phased process flow assists LLMs in understanding com-
plex semantics, computations, and comparison relationships.
Although ChatBI uses 43.45% ((841550−586640)/586640) more tokens in
prompts compared to ChatBI w/o PF + zero-shot, it achieves a
27.95% ((74.43−58.17)/58.17) improvement in UEX. We find that this is
primarily because the phased process flow enables the LLM to by-
pass the need to comprehend complex semantics, computations,
and comparison relationships directly within SQL. Instead, it only
needs to understand how to map these complex relationships to
the outputs required in JSON. By reducing the complexity of the
task, the LLM can provide more accurate results, leading to an
enhancement in UEX. Although the phased processing flow does
not require the LLM to perceive SQL directly, it still necessitates
an understanding of the JSON data type and the various parameter
forms needed for subsequent SQL generation, hence more tokens
are used in the prompts.
Longer prompts do not necessarily lead to better results.
ChatBI w/o PF + zero-shot, which merely changes its output format
from JSON to SQL compared to ChatBI, prompted us to explore
whether increasing the number of shots could enable direct SQL
generation to achieve or surpass the performance of JSON gen-
eration. In Table 5, we examine the UEX, PT, and RT for ChatBI
w/o PF + zero-shot and ChatBI w/o PF + few-shot on the SRD
dataset. The table shows that with more shots provided, ChatBI
w/o PF + few-shot actually experienced a 14.61% ((58.17−49.67)/58.17) de-
crease in UEX compared to ChatBI w/o PF + zero-shot, which uses
25.31% ((735144−586640)/586640) more tokens in prompts. Similarly, previ-
ous studies [13] have also mentioned that longer prompts might
cause degradation in large language models. Furthermore, shots
that yield positive outcomes often require meticulous design [16].
Therefore, we do not extensively discuss the impact of shots in this
paper.
6.3 Evaluations on MRD Dataset
Since DIN-SQL [24] and MAC-SQL [28] are not designed for Multi-
Round Dialogues, we have made the following modifications in
the prompts. The MRD Dataset consists of 51 sets of three-round
queries, namely Query1, Query2, and Query3. The prompt for
Query1 is as follows:
/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider the problem step by step.
The current question is {Query1}.
*/

The prompt for Query2 is shown below:

/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider these problems step by step.
The last question is {Query1}, and the current question is {Query2}.
*/
And the prompt for Query3 is as follows:

/*
As a data analysis expert, you are to extract the necessary information from the data provided and output the corresponding SQL query based on the user's question.
Let us consider these problems step by step.
The last two questions are {Query1} and {Query2}, and the current question is {Query3}.
*/
In actual scenarios, for a better interactive experience, LLMs are
not told whether a query belongs to a MRD or a SRD, unless users
are explicitly required to inform the LLM that the question should
be considered in conjunction with the previous ones. In the
experiments, ChatBI is not informed about which turn
of a MRD the dialogue belongs to. And this is entirely determined
by the Multi-Round Dialogue Matching module.
ChatBI achieves higher UEX with fewer tokens on MRD
dataset. Compared to MAC-SQL, ChatBI improves the UEX metric
by 90.76% ((67.32−35.29)/35.29), using 52.56% ((1685080−799442)/1685080) fewer to-
kens in the prompt stage and 31.00% ((30095−20766)/30095) fewer tokens in
the response stage. Relative to DIN-SQL, ChatBI increases the UEX
metric by 232.28% ((67.32−20.26)/20.26), utilizing 47.63% ((1526638−799442)/1526638)
fewer tokens in the prompt and 74.09% ((65850−20766)/65850) fewer to-
kens in the response. As demonstrated in the evaluation of the
SRD dataset, the presence of field hallucinations still results in DIN-
SQL’s SQL prediction accuracy being inferior to that of MAC-SQL
and ChatBI. And ChatBI continues to achieve a higher UEX with
fewer tokens in the MRD dataset.
The Multi-Round Dialogue Matching module is better equipped
to handle MRD scenarios. The most significant change in the
MRD dataset compared to the SRD dataset is that the questions are
part of MRD, which introduces follow-up queries. For ChatBI, its
(67.32−35.29)
UEX metric in the MRD dataset decreased by 9.55% (
)
35.29
compared to the SRD dataset, while MAC-SQL decreased by 32.51%
(52.29−35.29)
(
). This indicates that, compared to using LLMs to di-
52.29
rectly handle Multi-Round Dialogues, ChatBI’s proposed Multi-
Round Dialogue Matching module is more effective in managing
MRD scenarios.
The Phased Process Flow and Virtual Columns are more adept at handling complex semantics, computations, and comparative relationships. In the SRD dataset, ChatBI shows a 27.95% ((74.43 − 58.17)/58.17) improvement in UEX compared to ChatBI w/o PF + zero-shot. Similarly, in the MRD dataset, ChatBI shows a 27.16% ((67.32 − 52.94)/52.94) increase in UEX compared to ChatBI w/o PF + zero-shot. We find that this is primarily because ChatBI achieves higher accuracy on SQL tasks involving complex semantics, computations, and comparisons in both datasets. Therefore, the Phased Process Flow and Virtual Columns prove more effective in managing these challenging relationships.
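To make the role of Virtual Columns concrete, the sketch below maps business metric names to reusable SQL expressions so that a generated query only needs to select a metric rather than re-derive its formula. The mapping table, placeholder syntax, and helper function are hypothetical illustrations in the spirit of Virtual Columns, not the paper's actual implementation.

# Hypothetical illustration of Virtual Columns: metrics are pre-mapped
# to SQL expressions, so generated SQL references the metric name
# instead of re-deriving the computation each time.
VIRTUAL_COLUMNS = {
    # "new users" resolves to a distinct count over the new-user flag
    "new_users": "COUNT(DISTINCT CASE WHEN is_video_new = 1 THEN uid END)",
    "dau": "COUNT(DISTINCT uid)",
}

def expand_virtual_columns(sql):
    """Replace placeholders like {new_users} with their SQL expressions."""
    for name, expr in VIRTUAL_COLUMNS.items():
        sql = sql.replace("{" + name + "}", expr)
    return sql

template = "SELECT event_day, {new_users} FROM t6628 GROUP BY event_day"
print(expand_virtual_columns(template))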
6.4 Discussion
The limitations of the datasets primarily stem from the evaluation phase. As introduced, both the SRD dataset and the MRD dataset contain only 153 queries, composed by Database Administrators (DBAs) for Business Intelligence scenarios over four views. Compared to Spider [33]'s 10181 questions and BIRD [22]'s 12751 questions, the SRD and MRD datasets contain very few queries. This is mainly because, in actual NL2BI tasks, most questions do not have what is referred to as a "golden SQL". For example, a user might ask, "How many new users were added in the last three days?" (a real user question at Baidu Inc.). The SQL below returns the number of users added on each day over the past three days:
SELECT event_day, COUNT(DISTINCT uid)
FROM t6628
WHERE is_video_new = 1
  AND event_day BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 3 DAY)
                    AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY event_day
ORDER BY event_day ASC;
The following SQL instead returns the total number of users added over the same period:
SELECT COUNT(DISTINCT uid)
FROM t6628
WHERE is_video_new = 1
  AND event_day BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 3 DAY)
                    AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY);
Here, is_video_new indicates whether a user is new, with is_video_new = 1 denoting a new user and is_video_new = 0 denoting an existing user; uid is the user ID, and event_day is the date. In the BI scenario, it can be assumed that today's data is not yet prepared. Thus, similar to previous work [13], we use useful execution accuracy (UEX) rather than execution accuracy (EX) for evaluation, as both SQLs are considered correct in our context, whereas in other datasets only one result would be deemed correct. Expanding the dataset is therefore hampered by the difficulty of accurately assessing correctness, which remains a significant challenge.
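A minimal sketch of a UEX-style check is shown below, under the assumption that each question is paired with a set of acceptable golden SQLs rather than a single one; the in-memory database and function names are illustrative.

# Illustrative UEX-style check: a prediction counts as correct if its
# result matches ANY acceptable interpretation, unlike plain EX, which
# compares against a single golden SQL. sqlite3 is used only to keep
# the demo self-contained; all names here are hypothetical.
import sqlite3

def run(conn, sql):
    return sorted(map(tuple, conn.execute(sql).fetchall()))

def uex_correct(conn, predicted_sql, golden_sqls):
    predicted = run(conn, predicted_sql)
    return any(predicted == run(conn, g) for g in golden_sqls)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t6628 (event_day TEXT, uid INT, is_video_new INT)")
conn.executemany("INSERT INTO t6628 VALUES (?, ?, ?)",
                 [("2024-05-01", 1, 1), ("2024-05-02", 2, 1),
                  ("2024-05-02", 3, 0)])
golden_sqls = [
    "SELECT COUNT(DISTINCT uid) FROM t6628 WHERE is_video_new = 1",
    "SELECT event_day, COUNT(DISTINCT uid) FROM t6628 "
    "WHERE is_video_new = 1 GROUP BY event_day",
]
pred = "SELECT COUNT(DISTINCT uid) FROM t6628 WHERE is_video_new = 1"
print(uex_correct(conn, pred, golden_sqls))  # True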
7 RELATED WORK
Prompting-Based NL2SQL. LLMs such as GPT-3 [6], LLaMA [4], Codex [8], PaLM [23], and Claude 3 [2] are breaking traditional practices in natural language processing (NLP) communities, providing new possibilities for solving NL2SQL tasks. Inspired by chain-of-thought [30], an increasing number of studies focus on designing effective prompting techniques to better utilize LLMs for various subtasks. For example, in the NL2SQL scenario, DIN-SQL [24], C3 [12], and SQL-PaLM [25] provide prompting techniques for GPT-4, GPT-3.5, and PaLM respectively, greatly enhancing the accuracy of LLMs on NL2SQL tasks. The GPT paper mentions that different data structure forms in prompts, such as using the character # versus not using it, can make a significant difference. It is therefore necessary to focus on effective and novel data structures for prompting techniques.
Pre-trained and Supervised Fine-Tuning-Based NL2SQL. In the domain of NLP, prior to the emergence of LLMs, the predominant methodology [9–11, 15, 17–19, 27, 29, 31, 32] for addressing the NL2SQL task involved fine-tuning an "encoder-decoder" model. Leveraging the sequence-to-sequence paradigm, this approach models NL2SQL as a transformation from natural language to the corresponding SQL output. Furthermore, the advent of pre-training techniques enabled supervised fine-tuning on specific datasets, thereby augmenting the encoder's capabilities and yielding a more effective resolution of the NL2SQL task. Typically, this process requires a considerable volume of data pertaining to databases and SQL. Unlike methodologies built on large language models and prompting strategies, the pre-trained models in this approach cannot be guided by chains of thought at inference time, rendering them somewhat constrained.
LLM Designed Specifically for NL2SQL. The recent study CodeS [21] has developed an LLM tailored specifically for the NL2SQL task, with parameter sizes ranging from 1B to 15B, yielding commendable outcomes. Nonetheless, it is imperative to take into account the costs associated with both training and data acquisition. In comparison to Supervised Fine-Tuning (SFT), this approach necessitates a more extensive dataset. Such methodologies exhibit a propensity towards specialization, diverging from the inherent versatility of LLMs (i.e., their adaptability to a multitude of subtasks). CodeS still relies on high-quality training data, which is often difficult to obtain. In the absence of sufficient training data for the NL2BI task, the data augmentation and few-shot inference approaches proposed by CodeS have not been validated in this scenario. Consequently, this particular type of LLM falls outside our scope of consideration.
8 CONCLUSION
In this paper, we describe and evaluate ChatBI, a technology that converts Natural Language into complex SQL for Business Intelligence (BI) scenarios. First, ChatBI employs smaller and cheaper models than LLMs to match the Multi-Round Dialogue scenarios encountered by users in practice. Second, ChatBI leverages view technology from the database community to transform the challenging schema linking and metric ambiguity issues in BI scenarios into a Single View Selection problem, again using smaller and cheaper models to address it. Finally, ChatBI introduces a phased process flow that decomposes the NL2BI task, and proposes Virtual Column mapping to handle complex semantics, computations, and comparative relationships. Based on this design, ChatBI effectively understands these complex relationships within the NL2BI task. In actual production systems, ChatBI has enabled many non-expert users to perform expert-level BI analysis.
ACKNOWLEDGMENTS
The authors are grateful to the Baidu MEG’s MEDD team, and in
particular, to Jiawei Liu, Chaoxian Gui, Yuchao Jiang, and many
fellow team members.
REFERENCES
[1] [n.d.]. Apache Superset. https://superset.apache.org/.
[2] [n.d.]. Claude3. https://www.anthropic.com/claude.
[3] [n.d.]. Erinebot. https://yiyan.baidu.com/.
[4] [n.d.]. LLama. https://llama.meta.com/.
[5] [n.d.]. LLM Ranks. https://rank.opencompass.org.cn/leaderboard-llm.
[6] [n.d.]. OpenAI GPT. https://openai.com/pricing.
[7] [n.d.]. OpenAI pricing. https://openai.com/pricing.
[8] [n.d.]. OpenAICodex. https://openai.com/blog/openai-codex.
[9] Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. 2021. Sadga: Structure-aware dual graph aggregation network for text-to-sql. Advances in Neural Information Processing Systems 34 (2021), 7664–7676.
[10] Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. arXiv preprint arXiv:2106.01093 (2021).
[11] DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2021. Ryansql: Recursively applying sketch-based slot fillings for complex text-to-sql in cross-domain databases. Computational Linguistics 47, 2 (2021), 309–332.
[12] Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. 2023. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306 (2023).
[13] Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, S Deep, G Hagleither, W Tan, J Cahoon, R Alotaibi, J Henkel, A Singla, et al. 2024. Nl2sql is a solved problem... not!. In 14th Annual Conference on Innovative Data Systems Research (CIDR'24).
[14] Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, and Jianling Sun. 2023. Catsql: Towards real world natural language to sql applications. Proceedings of the VLDB Endowment 16, 6 (2023), 1534–1547.
[15] Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver, John R Woodward, John Drake, and Qiaofu Zhang. 2021. Natural SQL: Making SQL easier to infer from natural language specifications. arXiv preprint arXiv:2109.05153 (2021).
[16] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023).
[17] Alex Graves. 2012. Long short-term memory. Supervised sequence labelling with recurrent neural networks (2012), 37–45.
[18] Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349 (2020).
[19] Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on wikisql with table-aware word contextualization. arXiv preprint arXiv:1902.01069 (2019).
[20] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
[21] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL. arXiv preprint arXiv:2402.16347 (2024).
[22] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2024).
[23] Sharan Narang and Aakanksha Chowdhery. 2022. Pathways language model (palm): Scaling to 540 billion parameters for breakthrough performance. Google AI Blog (2022).
[24] Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems 36 (2024).
[25] Ruoxi Sun, Sercan O Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. Sql-palm: Improved large language model adaptation for text-to-sql. arXiv preprint arXiv:2306.00739 (2023).
[26] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019).
[27] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
[28] Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023. Mac-sql: Multi-agent collaboration for text-to-sql. arXiv preprint arXiv:2312.11242 (2023).
[29] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942 (2019).
[30] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
[31] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314 (2020).
[32] Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Grappa: Grammar-augmented pre-training for table semantic parsing. arXiv preprint arXiv:2009.13845 (2020).
[33] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887 (2018).
[34] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022).