Fig. 2 Overview of the QDA-SQL process (method of dataset construction)
QDA-SQL (Questions Enhanced Dialogue
Augmentation for Multi-turn Text-to-SQL) uses Chain-of-Thought (CoT) reasoning to generate multi-turn Q&A
pairs step by step. It randomly combines context relationships with question types and then applies a refinement
process, guiding Gemini Pro to create diverse datasets while ensuring that each sample fits one of our defined
question types. Compared with the original dataset, the samples produced by QDA-SQL showed higher
natural-language annotation quality (a 62% win rate) and included more complex text-to-SQL examples.
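The generation loop can be pictured roughly as in the sketch below. This is a minimal illustration, not the actual QDA-SQL implementation: the relation and question-type labels, the `llm` callable, and the prompt wording are placeholders for the prompts and verification criteria defined in the paper.

```python
import random

# Placeholder label sets; the real context relations and question types are
# those defined for QDA-SQL/MMSQL (e.g. answerable, ambiguous, unanswerable).
CONTEXT_RELATIONS = ["refine previous constraint", "shift topic", "drill down"]
QUESTION_TYPES = ["answerable", "ambiguous", "unanswerable"]

def generate_turn(llm, history, goal_sql, schema):
    """One augmentation step: sample a (relation, type) pair, ask the LLM for a
    CoT-guided question/SQL/answer triple, then refine and verify the draft."""
    relation = random.choice(CONTEXT_RELATIONS)
    q_type = random.choice(QUESTION_TYPES)
    draft = llm(
        f"Schema:\n{schema}\nGoal SQL:\n{goal_sql}\nDialogue so far:\n{history}\n"
        f"Think step by step, then write a '{q_type}' follow-up question related to "
        f"the last turn by '{relation}', with its SQL (if any) and an answer."
    )
    # Refinement pass: check the sample really matches q_type and the SQL runs.
    return llm(f"Verify and, if needed, fix this sample so it is truly '{q_type}' "
               f"and its SQL is valid for the schema:\n{draft}")
```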
In our dataset generation process, we used samples from SPARC and CoSQL as GoalSQL, transforming them into multi-turn, multi-type datasets in our format. The resulting datasets are available for preview and download.
Evaluation
Our evaluation metrics include Exact Matching (EM) and Execution Accuracy (EX). EM compares each predicted clause to the reference query and counts a query as correct only if all components match, excluding values. EX measures the proportion of predicted SQL queries whose execution results are identical to those of the ground-truth SQL. Interaction Execution Accuracy (IEX) is achieved only when every SQL query in a multi-turn interaction executes to the correct result. We also developed the Dual Assessment of Question Type Detection and Execution Accuracy (TDEX) to assess the comprehensive capability of text-to-SQL models on complex queries. Additionally, the Response Quality Score (RQS) measures the quality of the model's natural-language responses.
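For reference, EX and IEX can be computed roughly as in the sketch below. This is a simplified illustration assuming SQLite databases and lists of (predicted SQL, gold SQL) pairs; the function names are ours, and it glosses over the value-matching details handled by standard evaluation scripts.

```python
from collections import Counter
import sqlite3

def run(db_path, sql):
    """Execute a query; return its rows as a multiset, or None if it fails."""
    try:
        with sqlite3.connect(db_path) as conn:
            return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def match(db_path, pred_sql, gold_sql):
    """A turn is correct when both queries run and return the same rows."""
    gold = run(db_path, gold_sql)
    return gold is not None and run(db_path, pred_sql) == gold

def ex(turns, db_path):
    """EX: fraction of (pred_sql, gold_sql) turns whose results match."""
    return sum(match(db_path, p, g) for p, g in turns) / len(turns)

def iex(interactions, db_path):
    """IEX: fraction of interactions in which every turn's result matches."""
    return sum(all(match(db_path, p, g) for p, g in turns)
               for turns in interactions) / len(interactions)
```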
TDEX is a comprehensive metric that jointly evaluates question-type classification and SQL execution accuracy. For a set of \( N \) questions, let \( C_i \) denote the expected classification and \( \hat{C}_i \) the predicted classification for the \( i \)-th question, and let \( S_i \) denote the ground-truth SQL query and \( \hat{S}_i \) the predicted SQL query. TDEX is computed as:
\[
\text{TDEX} = \frac{1}{N} \sum_{i=1}^{N}
\begin{cases}
\varepsilon_{\text{exec}}(S_i, \hat{S}_i) & \text{if } C_i = \text{'Answerable' or 'Ambiguous'} \\
\delta_{\text{type}}(C_i, \hat{C}_i) & \text{otherwise}
\end{cases}
\]
where \( \varepsilon_{\text{exec}}(S_i, \hat{S}_i) = 1 \) if the execution result of \( \hat{S}_i \) matches that of \( S_i \), and \( 0 \) otherwise; \( \delta_{\text{type}}(C_i, \hat{C}_i) = 1 \) if \( \hat{C}_i \) matches \( C_i \), and \( 0 \) otherwise.
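The definition translates directly into code. The sketch below reuses the hypothetical `match` helper from the EX example and assumes each record carries the gold and predicted question type and SQL; it illustrates the formula rather than reproducing the released evaluation script.

```python
def tdex(records, db_path):
    """TDEX: execution match for 'Answerable'/'Ambiguous' questions,
    type-classification match for all other question types."""
    total = 0
    for r in records:  # r: dict with gold_type, pred_type, gold_sql, pred_sql
        if r["gold_type"] in ("Answerable", "Ambiguous"):
            total += match(db_path, r["pred_sql"], r["gold_sql"])
        else:
            total += (r["pred_type"] == r["gold_type"])
    return total / len(records)
```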
To evaluate the quality of a model's responses, we use an LLM-assisted rating method that scores Utility, Accuracy, Completeness, Clarity, and Relevance, each on a 0-to-2 scale, for a maximum total score of 10.
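A minimal sketch of the aggregation step, assuming the judge LLM returns an integer score per criterion (the judging prompt itself is not reproduced here):

```python
CRITERIA = ["Utility", "Accuracy", "Completeness", "Clarity", "Relevance"]

def response_quality_score(judged_responses):
    """RQS: sum the five 0-2 criterion scores per response (max 10),
    then average over all judged responses."""
    totals = [sum(min(max(scores[c], 0), 2) for c in CRITERIA)
              for scores in judged_responses]  # scores: dict criterion -> int
    return sum(totals) / len(totals)
```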
| Model | TDEX | IEX | EX | EM | RQS |
| --- | --- | --- | --- | --- | --- |
| GPT-4 Turbo | 67.0 | 30.2 | 70.0 | 51.0 | 5.80 |
| GPT-3.5 Turbo | 64.1 | 25.5 | 69.6 | 47.3 | 4.74 |
| Gemini-1.5 Flash | 65.8 | 30.1 | 70.0 | 52.3 | 4.03 |
| Llama3-70B | 62.8 | 22.8 | 66.4 | 47.4 | 3.86 |
| Llama3-8B | 64.0 | 20.1 | 66.1 | 45.7 | 4.55 |
| SQLCoder-8B | 30.7 | 24.8 | 63.2 | 31.0 | 3.43 |
| Codellama-7B | 30.7 | 3.4 | 27.2 | 21.7 | 5.09 |
| Mistral-7B-v0.2 | 26.4 | 0.7 | 13.7 | 11.2 | 4.37 |

Table 2 Performance of the models on the MMSQL test set.
As shown in Table 2, we evaluated several large language models (LLMs) on their zero-shot ability to handle multi-turn conversational text-to-SQL tasks using the MMSQL test set. Our main findings are as follows:
- Closed-source models currently have a slight performance edge, but open-source models are rapidly catching up, showing strong potential.
- All models struggle with unanswerable and ambiguous questions, marking areas that need significant improvement.
- Incorporating clarification processes in multi-turn interactions allows models to formulate more precise SQL queries, enhancing the usability and effectiveness of text-to-SQL systems.