Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types

Harbin University of Science and Technology, Harbin Institute of Technology
Under Review 2025

*Any questions? 📫 orlosziming@outlook.com

Abstract

Recent advancements in large language models (LLMs) have boosted text-to-SQL systems, but many still fall short when handling real-world conversations, which often contain questions that SQL alone cannot answer. To address this, we created MMSQL, a test suite that evaluates how well LLMs handle multiple question types and multi-turn interactions. We tested popular LLMs and identified the factors that affect their performance, and we developed a multi-agent framework that identifies question types and selects appropriate response strategies. Our experiments show that this framework enhances the models' ability to navigate the complexities of conversational dynamics.

Motivation

fig.1

Fig. 1 A four-turn dialogue example that includes four different types of Q&A pairs.

In the realm of text-to-SQL systems, which empower non-technical users to query databases using natural language, the advent of large language models (LLMs) has led to significant advancements. However, a critical challenge remains: handling the dynamic and uncertain nature of real-world user queries. These queries often unfold over multiple turns and can be ambiguous or unanswerable. For instance, users may ask questions that lack the necessary data in the database or are open to multiple interpretations, as illustrated in Figure 1. Additionally, some queries may be entirely unrelated to the database content, requiring the system to discern when not to generate SQL responses. Current models frequently assume all queries are clear and answerable, a limitation that can lead to hallucinated content and unreliable predictions. Addressing these issues is essential for developing more robust and dependable text-to-SQL systems.

Dataset

Table 1

Table 1 Comparison of multi-turn or multi-type text-to-SQL datasets.

We used QDA-SQL to generate a training set covering four question types: "answerable", "unanswerable", "improper", and "ambiguous". We manually annotated the test set, as shown in Table 1. Our MMSQL dataset, with its complex multi-turn and multi-type characteristics, reflects more realistic scenarios.

fig.2

Fig. 2 Overview of the QDA-SQL process (our dataset construction method).

QDA-SQL (Questions Enhanced Dialogue Augmentation for Multi-turn Text-to-SQL) uses Chain-of-Thought (CoT) reasoning to generate multi-turn Q&A pairs step by step. It randomly combines context relationships with question types and then applies a refinement process, guiding Gemini Pro to create diverse samples and ensuring each sample fits our defined question types. Compared with the original dataset, the samples produced by QDA-SQL showed higher natural language annotation quality, with a 62% win rate, and included more complex text-to-SQL examples.
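To make the augmentation loop concrete, here is a minimal sketch of one QDA-SQL-style generation pass, assuming a generic text-in/text-out callable in place of Gemini Pro. The context-relation labels, prompt wording, and the `parse_turn` helper are illustrative assumptions, not the paper's exact implementation.

```python
import random

# The four MMSQL question types; the context-relation labels below are
# illustrative placeholders, not the paper's exact taxonomy.
QUESTION_TYPES = ["answerable", "unanswerable", "ambiguous", "improper"]
CONTEXT_RELATIONS = ["refine_constraint", "explore_related_topic", "drill_down", "switch_entity"]

def build_turn_prompt(goal_sql, schema, history, relation, q_type):
    """Compose a chain-of-thought prompt asking for the next Q&A pair."""
    history_text = "\n".join(f"Q: {t['question']}\nA: {t['answer']}" for t in history)
    return (
        f"Database schema:\n{schema}\n\n"
        f"Goal SQL:\n{goal_sql}\n\n"
        f"Dialogue so far:\n{history_text or '(empty)'}\n\n"
        f"Think step by step, then write ONE new user question of type '{q_type}' "
        f"that relates to the context via '{relation}', followed by the assistant's "
        "answer (SQL plus a natural-language response when the type allows it)."
    )

def qda_sql_dialogue(goal_sql, schema, llm, parse_turn, max_turns=4):
    """Generate one multi-turn sample step by step.

    `llm` is any text-in/text-out callable (Gemini Pro in the paper) and
    `parse_turn` converts its raw output into a {'question': ..., 'answer': ...} dict.
    """
    dialogue = []
    for _ in range(max_turns):
        relation = random.choice(CONTEXT_RELATIONS)
        q_type = random.choice(QUESTION_TYPES)
        draft = parse_turn(llm(build_turn_prompt(goal_sql, schema, dialogue, relation, q_type)))

        # Refinement pass: ask the model to verify the draft against the requested type.
        verdict = llm(
            f"Requested question type: {q_type}\nDraft turn: {draft}\n"
            "Does the question really belong to this type? Reply 'yes' or explain why not."
        )
        if verdict.strip().lower().startswith("yes"):
            dialogue.append(draft)
    return dialogue
```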

In our dataset generation process, we used samples from SPARC and CoSQL as GoalSQL, transforming them into multi-turn, multi-type datasets in our format. You can preview or download the following datasets:

Dataset Download Link
MMSQL Test Set (238 KB) Preview | Download
MMSQL Training Set (16.5 MB) Preview | Download

Evaluation

Our evaluation metrics include Exact Matching (EM) and Execution Accuracy (EX). EM compares each predicted clause to the reference query and counts a prediction as correct only if all components match, excluding values. EX measures the proportion of queries whose predicted SQL produces the same execution result as the ground-truth SQL. Interaction Execution Accuracy (IEX) is achieved only when all SQL queries in a multi-turn interaction execute correctly. We also developed the Dual Assessment of Question Type Detection and Execution Accuracy (TDEX) to assess the comprehensive capability of text-to-SQL models on complex queries. Additionally, the Response Quality Score (RQS) measures the quality of the model's natural language responses.
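Before defining TDEX formally, the sketch below shows one way EX and IEX can be computed by executing predicted and ground-truth SQL against a SQLite copy of each database. The function names and the (db_path, pred_sql, gold_sql) data layout are our own assumptions, not the official evaluation script.

```python
import sqlite3
from collections import Counter

def execution_match(db_path, pred_sql, gold_sql):
    """EX for one query: 1 if the predicted SQL executes and returns the same
    result set as the ground-truth SQL (order-insensitive), else 0."""
    try:
        with sqlite3.connect(db_path) as conn:
            pred_rows = conn.execute(pred_sql).fetchall()
            gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return 0
    return int(Counter(pred_rows) == Counter(gold_rows))

def interaction_execution_accuracy(interactions):
    """IEX: fraction of interactions in which EVERY SQL query is execution-correct.

    `interactions` is assumed to be a list of dialogues, each a list of
    (db_path, pred_sql, gold_sql) triples for its SQL-bearing turns.
    """
    correct = sum(
        all(execution_match(db, pred, gold) for db, pred, gold in turns)
        for turns in interactions
    )
    return correct / len(interactions)
```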

TDEX is a comprehensive metric that evaluates the accuracy of user query type classification and execution accuracy. For a set of \( N \) questions, where \( C_i \) denotes the expected classification and \( \hat{C}_i \) represents the predicted classification for the \( i \)-th question, \( S_i \) denotes the ground truth SQL query and \( \hat{S}_i \) represents the predicted SQL query for the \( i \)-th question, TDEX is computed as:

\[ \text{TDEX} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \varepsilon_{\text{exec}}(S_i, \hat{S}_i) & \text{if } C_i = \text{'Answerable'} ~\text{or}~ \text{'Ambiguous'} \\ \delta_{\text{type}}(C_i, \hat{C}_i) & \text{otherwise} \end{cases} \]

where \( \varepsilon_{\text{exec}} = 1 \) if the execution result of \( \hat{S}_i \) matches the execution result of \( S_i \), and \( \varepsilon_{\text{exec}} = 0 \) otherwise; \( \delta_{\text{type}} = 1 \) if \( \hat{C}_i \) matches \( C_i \), and \( \delta_{\text{type}} = 0 \) otherwise.
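A minimal sketch of this definition follows, assuming each example carries its gold/predicted question type and SQL under hypothetical key names, and reusing an execution-accuracy checker such as the one sketched earlier.

```python
def tdex(examples, exec_match):
    """TDEX over N questions, following the piecewise definition above.

    Each example is assumed to be a dict with keys 'gold_type', 'pred_type',
    'gold_sql', 'pred_sql', and 'db_path'; `exec_match(db_path, pred_sql, gold_sql) -> 0/1`
    is an execution-accuracy checker. The key names are illustrative assumptions.
    """
    total = 0
    for ex in examples:
        if ex["gold_type"] in ("answerable", "ambiguous"):
            # Case (a): SQL-bearing questions are scored by execution accuracy.
            total += exec_match(ex["db_path"], ex["pred_sql"], ex["gold_sql"])
        else:
            # Case (b): otherwise the model is scored on question-type detection.
            total += int(ex["pred_type"] == ex["gold_type"])
    return total / len(examples)
```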

To evaluate the quality of a model's responses, we use an LLM-assisted rating method assessing Utility, Accuracy, Completeness, Clarity, and Relevance on a 0-to-2 scale, with a maximum score of 10.
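A minimal sketch of this scoring scheme is shown below, assuming a judge LLM that returns JSON; the prompt wording and output parsing are illustrative, not the paper's exact rubric.

```python
import json

DIMENSIONS = ["Utility", "Accuracy", "Completeness", "Clarity", "Relevance"]

def response_quality_score(question, reference_answer, model_response, judge_llm):
    """RQS for one response: a judge LLM rates five dimensions on a 0-2 scale,
    and the scores are summed for a maximum of 10."""
    prompt = (
        "Rate the assistant response on each dimension with an integer 0, 1, or 2.\n"
        f"Dimensions: {', '.join(DIMENSIONS)}\n"
        f"User question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Assistant response: {model_response}\n"
        'Reply only with JSON such as {"Utility": 2, "Accuracy": 1, ...}.'
    )
    scores = json.loads(judge_llm(prompt))
    # Clamp each score to [0, 2] before summing, in case the judge misbehaves.
    return sum(min(2, max(0, int(scores[d]))) for d in DIMENSIONS)
```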

Model TDEX IEX EX EM RQS
GPT-4 Turbo 67.0 30.2 70.0 51.0 5.80
GPT-3.5 Turbo 64.1 25.5 69.6 47.3 4.74
Gemini-1.5 Flash 65.8 30.1 70.0 52.3 4.03
Llama3-70B 62.8 22.8 66.4 47.4 3.86
Llama3-8B 64.0 20.1 66.1 45.7 4.55
SQLCoder-8B 30.7 24.8 63.2 31.0 3.43
Codellama-7B 30.7 3.4 27.2 21.7 5.09
Mistral-7B-v0.2 26.4 0.7 13.7 11.2 4.37

Table 2 Performance of the models on MMSQL test set.

As shown in Table 2, we evaluated several LLMs on their zero-shot capabilities in handling multi-turn conversational text-to-SQL tasks using the MMSQL test set. Our main findings are as follows:

  • Closed-source models currently have a slight performance edge, but open-source models are rapidly catching up, showing strong potential.
  • All models struggle with unanswerable and ambiguous questions, marking areas that need significant improvement.
  • Incorporating clarification processes in multi-turn interactions allows models to formulate more precise SQL queries, enhancing the usability and effectiveness of text-to-SQL systems.

Multi-Agent Framework

fig.3

Fig. 3 The overview of our multi-agent framework comprises four components: (i) the Schema Selector, which narrows the database schema to the tables relevant to the question, reducing noise from irrelevant data; (ii) the Question Detector, which determines the question type and, if the question is potentially ambiguous, reformulates it, possibly generating multiple candidate rewrites; (iii) the Question Decomposer, which breaks complex questions down into simpler, manageable sub-questions and applies chain-of-thought reasoning to solve them progressively; and (iv) the SQL Refiner, which uses an external tool to execute SQL, gathers feedback, and then refines faulty SQL queries.

We have developed a multi-agent collaborative framework designed to improve the performance of Large Language Models (LLMs) in text-to-SQL parsing tasks. This framework leverages intelligent agents with specialized functionalities, working together to enhance the accuracy and efficiency of converting natural language queries into SQL commands.
Our framework consists of four key components (a minimal orchestration sketch follows the list):

  • Schema Selector: This agent identifies the most relevant subset of the database schema needed to answer a given question, reducing noise and potential confusion caused by irrelevant data.

  • Question Detector: It determines the type of question and decides on the appropriate response strategy, effectively handling ambiguous or unanswerable queries by providing clarifications or alternative formulations.

  • Question Decomposer: This component breaks down complex questions into simpler, manageable sub-questions, applying chain-of-thought reasoning to progressively solve each part of the query.

  • SQL Refiner: It refines and verifies the SQL queries generated, utilizing external tools to execute the SQL statements, gather feedback, and iteratively correct any errors to ensure the accuracy of the final output.
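As noted above, here is a minimal orchestration sketch of a single conversational turn through the four agents. The agent signatures, dictionary keys, and canned responses are illustrative assumptions rather than the paper's exact interfaces.

```python
import sqlite3

def execute_on(db_path, sql):
    """Run `sql` against a SQLite copy of the database; return (ok, result_or_error)."""
    try:
        with sqlite3.connect(db_path) as conn:
            return True, conn.execute(sql).fetchmany(5)
    except sqlite3.Error as err:
        return False, str(err)

def answer_turn(question, history, db_path, agents, max_repairs=3):
    """One conversational turn through the four agents.

    `agents` is assumed to be a dict of callables:
      'selector'(db_path, question)            -> pruned schema text
      'detector'(question, history, schema)    -> (question_type, candidate_rewrites)
      'decomposer'(question, schema, history)  -> draft SQL via sub-questions and CoT
      'refiner'(sql, feedback, schema)         -> repaired SQL
    These signatures are illustrative, not the paper's exact interfaces.
    """
    schema = agents["selector"](db_path, question)                      # Schema Selector
    q_type, rewrites = agents["detector"](question, history, schema)    # Question Detector

    if q_type == "improper":
        return {"type": q_type, "response": "This question is unrelated to the database."}
    if q_type == "unanswerable":
        return {"type": q_type, "response": "The database does not contain the required information."}
    if q_type == "ambiguous" and len(rewrites) > 1:
        # Surface the candidate interpretations and ask the user to clarify.
        return {"type": q_type, "response": "Did you mean: " + " / ".join(rewrites)}

    target = rewrites[0] if rewrites else question
    sql = agents["decomposer"](target, schema, history)                 # Question Decomposer

    # SQL Refiner: execute, gather feedback, and iteratively repair faulty SQL.
    for _ in range(max_repairs):
        ok, feedback = execute_on(db_path, sql)
        if ok:
            break
        sql = agents["refiner"](sql, feedback, schema)
    return {"type": q_type, "sql": sql}
```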
Table 3

Table 3 Performance metrics comparison on the MMSQL test set, demonstrating the improvements achieved by integrating the multi-agent framework with different models. Specific improvements are indicated in parentheses.

Table 4

Table 4 Ablation study examining the impact of each component within the multi-agent framework on the model's performance on the MMSQL test set.

Our experiments demonstrate that integrating this multi-agent framework with existing LLMs significantly improves their performance in text-to-SQL tasks. By collaboratively employing specialized agents, the framework enhances the capabilities of LLMs in handling complex, ambiguous, and multi-turn queries in text-to-SQL parsing. Each component contributes uniquely to the overall performance, ensuring that natural language queries are accurately and efficiently translated into SQL commands.

Full-Text

BibTeX

@misc{guo2024evaluatingenhancingllmsmultiturn,
      title={Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types},
      author={Ziming Guo and Chao Ma and Yinggang Sun and Tiancheng Zhao and Guangyao Wang and Hai Huang},
      year={2024},
      eprint={2412.17867},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.17867},
}

This work is licensed under CC BY 4.0. You are free to use the code and dataset; please be sure to cite this work in your paper or repository.