{"ID":2842211,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10192","arxiv_id":"2511.10192","title":"Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL","abstract":"The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance. To address these challenges, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. Our framework spans six augmentation dimensions and integrates an end-to-end pipeline with auxiliary database selection, SQL executability verification, natural language (NL) question generation, NL-SQL correspondence verification, and chain-of-thought (CoT) reasoning trace generation. Leveraging this framework, we construct SQLFlow, a high-quality dataset comprising 75,386 annotated examples. We demonstrate the utility of SQLFlow in both fine-tuning and prompt-based settings. (1) For open-source large language models (LLMs), fine-tuning with SQLFlow improves problem-solving ability, delivering competitive gains across multiple benchmarks under the same data budget. (2) For closed-source LLMs, we propose a masked alignment retrieval method that uses SQLFlow as both a knowledge base and training data for the retrieval model, enabling structure-aware example matching via fine-grained NL-SQL alignments. Experiments show that our retrieval strategy outperforms existing example retrieval methods, highlighting the combined value of SQLFlow's data quality and our retrieval technique. Overall, our work provides a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the importance of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/TechNomad-ds/Text2SQL-Flow.","short_abstract":"The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance....","url_abs":"https://arxiv.org/abs/2511.10192","url_pdf":"https://arxiv.org/pdf/2511.10192v4","authors":"[\"Qifeng Cai\",\"Hao Liang\",\"Chang Xu\",\"Tao Xie\",\"Wentao Zhang\",\"Bin Cui\"]","published":"2025-11-13T11:02:15Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.DB\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607113,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2842211,"paper_url":"https://arxiv.org/abs/2511.10192","paper_title":"Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL","repo_url":"https://github.com/TechNomad-ds/Text2SQL-Flow","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}