{"ID":2860078,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.05318","arxiv_id":"2510.05318","title":"BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions","abstract":"Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.","short_abstract":"Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating...","url_abs":"https://arxiv.org/abs/2510.05318","url_pdf":"https://arxiv.org/pdf/2510.05318v3","authors":"[\"Nan Huo\",\"Xiaohan Xu\",\"Jinyang Li\",\"Per Jacobsson\",\"Shipei Lin\",\"Bowen Qin\",\"Binyuan Hui\",\"Xiaolong Li\",\"Ge Qu\",\"Shuzheng Si\",\"Linheng Han\",\"Edward Alexander\",\"Xintong Zhu\",\"Rui Qin\",\"Ruihan Yu\",\"Yiyao Jin\",\"Feige Zhou\",\"Weihao Zhong\",\"Yun Chen\",\"Hongyu Liu\",\"Chenhao Ma\",\"Fatma Ozcan\",\"Yannis Papakonstantinou\",\"Reynold Cheng\"]","published":"2025-10-06T19:31:47Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
