{"ID":2847836,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26840","arxiv_id":"2510.26840","title":"SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification","abstract":"Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.","short_abstract":"Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of...","url_abs":"https://arxiv.org/abs/2510.26840","url_pdf":"https://arxiv.org/pdf/2510.26840v2","authors":"[\"Rocky Klopfenstein\",\"Yang He\",\"Andrew Tremante\",\"Yuepeng Wang\",\"Nina Narodytska\",\"Haoze Wu\"]","published":"2025-10-30T02:29:54Z","proceeding":"cs.DB","tasks":"[\"cs.DB\",\"cs.AI\",\"cs.FL\",\"cs.LO\"]","methods":"[]","has_code":false}
