{"ID":2856023,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13859","arxiv_id":"2510.13859","title":"Benchmarking Correctness and Security in Multi-Turn Code Generation","abstract":"AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct this using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models, and three agent-scaffolding on MT-Sec and observe a consistent 20-27% drop in \"correct and secure\" outputs from single-turn to multi-turn settings -- even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation -- an unexplored yet practically relevant setting -- and find that models perform worse here, with increased rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, they are not quite as effective in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.","short_abstract":"AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real...","url_abs":"https://arxiv.org/abs/2510.13859","url_pdf":"https://arxiv.org/pdf/2510.13859v1","authors":"[\"Ruchit Rawal\",\"Jeffrey Yang Fan Chiang\",\"Chihao Shen\",\"Jeffery Siyuan Tian\",\"Aastha Mahajan\",\"Tom Goldstein\",\"Yizheng Chen\"]","published":"2025-10-13T01:20:46Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}