{"ID":2865622,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22908","arxiv_id":"2509.22908","title":"A benchmark for vericoding: formally verified program synthesis","abstract":"We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark","short_abstract":"We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus...","url_abs":"https://arxiv.org/abs/2509.22908","url_pdf":"https://arxiv.org/pdf/2509.22908v1","authors":"[\"Sergiu Bursuc\",\"Theodore Ehrenborg\",\"Shaowei Lin\",\"Lacramioara Astefanoaei\",\"Ionel Emilian Chiosa\",\"Jure Kukovec\",\"Alok Singh\",\"Oliver Butterley\",\"Adem Bizid\",\"Quinn Dougherty\",\"Miranda Zhao\",\"Max Tan\",\"Max Tegmark\"]","published":"2025-09-26T20:36:20Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.LG\",\"cs.PL\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":609291,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2865622,"paper_url":"https://arxiv.org/abs/2509.22908","paper_title":"A benchmark for vericoding: formally verified program synthesis","repo_url":"https://github.com/Beneficial-AI-Foundation/vericoding-benchmark","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}