{"ID":2864534,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23061","arxiv_id":"2509.23061","title":"Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification","abstract":"We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.","short_abstract":"We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across compon...","url_abs":"https://arxiv.org/abs/2509.23061","url_pdf":"https://arxiv.org/pdf/2509.23061v1","authors":"[\"Xu Xu\",\"Xin Li\",\"Xingwei Qu\",\"Jie Fu\",\"Binhang Yuan\"]","published":"2025-09-27T02:33:08Z","proceeding":"cs.PL","tasks":"[\"cs.PL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}