{"ID":2826813,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18470","arxiv_id":"2512.18470","title":"SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios","abstract":"Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.","short_abstract":"Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple itera...","url_abs":"https://arxiv.org/abs/2512.18470","url_pdf":"https://arxiv.org/pdf/2512.18470v6","authors":"[\"Tue Le\",\"Minh V. T. Thai\",\"Dung Nguyen Manh\",\"Huy Phan Nhat\",\"Nghi D. Q. Bui\"]","published":"2025-12-20T19:08:15Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.MA\"]","methods":"[]","has_code":false}
