{"ID":2868069,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16941","arxiv_id":"2509.16941","title":"SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?","abstract":"We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.","short_abstract":"We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively...","url_abs":"https://arxiv.org/abs/2509.16941","url_pdf":"https://arxiv.org/pdf/2509.16941v2","authors":"[\"Xiang Deng\",\"Jeff Da\",\"Edwin Pan\",\"Yannis Yiming He\",\"Charles Ide\",\"Kanak Garg\",\"Niklas Lauffer\",\"Andrew Park\",\"Nitin Pasari\",\"Chetan Rane\",\"Karmini Sampath\",\"Maya Krishnan\",\"Srivatsa Kundurthy\",\"Sean Hendryx\",\"Zifan Wang\",\"Vijay Bharadwaj\",\"Jeff Holm\",\"Raja Aluri\",\"Chen Bo Calvin Zhang\",\"Noah Jacobson\",\"Bing Liu\",\"Brad Kenstler\"]","published":"2025-09-21T06:28:17Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.CL\"]","methods":"[]","has_code":false}
