{"ID":2854540,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14509","arxiv_id":"2510.14509","title":"E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task","abstract":"The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.","short_abstract":"The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework ca...","url_abs":"https://arxiv.org/abs/2510.14509","url_pdf":"https://arxiv.org/pdf/2510.14509v4","authors":"[\"Jingyao Liu\",\"Chen Huang\",\"Zhizhao Guan\",\"Wenqiang Lei\",\"Yang Deng\"]","published":"2025-10-16T09:54:26Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608165,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2854540,"paper_url":"https://arxiv.org/abs/2510.14509","paper_title":"E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task","repo_url":"https://github.com/SCUNLP/E2EDev","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
