{"ID":3083922,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:32:54.120957816Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05920","arxiv_id":"2606.05920","title":"Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement","abstract":"Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.","short_abstract":"Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchma...","url_abs":"https://arxiv.org/abs/2606.05920","url_pdf":"https://arxiv.org/pdf/2606.05920v1","authors":"[\"Xin Wang\",\"Liangtai Sun\",\"Yaoming Zhu\",\"Shuang Zhou\",\"Jiaxing Liu\",\"Fengjiao Chen\",\"Lin Qiu\",\"Xuezhi Cao\",\"Xunliang Cai\",\"Licheng Zhang\",\"Zhendong Mao\"]","published":"2026-06-04T09:24:30Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
