{"ID":2863895,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25370","arxiv_id":"2509.25370","title":"Where LLM Agents Fail and How They can Learn From Failures","abstract":"Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. \nThe code and data will be available at https://github.com/ulab-uiuc/AgentDebug","short_abstract":"Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, lead...","url_abs":"https://arxiv.org/abs/2509.25370","url_pdf":"https://arxiv.org/pdf/2509.25370v1","authors":"[\"Kunlun Zhu\",\"Zijia Liu\",\"Bingxuan Li\",\"Muxin Tian\",\"Yingxuan Yang\",\"Jiaxun Zhang\",\"Pengrui Han\",\"Qipeng Xie\",\"Fuyang Cui\",\"Weijia Zhang\",\"Xiaoteng Ma\",\"Xiaodong Yu\",\"Gowtham Ramesh\",\"Jialian Wu\",\"Zicheng Liu\",\"Pan Lu\",\"James Zou\",\"Jiaxuan You\"]","published":"2025-09-29T18:20:27Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609082,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2863895,"paper_url":"https://arxiv.org/abs/2509.25370","paper_title":"Where LLM Agents Fail and How They can Learn From Failures","repo_url":"https://github.com/ulab-uiuc/AgentDebug","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
