{"ID":3083856,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05806","arxiv_id":"2606.05806","title":"When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents","abstract":"Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized \"happy paths\", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \\times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37\\% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.","short_abstract":"Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized \"happy paths\", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a...","url_abs":"https://arxiv.org/abs/2606.05806","url_pdf":"https://arxiv.org/pdf/2606.05806v1","authors":"[\"Dongsheng Zhu\",\"Xuchen Ma\",\"Yucheng Shen\",\"Xiang Li\",\"Yukun Zhao\",\"Shuaiqiang Wang\",\"Lingyong Yan\",\"Dawei Yin\"]","published":"2026-06-04T07:38:46Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":612840,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_id":3083856,"paper_url":"https://arxiv.org/abs/2606.05806","paper_title":"When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents","repo_url":"https://github.com/Zhudongsheng75/ToolMaze","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
