{"ID":2836170,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07850","arxiv_id":"2512.07850","title":"SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents","abstract":"Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: \\emph{do all actions contribute equally to failure?} Analyzing execution traces on $τ$-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into \\emph{mutating} (environment-changing) vs.\\ non-mutating steps and formalize \\emph{decisive deviations}: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to $92\\%$ on Airline and up to $96\\%$ on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce SABER, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects \\emph{Targeted Reflection} before mutating steps, and (iii) performs block-based context cleaning. SABER delivers consistent gains, e.g., Qwen3-Thinking: +28\\% \\emph{relative} on Airline, +11\\% on Retail, and +7\\% on SWE-Bench Verified; Claude: +9\\%/+7\\%. We further identify ceiling effects in $τ$-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release $τ$-Bench Verified, which restores benchmark headroom through targeted revisions. 
Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.","short_abstract":"Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: \\emph{do all actions contribute equally to failure?} Analyzing execution traces on $τ$-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajector...","url_abs":"https://arxiv.org/abs/2512.07850","url_pdf":"https://arxiv.org/pdf/2512.07850v1","authors":"[\"Alejandro Cuadron\",\"Pengfei Yu\",\"Yang Liu\",\"Arpit Gupta\"]","published":"2025-11-26T01:28:22Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}