{"ID":2882290,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.10548","arxiv_id":"2508.10548","title":"Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards","abstract":"Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6\\% \\rightarrow 93.8\\% and 22.0\\% \\rightarrow 86.0\\%) and modification rates (19.6\\% \\rightarrow 23.8\\% and 12.0\\% \\rightarrow 42.0\\%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.","short_abstract":"Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwis...","url_abs":"https://arxiv.org/abs/2508.10548","url_pdf":"https://arxiv.org/pdf/2508.10548v1","authors":"[\"Zetian Sun\",\"Dongfang Li\",\"Zhuoen Chen\",\"Yuhuai Qin\",\"Baotian Hu\"]","published":"2025-08-14T11:37:02Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
