{"ID":2824710,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23075","arxiv_id":"2512.23075","title":"Trust Region Masking for Long-Horizon LLM Reinforcement Learning","abstract":"Policy gradient methods for Large Language Models optimize a policy $π_θ$ via a surrogate objective computed from samples of a rollout policy $π_{\\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($π_{\\text{roll}} \\neq π_θ$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\\mathrm{KL}}^{\\mathrm{tok,max}}$ (or $D_{\\mathrm{TV}}^{\\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.","short_abstract":"Policy gradient methods for Large Language Models optimize a policy $π_θ$ via a surrogate objective computed from samples of a rollout policy $π_{\\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and d...","url_abs":"https://arxiv.org/abs/2512.23075","url_pdf":"https://arxiv.org/pdf/2512.23075v4","authors":"[\"Yingru Li\",\"Jiacai Liu\",\"Jiawei Xu\",\"Yuxuan Tong\",\"Ziniu Li\",\"Qian Liu\",\"Baoxiang Wang\"]","published":"2025-12-28T20:41:59Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.IT\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}
