{"ID":2826934,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.06052","arxiv_id":"2601.06052","title":"Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization","abstract":"Chain-of-thought reasoning in large language models can trigger an \"overthinking trap\": longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% with comparable or higher accuracy and generalizes across domains: a model trained on math spontaneously shortens unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectories by 67% tokens and 52% rounds and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity, but an inherent computation policy -- what to keep, and what to forget.","short_abstract":"Chain-of-thought reasoning in large language models can trigger an \"overthinking trap\": longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning c...","url_abs":"https://arxiv.org/abs/2601.06052","url_pdf":"https://arxiv.org/pdf/2601.06052v2","authors":"[\"Hanyu Li\",\"Jiangshan Duo\",\"Bofei Gao\",\"Hailin Zhang\",\"Sujian Li\",\"Xiaotie Deng\",\"Liang Zhao\"]","published":"2025-12-19T06:30:54Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
