{"ID":2921038,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-04T07:41:34.29888543Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01934","arxiv_id":"2606.01934","title":"HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression","abstract":"Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%--46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.","short_abstract":"Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scala...","url_abs":"https://arxiv.org/abs/2606.01934","url_pdf":"https://arxiv.org/pdf/2606.01934v1","authors":"[\"Minghui Zheng\",\"Hongxu Chen\",\"Huimin Ren\",\"Hongsheng Xin\",\"Xiaoyang Qu\",\"Ze Wang\",\"Shuling Yang\",\"Ziyu Peng\",\"Kaike Zhang\",\"Pan Zhou\",\"Kun Zhan\"]","published":"2026-06-01T09:01:55Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}