{"ID":2864666,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23232","arxiv_id":"2509.23232","title":"SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts","abstract":"Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods-such as parallelization, objective- and data-driven modifications, and replay buffers-either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL","short_abstract":"Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods-such as parallelization, objective- and...","url_abs":"https://arxiv.org/abs/2509.23232","url_pdf":"https://arxiv.org/pdf/2509.23232v3","authors":"[\"Bingshuai Liu\",\"Ante Wang\",\"Zijun Min\",\"Liang Yao\",\"Haibo Zhang\",\"Yang Liu\",\"Xu Han\",\"Peng Li\",\"Anxiang Zeng\",\"Jinsong Su\"]","published":"2025-09-27T10:32:34Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609182,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864666,"paper_url":"https://arxiv.org/abs/2509.23232","paper_title":"SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts","repo_url":"https://github.com/ShopeeLLM/Spec-RL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
