Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning

cs.LG arXiv:2511.18871

Abstract

Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention for LLM post-training, yet training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are co-located on the same devices, and their synchronous execution prevents the two from running concurrently. In this work, we revisit the strategy of deploying inference and training separately, and propose a periodically asynchronous framework that transforms synchronous RL training into an asynchronous producer-consumer pipeline. By synchronising model weights at the beginning of each training iteration and generating all rollouts from the same policy, the proposed framework remains inherently on-policy, without any modification to standard RL algorithms, thereby avoiding the off-policy bias introduced by existing asynchronous approaches. We further introduce a unified tri-model architecture and a shared-prompt attention mechanism to support efficient asynchronous execution and reduce redundant computation. Experiments on NPU platforms show an approximately 2x throughput improvement from asynchronous execution, with additional gains from system-level optimisations, so that the framework substantially outperforms mainstream RL frameworks in end-to-end throughput. Speedups of up to 3x on GPU platforms further confirm cross-architecture generalisability, and accuracy remains comparable. The proposed framework thus offers a practical, algorithm-agnostic solution for scalable RL post-training without sacrificing on-policy correctness. Code available at: https://github.com/janelu9/EasyLLM
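
The periodic producer-consumer idea described in the abstract can be illustrated with a minimal Python sketch. It is not the EasyLLM implementation: the names `generate_rollouts`, `train_iteration`, and `run` are hypothetical, the "policy" is reduced to an integer version number, and the training step is a placeholder. It assumes the weight update lands only after every rollout of an iteration has been consumed, which is what keeps each iteration on-policy.

```python
import queue
import threading


def generate_rollouts(prompts, policy_version, out_queue):
    """Producer: hypothetical stand-in for the inference side."""
    for prompt in prompts:
        # Every rollout of this iteration is tagged with the same policy version.
        out_queue.put({"prompt": prompt, "policy_version": policy_version})
    out_queue.put(None)  # sentinel: no more rollouts in this iteration


def train_iteration(prompts, policy_version):
    """One iteration: generation and consumption overlap via a queue."""
    rollouts = queue.Queue()
    producer = threading.Thread(
        target=generate_rollouts, args=(prompts, policy_version, rollouts)
    )
    producer.start()

    consumed = []
    while True:
        item = rollouts.get()  # blocks until the producer delivers something
        if item is None:
            break
        # Placeholder for the training-side work on one rollout.
        consumed.append(item)
    producer.join()

    # On-policy check: every rollout came from the policy this iteration started with.
    assert all(r["policy_version"] == policy_version for r in consumed)

    # Placeholder for the weight update; the new version is what the
    # producer receives at the start of the next iteration.
    return policy_version + 1, consumed


def run(num_iterations, prompts):
    """Run several iterations and record which policy produced each rollout."""
    version = 0
    history = []
    for _ in range(num_iterations):
        version, consumed = train_iteration(prompts, version)
        history.append([r["policy_version"] for r in consumed])
    return version, history
```

Calling `run(3, ["a", "b"])` records, per iteration, the policy version behind each consumed rollout. Within an iteration the versions are identical, and they advance only at the iteration boundary, which is the periodic synchronisation point.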
