{"ID":2832310,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06547","arxiv_id":"2512.06547","title":"A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation","abstract":"Decoupled PPO has been a successful reinforcement learning (RL) algorithm to deal with the high data staleness under the asynchronous RL setting. Decoupled loss used in decoupled PPO improves coupled-loss style of algorithms' (e.g., standard PPO, GRPO) learning stability by introducing a proximal policy to decouple the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, creating a computational overhead for large language models training. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, accelerating training by 1.8x speedup while maintaining comparable performance. Code and an off-the-shelf example are contributed to the open-source RL training system AReaL at: https://github.com/inclusionAI/AReaL/blob/v1.0.0.rc1/docs/algorithms/prox_approx.md","short_abstract":"Decoupled PPO has been a successful reinforcement learning (RL) algorithm to deal with the high data staleness under the asynchronous RL setting. 
Decoupled loss used in decoupled PPO improves coupled-loss style of algorithms' (e.g., standard PPO, GRPO) learning stability by introducing a proximal policy to decouple the...","url_abs":"https://arxiv.org/abs/2512.06547","url_pdf":"https://arxiv.org/pdf/2512.06547v3","authors":"[\"Xiaocan Li\",\"Shiliang Wu\",\"Zheng Shen\"]","published":"2025-12-06T19:37:39Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.DC\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606219,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832310,"paper_url":"https://arxiv.org/abs/2512.06547","paper_title":"A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation","repo_url":"https://github.com/inclusionAI/AReaL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
