{"ID":2895973,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07451","arxiv_id":"2507.07451","title":"RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning","abstract":"Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.","short_abstract":"Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified traj...","url_abs":"https://arxiv.org/abs/2507.07451","url_pdf":"https://arxiv.org/pdf/2507.07451v1","authors":"[\"Hongzhi Zhang\",\"Jia Fu\",\"Jingyuan Zhang\",\"Kai Fu\",\"Qi Wang\",\"Fuzheng Zhang\",\"Guorui Zhou\"]","published":"2025-07-10T05:58:55Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":612231,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2895973,"paper_url":"https://arxiv.org/abs/2507.07451","paper_title":"RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning","repo_url":"https://github.com/Kwai-Klear/RLEP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
