{"ID":2890739,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18071","arxiv_id":"2507.18071","title":"Group Sequence Policy Optimization","abstract":"This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.","short_abstract":"This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs seq...","url_abs":"https://arxiv.org/abs/2507.18071","url_pdf":"https://arxiv.org/pdf/2507.18071v2","authors":"[\"Chujie Zheng\",\"Shixuan Liu\",\"Mingze Li\",\"Xiong-Hui Chen\",\"Bowen Yu\",\"Chang Gao\",\"Kai Dang\",\"Yuqiong Liu\",\"Rui Men\",\"An Yang\",\"Jingren Zhou\",\"Junyang Lin\"]","published":"2025-07-24T03:50:32Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}