{"ID":2848450,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00066","arxiv_id":"2511.00066","title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","abstract":"Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.","short_abstract":"Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO...","url_abs":"https://arxiv.org/abs/2511.00066","url_pdf":"https://arxiv.org/pdf/2511.00066v4","authors":"[\"Tue Le\",\"Linh Ngo Van\",\"Trung Le\"]","published":"2025-10-29T08:07:47Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}