{"ID":2845366,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.04439","arxiv_id":"2511.04439","title":"CoRPO: Adding a Correctness Bias to GRPO Improves Generalization","abstract":"Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable rewards (RLVR). However, we identify a fundamental limitation: GRPO's mean baseline can assign positive advantages to incorrect solutions simply because they outperform a poorly-performing group average. It leads to overestimation of advantages and reinforcement of incorrect behaviours. To address this, we propose Correctness-Relative Policy Optimization (CoRPO), a simple modification to the GRPO objective that clips the minimum baseline to a fixed correctness threshold. We show that baseline clipping introduces a protective bias to advantage estimation that mitigates overfitting while preserving effective exploration. Empirically, CoRPO-trained models improve cross-domain reasoning, generalizing more consistently to out-of-domain (OOD) tasks. When trained on coding tasks, CoRPO outperforms GRPO on math, and vice-versa, indicating that CoRPO learns robust, transferable reasoning patterns rather than task-specific solutions.","short_abstract":"Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable...","url_abs":"https://arxiv.org/abs/2511.04439","url_pdf":"https://arxiv.org/pdf/2511.04439v3","authors":"[\"Anisha Garg\",\"Claire Zhang\",\"Nishit Neema\",\"David Bick\",\"Ganesh Venkatesh\",\"Joel Hestness\"]","published":"2025-11-06T15:12:50Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\",\"LoRA\"]","has_code":false}