{"ID":2861570,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02172","arxiv_id":"2510.02172","title":"RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization","abstract":"Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.","short_abstract":"Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled d...","url_abs":"https://arxiv.org/abs/2510.02172","url_pdf":"https://arxiv.org/pdf/2510.02172v1","authors":"[\"Zhaoning Yu\",\"Will Su\",\"Leitian Tao\",\"Haozhu Wang\",\"Aashu Singh\",\"Hanchao Yu\",\"Jianyu Wang\",\"Hongyang Gao\",\"Weizhe Yuan\",\"Jason Weston\",\"Ping Yu\",\"Jing Xu\"]","published":"2025-10-02T16:24:01Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
