{"ID":2844751,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.04902","arxiv_id":"2511.04902","title":"You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models","abstract":"Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa","short_abstract":"Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains...","url_abs":"https://arxiv.org/abs/2511.04902","url_pdf":"https://arxiv.org/pdf/2511.04902v1","authors":"[\"Shuvendu Roy\",\"Hossein Hajimirsadeghi\",\"Mengyao Zhai\",\"Golnoosh Samei\"]","published":"2025-11-07T01:05:11Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607320,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2844751,"paper_url":"https://arxiv.org/abs/2511.04902","paper_title":"You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models","repo_url":"https://github.com/BorealisAI/CuMa","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
