{"ID":2880649,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.13755","arxiv_id":"2508.13755","title":"Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration","abstract":"Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: Depth (the difficulty of problems) and Breadth (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), a method that re-weights difficult problems by using targeted, multi-stage rollouts. DARS increases the number of rollout outcomes for these harder problems according to our proposed re-balancing schedules and leads to consistent gains in Pass@K. We discovered that increasing rollout size alone does not improve performance and may actually impair it. In contrast, scaling the batch size to increase breadth via full-batch updates significantly boosted Pass@1 metrics. This improvement stems from higher token-level entropy, ensuring robust exploration and minimized gradient noise. We further present DARS-Breadth, a combined approach that uses DARS with a large breadth of training data. This method demonstrates simultaneous gains in both Pass@K and Pass@1, confirming that depth (adaptive exploration) and breadth (scaling iteration instances) are orthogonal and complementary dimensions for unlocking the full power of RLVR.","short_abstract":"Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: Depth (the difficulty of problems) and Breadth (the number of training instances). Our analysis of the po...","url_abs":"https://arxiv.org/abs/2508.13755","url_pdf":"https://arxiv.org/pdf/2508.13755v8","authors":"[\"Zhicheng Yang\",\"Zhijiang Guo\",\"Yinya Huang\",\"Yongxin Wang\",\"Dongchun Xie\",\"Hanhui Li\",\"Yiwei Wang\",\"Xiaodan Liang\",\"Jing Tang\"]","published":"2025-08-19T11:51:40Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}