Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

cs.LG arXiv:2508.13755
View PDF arXiv JSON

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: Depth (the difficulty of problems) and Breadth (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), a method that re-weights difficult problems by using targeted, multi-stage rollouts. DARS increases the number of rollout outcomes for these harder problems according to our proposed re-balancing schedules and leads to consistent gains in Pass@K. We discovered that increasing rollout size alone does not improve performance and may actually impair it. In contrast, scaling the batch size to increase breadth via full-batch updates significantly boosted Pass@1 metrics. This improvement stems from higher token-level entropy, ensuring robust exploration and minimized gradient noise. We further present DARS-Breadth, a combined approach that uses DARS with a large breadth of training data. This method demonstrates simultaneous gains in both Pass@K and Pass@1, confirming that depth (adaptive exploration) and breadth (scaling iteration instances) are orthogonal and complementary dimensions for unlocking the full power of RLVR.

PDF Viewer