{"ID":2859985,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04996","arxiv_id":"2510.04996","title":"Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives","abstract":"Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to $2\\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.","short_abstract":"Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. \nWe demonstrate that this collapse is a statistical artifact of undersampling rath...","url_abs":"https://arxiv.org/abs/2510.04996","url_pdf":"https://arxiv.org/pdf/2510.04996v3","authors":"[\"Wei Xiong\",\"Chenlu Ye\",\"Baohao Liao\",\"Hanze Dong\",\"Xinxing Xu\",\"Christof Monz\",\"Jiang Bian\",\"Nan Jiang\",\"Tong Zhang\"]","published":"2025-10-06T16:34:09Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Language Model\",\"RLHF\"]","has_code":false,"code_links":[{"ID":608687,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2859985,"paper_url":"https://arxiv.org/abs/2510.04996","paper_title":"Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives","repo_url":"https://github.com/RLHFlow/Reinforce-Ada","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
