{"ID":2883482,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.07534","arxiv_id":"2508.07534","title":"From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR","abstract":"Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.","short_abstract":"Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). \nUnlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process criticall...","url_abs":"https://arxiv.org/abs/2508.07534","url_pdf":"https://arxiv.org/pdf/2508.07534v2","authors":"[\"Jia Deng\",\"Jie Chen\",\"Zhipeng Chen\",\"Daixuan Cheng\",\"Fei Bai\",\"Beichen Zhang\",\"Yinqian Min\",\"Yanzipeng Gao\",\"Wayne Xin Zhao\",\"Ji-Rong Wen\"]","published":"2025-08-11T01:26:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}
