{"ID":2899114,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.01551","arxiv_id":"2507.01551","title":"Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning","abstract":"Process Reinforcement Learning~(PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models~(LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose \\textbf{S}elf-Guided \\textbf{P}rocess \\textbf{R}eward \\textbf{O}ptimization~(\\textbf{SPRO}), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and \\textbf{M}asked \\textbf{S}tep \\textbf{A}dvantage (\\textbf{MSA}), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vaniila GRPO with 3.4x higher training efficiency and a 17.5\\% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately $1/3$, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefit industrial implementation.","short_abstract":"Process Reinforcement Learning~(PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models~(LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage...","url_abs":"https://arxiv.org/abs/2507.01551","url_pdf":"https://arxiv.org/pdf/2507.01551v2","authors":"[\"Wu Fei\",\"Hao Kong\",\"Shuxian Liang\",\"Yang Lin\",\"Yibo Yang\",\"Jing Tang\",\"Lei Chen\",\"Xiansheng Hua\"]","published":"2025-07-02T10:05:14Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}
