{"ID":2885741,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04349","arxiv_id":"2508.04349","title":"GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy","abstract":"Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit assignment paradigm, where all tokens within the same response receive the identical reward. In this paper, we propose Dynamic Entropy Weighting, systematically define entropy-based weight ratios $\\frac{H_{i,t}}{\\sum_{k=1}^{n} H_{k,t}}$ and similar variants to redistribute rewards and get fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token and synthesizes a token-specific advantage function to drive the model toward the optimal path, and the analogous algorithm Sequence-Level GRPO (GRPO-S), which extends this design to the sequence level and exhibits superior stability in long Chain-of-Thought (CoT) reasoning tasks.","short_abstract":"Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit assignment paradigm, where all tokens within the same response receive the identical reward. In this paper, we propose Dynamic Entropy...","url_abs":"https://arxiv.org/abs/2508.04349","url_pdf":"https://arxiv.org/pdf/2508.04349v6","authors":"[\"Hongze Tan\",\"Zihan Wang\",\"Jianfei Pan\",\"Jinghao Lin\",\"Hao Wang\",\"Yifan Wu\",\"Tao Chen\",\"Zhihang Zheng\",\"Zhihao Tang\",\"Haihua Yang\"]","published":"2025-08-06T11:42:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}