{"ID":2864986,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21826","arxiv_id":"2509.21826","title":"ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models","abstract":"Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose \\textbf{Res}haped \\textbf{T}oken-level policy gradients (\\textbf{ResT}) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to $8.76\\%$. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by $4.11\\%$ on single-turn tasks and $1.50\\%$ on multi-turn base tasks. Code is available at https://github.com/1229095296/ResT_Tool_use_LLM.git.","short_abstract":"Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideratio...","url_abs":"https://arxiv.org/abs/2509.21826","url_pdf":"https://arxiv.org/pdf/2509.21826v2","authors":"[\"Zihan Lin\",\"Xiaohan Wang\",\"Jie Cao\",\"Jiajun Chai\",\"Guojun Yin\",\"Wei Lin\",\"Ran He\"]","published":"2025-09-26T03:38:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609226,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864986,"paper_url":"https://arxiv.org/abs/2509.21826","paper_title":"ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models","repo_url":"https://github.com/1229095296/ResT_Tool_use_LLM.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
