{"ID":2854815,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14967","arxiv_id":"2510.14967","title":"Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents","abstract":"Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided exclusively upon generating the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward signals. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency. Our code is available at https://github.com/GuoqingWang1/IGPO.","short_abstract":"Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically re...","url_abs":"https://arxiv.org/abs/2510.14967","url_pdf":"https://arxiv.org/pdf/2510.14967v2","authors":"[\"Guoqing Wang\",\"Sunhao Dai\",\"Guangze Ye\",\"Zeyu Gan\",\"Wei Yao\",\"Yong Deng\",\"Xiaofeng Wu\",\"Zhenzhe Ying\"]","published":"2025-10-16T17:59:32Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608198,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2854815,"paper_url":"https://arxiv.org/abs/2510.14967","paper_title":"Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents","repo_url":"https://github.com/GuoqingWang1/IGPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
