{"ID":3083845,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:32:54.120957816Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05784","arxiv_id":"2606.05784","title":"TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents","abstract":"We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO).\nOur code and models will be publicly released upon acceptance.","short_abstract":"We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We fur...","url_abs":"https://arxiv.org/abs/2606.05784","url_pdf":"https://arxiv.org/pdf/2606.05784v1","authors":"[\"Chengqi Dong\",\"Chuhuai Yue\",\"Hang He\",\"yandong liu\",\"Fenghe Tang\",\"S Kevin Zhou\",\"Xiaohan Wang\",\"Jiajun Chai\",\"Guojun Yin\"]","published":"2026-06-04T07:15:43Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[]","has_code":false}
