{"ID":2870088,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14480","arxiv_id":"2509.14480","title":"Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents","abstract":"Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.","short_abstract":"Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (R...","url_abs":"https://arxiv.org/abs/2509.14480","url_pdf":"https://arxiv.org/pdf/2509.14480v1","authors":"[\"Weiting Tan\",\"Xinghua Qu\",\"Ming Tu\",\"Meng Ge\",\"Andy T. Liu\",\"Philipp Koehn\",\"Lu Lu\"]","published":"2025-09-17T23:25:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.MA\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}
