{"ID":2840564,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13288","arxiv_id":"2511.13288","title":"Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO","abstract":"Multi-agent systems perform well on general reasoning tasks. However, a lack of training in specialized areas limits their accuracy. Current training methods train a single, unified large language model (LLM) for all agents in the system. This may limit performance because different agents operate on different underlying distributions. Training multi-agent systems with distinct LLMs is therefore a natural next step. However, this approach introduces optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, which disrupts end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization (GRPO) designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both the main agent and the sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable numbers of sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents improves performance on tool-augmented reasoning tasks.","short_abstract":"Multi-agent systems perform well on general reasoning tasks. However, a lack of training in specialized areas limits their accuracy. Current training methods train a single, unified large language model (LLM) for all agents in the system. This may limit performance because different agents operate on different underlying distri...","url_abs":"https://arxiv.org/abs/2511.13288","url_pdf":"https://arxiv.org/pdf/2511.13288v2","authors":"[\"Haoyang Hong\",\"Jiajun Yin\",\"Yuan Wang\",\"Jingnan Liu\",\"Zhe Chen\",\"Ailing Yu\",\"Ji Li\",\"Zhiling Ye\",\"Hansong Xiao\",\"Yefei Chen\",\"Hualei Zhou\",\"Yun Yue\",\"Minghui Yang\",\"Chunxiao Guo\",\"Junwei Liu\",\"Peng Wei\",\"Jinjie Gu\"]","published":"2025-11-17T12:06:30Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
