{"ID":2877721,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20018","arxiv_id":"2508.20018","title":"SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control","abstract":"The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.","short_abstract":"The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally...","url_abs":"https://arxiv.org/abs/2508.20018","url_pdf":"https://arxiv.org/pdf/2508.20018v1","authors":"[\"Quanfeng Lu\",\"Zhantao Ma\",\"Shuai Zhong\",\"Jin Wang\",\"Dahai Yu\",\"Michael K. Ng\",\"Ping Luo\"]","published":"2025-08-27T16:27:19Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.CV\",\"cs.MA\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}