{"ID":2892816,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03700","arxiv_id":"2508.03700","title":"MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning","abstract":"This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.","short_abstract":"This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GU...","url_abs":"https://arxiv.org/abs/2508.03700","url_pdf":"https://arxiv.org/pdf/2508.03700v5","authors":"[\"Liujian Tang\",\"Shaokang Dong\",\"Yijia Huang\",\"Minqi Xiang\",\"Hongtao Ruan\",\"Bin Wang\",\"Shuo Li\",\"Zhiheng Xi\",\"Zhihui Cao\",\"Hailiang Pang\",\"Heng Kong\",\"He Yang\",\"Mingxu Chai\",\"Zhilin Gao\",\"Xingyu Liu\",\"Yingnan Fu\",\"Jiaming Liu\",\"Xuanjing Huang\",\"Yu-Gang Jiang\",\"Tao Gui\",\"Qi Zhang\",\"Kang Wang\",\"Yunke Zhang\",\"Yuran Wang\"]","published":"2025-07-19T12:33:43Z","proceeding":"cs.HC","tasks":"[\"cs.HC\",\"cs.AI\"]","methods":"[]","has_code":false}
