{"ID":2851035,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20286","arxiv_id":"2510.20286","title":"UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning","abstract":"GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. 
Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.","short_abstract":"GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of exist...","url_abs":"https://arxiv.org/abs/2510.20286","url_pdf":"https://arxiv.org/pdf/2510.20286v1","authors":"[\"Liangyu Chen\",\"Hanzhang Zhou\",\"Chenglin Cai\",\"Jianan Zhang\",\"Panrong Tong\",\"Quyu Kong\",\"Xu Zhang\",\"Chen Liu\",\"Yuqi Liu\",\"Wenxuan Wang\",\"Yue Wang\",\"Qin Jin\",\"Steven Hoi\"]","published":"2025-10-23T07:18:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false,"code_links":[{"ID":607861,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2851035,"paper_url":"https://arxiv.org/abs/2510.20286","paper_title":"UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning","repo_url":"https://github.com/alibaba/UI-Ins","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
