{"ID":2860203,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04039","arxiv_id":"2510.04039","title":"GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding","abstract":"Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).","short_abstract":"Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. 
However, practical usefulness is still bounded by the reliability of visual grounding, i...","url_abs":"https://arxiv.org/abs/2510.04039","url_pdf":"https://arxiv.org/pdf/2510.04039v1","authors":"[\"Bin Lei\",\"Nuo Xu\",\"Ali Payani\",\"Mingyi Hong\",\"Chunhua Liao\",\"Yu Cao\",\"Caiwen Ding\"]","published":"2025-10-05T05:15:45Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
