{"ID":2840440,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13087","arxiv_id":"2511.13087","title":"MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements","abstract":"Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.","short_abstract":"Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. \nExisting systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instru...","url_abs":"https://arxiv.org/abs/2511.13087","url_pdf":"https://arxiv.org/pdf/2511.13087v1","authors":"[\"SeokJoo Kwak\",\"Jihoon Kim\",\"Boyoun Kim\",\"Jung Jae Yoon\",\"Wooseok Jang\",\"Jeonghoon Hong\",\"Jaeho Yang\",\"Yeong-Dae Kwon\"]","published":"2025-11-17T07:38:05Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606968,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2840440,"paper_url":"https://arxiv.org/abs/2511.13087","paper_title":"MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements","repo_url":"https://github.com/samsungsds-research-papers/mega-gui","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}