{"ID":2827449,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16295","arxiv_id":"2512.16295","title":"OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models","abstract":"With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle that makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, and surpasses proprietary models on the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in OSWorld and AndroidWorld environments. \nThe code is open-sourced at https://github.com/numbmelon/OS-Oracle.","short_abstract":"With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can caus...","url_abs":"https://arxiv.org/abs/2512.16295","url_pdf":"https://arxiv.org/pdf/2512.16295v1","authors":"[\"Zhenyu Wu\",\"Jingjing Xie\",\"Zehao Li\",\"Bowen Yang\",\"Qiushi Sun\",\"Zhaoyang Liu\",\"Zhoumianze Liu\",\"Yu Qiao\",\"Xiangyu Yue\",\"Zun Wang\",\"Zichen Ding\"]","published":"2025-12-18T08:29:50Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":605809,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2827449,"paper_url":"https://arxiv.org/abs/2512.16295","paper_title":"OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models","repo_url":"https://github.com/numbmelon/OS-Oracle","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
