{"ID":2854606,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14621","arxiv_id":"2510.14621","title":"ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks","abstract":"The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined \"golden path\", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.","short_abstract":"The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts curr...","url_abs":"https://arxiv.org/abs/2510.14621","url_pdf":"https://arxiv.org/pdf/2510.14621v1","authors":"[\"Yuanyi Song\",\"Heyuan Huang\",\"Qiqiang Lin\",\"Yin Zhao\",\"Xiangmou Qu\",\"Jun Wang\",\"Xingyu Lou\",\"Weiwen Liu\",\"Zhuosheng Zhang\",\"Jun Wang\",\"Yong Yu\",\"Weinan Zhang\",\"Zhaoxiang Wang\"]","published":"2025-10-16T12:30:05Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608175,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2854606,"paper_url":"https://arxiv.org/abs/2510.14621","paper_title":"ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks","repo_url":"https://github.com/MadeAgents/ColorBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}