{"ID":2852933,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.17801","arxiv_id":"2510.17801","title":"Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain","abstract":"Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. 
Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.","short_abstract":"Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain,...","url_abs":"https://arxiv.org/abs/2510.17801","url_pdf":"https://arxiv.org/pdf/2510.17801v1","authors":"[\"Yulin Luo\",\"Chun-Kai Fan\",\"Menghang Dong\",\"Jiayu Shi\",\"Mengdi Zhao\",\"Bo-Wen Zhang\",\"Cheng Chi\",\"Jiaming Liu\",\"Gaole Dai\",\"Rongyu Zhang\",\"Ruichuan An\",\"Kun Wu\",\"Zhengping Che\",\"Shaoxuan Xie\",\"Guocai Yao\",\"Zhongxia Zhao\",\"Pengwei Wang\",\"Guang Liu\",\"Zhongyuan Wang\",\"Tiejun Huang\",\"Shanghang Zhang\"]","published":"2025-10-20T17:59:03Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
