{"ID":2855133,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13394","arxiv_id":"2510.13394","title":"Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models","abstract":"Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the \\emph{intrinsic-dynamic} spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, \\textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: \\textbf{I}ntrinsic-\\textbf{S}tatic, Intrinsic-\\textbf{D}ynamic, \\textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new \\textbf{Spatial-DISE} dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that, current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.","short_abstract":"Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the \\emph{intrinsic-dynamic}...","url_abs":"https://arxiv.org/abs/2510.13394","url_pdf":"https://arxiv.org/pdf/2510.13394v3","authors":"[\"Xinmiao Huang\",\"Qisong He\",\"Zhenglin Huang\",\"Boxuan Wang\",\"Zhuoyun Li\",\"Guangliang Cheng\",\"Yi Dong\",\"Xiaowei Huang\"]","published":"2025-10-15T10:44:01Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
