{"ID":2822898,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.01547","arxiv_id":"2601.01547","title":"Vision-language models lag human performance on physical dynamics and intent reasoning","abstract":"Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We introduce Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure. To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set. Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest proprietary model achieves 57.26% overall accuracy, far below first-pass human performance, which ranges from 84.81% to 95.14% with a mean of 90.62%. Fine-tuning on real-world, intent-aware data narrows this gap for open-weight models, but does not close it. EscherVerse provides a diagnostic testbed for purpose-aware spatial reasoning and highlights a critical gap between pattern recognition and human-level understanding in embodied AI.","short_abstract":"Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the l...","url_abs":"https://arxiv.org/abs/2601.01547","url_pdf":"https://arxiv.org/pdf/2601.01547v2","authors":"[\"Tianjun Gu\",\"Jingyu Gong\",\"Zhizhong Zhang\",\"Yuan Xie\",\"Lizhuang Ma\",\"Xin Tan\",\"Athanasios V\"]","published":"2026-01-04T14:42:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
