{"ID":3006015,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T17:52:58.968687531Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02742","arxiv_id":"2606.02742","title":"Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models","abstract":"Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. 
The code and data are available at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git","short_abstract":"Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs of...","url_abs":"https://arxiv.org/abs/2606.02742","url_pdf":"https://arxiv.org/pdf/2606.02742v1","authors":"[\"S Divakar Bhat\",\"Toshihiko Yamasaki\"]","published":"2026-06-01T18:06:08Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":612745,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-03T03:09:48.883664427Z","DeletedAt":null,"paper_id":3006015,"paper_url":"https://arxiv.org/abs/2606.02742","paper_title":"Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models","repo_url":"https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
