{"ID":2829864,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11574","arxiv_id":"2512.11574","title":"Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis","abstract":"Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream fine-tuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pre-trained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no fine-tuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images depicting objects at specific camera angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 7 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.","short_abstract":"Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream fine-tuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pre-tra...","url_abs":"https://arxiv.org/abs/2512.11574","url_pdf":"https://arxiv.org/pdf/2512.11574v2","authors":"[\"Valentina Lilova\",\"Toyesh Chakravorty\",\"Julian I. Bibo\",\"Emma Boccaletti\",\"Brandon Li\",\"Lívia Baxová\",\"Cees G. M. Snoek\",\"Mohammadreza Salehi\"]","published":"2025-12-12T14:03:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":605977,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2829864,"paper_url":"https://arxiv.org/abs/2512.11574","paper_title":"Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis","repo_url":"https://github.com/ToyeshC/open-hummingbird-3d-eval","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
