{"ID":2832940,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04686","arxiv_id":"2512.04686","title":"Towards Cross-View Point Correspondence in Vision-Language Models","abstract":"Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. So we propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of \"perceive\", \"reason\", and \"correspond\". Our evaluation shows the state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond that trained on the CrossPoint-378K dataset. Our CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.","short_abstract":"Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. \nSo we propose the Cross-View Po...","url_abs":"https://arxiv.org/abs/2512.04686","url_pdf":"https://arxiv.org/pdf/2512.04686v2","authors":"[\"Yipu Wang\",\"Yuheng Ji\",\"Yuyang Liu\",\"Enshen Zhou\",\"Ziqiang Yang\",\"Yuxuan Tian\",\"Ziheng Qin\",\"Yue Liu\",\"Huajie Tan\",\"Cheng Chi\",\"Zhiyuan Ma\",\"Daniel Dajun Zeng\",\"Xiaolong Zheng\"]","published":"2025-12-04T11:30:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606287,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832940,"paper_url":"https://arxiv.org/abs/2512.04686","paper_title":"Towards Cross-View Point Correspondence in Vision-Language Models","repo_url":"https://github.com/WangYipu2002/CrossPoint","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
