{"ID":2834156,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.03040","arxiv_id":"2512.03040","title":"Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation","abstract":"We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.","short_abstract":"We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks....","url_abs":"https://arxiv.org/abs/2512.03040","url_pdf":"https://arxiv.org/pdf/2512.03040v2","authors":"[\"Zeqi Xiao\",\"Yiwei Zhao\",\"Lingxiao Li\",\"Yushi Lan\",\"Ning Yu\",\"Rahul Garg\",\"Roshni Cooper\",\"Mohammad H. Taghavi\",\"Xingang Pan\"]","published":"2025-12-02T18:59:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Diffusion Model\"]","has_code":false}
