{"ID":2898172,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.08831","arxiv_id":"2507.08831","title":"View Invariant Learning for Vision-Language Navigation in Continuous Environments","abstract":"Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e. variations in camera height and viewing angle. Here we introduce a more general scenario, V$^2$-VLNCE (VLNCE with Varied Viewpoints) and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components. Empirical results show that our method outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15\\% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Evaluation of VIL in standard VLNCE settings shows that despite being trained for varied viewpoints, VIL often still improves performance. On the harder RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method. We further evaluate VIL for simulated camera placements derived from real robot configurations (e.g. Stretch RE-1, LoCoBot), showing consistent improvements of performance. Finally, we present a proof-of-concept real-robot evaluation in two physical environments using a panoramic RGB sensor combined with LiDAR. The code is available at https://github.com/realjoshqsun/V2-VLNCE.","short_abstract":"Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e. variations in camera height and viewing angle. Here we introduc...","url_abs":"https://arxiv.org/abs/2507.08831","url_pdf":"https://arxiv.org/pdf/2507.08831v4","authors":"[\"Josh Qixuan Sun\",\"Huaiyuan Weng\",\"Xiaoying Xing\",\"Chul Min Yeum\",\"Mark Crowley\"]","published":"2025-07-05T18:04:35Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.RO\"]","methods":"[]","has_code":false,"code_links":[{"ID":612395,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2898172,"paper_url":"https://arxiv.org/abs/2507.08831","paper_title":"View Invariant Learning for Vision-Language Navigation in Continuous Environments","repo_url":"https://github.com/realjoshqsun/V2-VLNCE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
