{"ID":2830141,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10360","arxiv_id":"2512.10360","title":"CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation","abstract":"Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models in VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation, we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment, we design a LiDAR-based clustering module for generating navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results (ranking 1-st) on the VLN-CE leaderboard, significantly improving SR and SPL on the test-unseen set over the previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.","short_abstract":"Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models in VLN tasks. To address this,...","url_abs":"https://arxiv.org/abs/2512.10360","url_pdf":"https://arxiv.org/pdf/2512.10360v2","authors":"[\"Liuyi Wang\",\"Zongtao He\",\"Jinlong Li\",\"Ruihao Xia\",\"Mengxian Hu\",\"Chenpeng Yao\",\"Chengju Liu\",\"Yang Tang\",\"Qijun Chen\"]","published":"2025-12-11T07:20:06Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false}