{"ID":2830981,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08186","arxiv_id":"2512.08186","title":"Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation","abstract":"While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, \"grounds slowly\" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, \"moves fast\" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.","short_abstract":"While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. \nSuch designs often produce fragmented motions, incur high latency, and st...","url_abs":"https://arxiv.org/abs/2512.08186","url_pdf":"https://arxiv.org/pdf/2512.08186v1","authors":"[\"Meng Wei\",\"Chenyang Wan\",\"Jiaqi Peng\",\"Xiqian Yu\",\"Yuqiang Yang\",\"Delin Feng\",\"Wenzhe Cai\",\"Chenming Zhu\",\"Tai Wang\",\"Jiangmiao Pang\",\"Xihui Liu\"]","published":"2025-12-09T02:29:36Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Language Model\"]","has_code":false}
