{"ID":3004801,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03682","arxiv_id":"2606.03682","title":"GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation","abstract":"Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.","short_abstract":"Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse...","url_abs":"https://arxiv.org/abs/2606.03682","url_pdf":"https://arxiv.org/pdf/2606.03682v1","authors":"[\"Xinhai Li\",\"Xiaotao Zhang\",\"Yuehao Huang\",\"Jiankun Dong\",\"Tianhang Wang\",\"Sunyao Zhou\",\"Yunzi Wu\",\"Chengnuo Sun\",\"Yunfei Ge\",\"Qizhen Weng\",\"Chi Zhang\",\"Chenjia Bai\",\"Xuelong Li\"]","published":"2026-06-02T14:05:47Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"LoRA\"]","has_code":false}
