{"ID":2883555,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.07642","arxiv_id":"2508.07642","title":"Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents","abstract":"Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.","short_abstract":"Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particu...","url_abs":"https://arxiv.org/abs/2508.07642","url_pdf":"https://arxiv.org/pdf/2508.07642v4","authors":"[\"Tianyi Ma\",\"Yue Zhang\",\"Zehao Wang\",\"Parisa Kordjamshidi\"]","published":"2025-08-11T05:50:30Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.CV\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}