{"ID":2851334,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20818","arxiv_id":"2510.20818","title":"VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation","abstract":"A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enable this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. 
Website: https://vamos-vla.github.io/","short_abstract":"A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples se...","url_abs":"https://arxiv.org/abs/2510.20818","url_pdf":"https://arxiv.org/pdf/2510.20818v1","authors":"[\"Mateo Guaman Castro\",\"Sidharth Rajagopal\",\"Daniel Gorbatov\",\"Matt Schmittle\",\"Rohan Baijal\",\"Octi Zhang\",\"Rosario Scalise\",\"Sidharth Talia\",\"Emma Romig\",\"Celso de Melo\",\"Byron Boots\",\"Abhishek Gupta\"]","published":"2025-10-23T17:59:45Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}