{"ID":2855122,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13375","arxiv_id":"2510.13375","title":"DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning","abstract":"Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.","short_abstract":"Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive...","url_abs":"https://arxiv.org/abs/2510.13375","url_pdf":"https://arxiv.org/pdf/2510.13375v1","authors":"[\"Tianyuan Yuan\",\"Yicheng Liu\",\"Chenhao Lu\",\"Zhuoguang Chen\",\"Tao Jiang\",\"Hang Zhao\"]","published":"2025-10-15T10:09:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
