{"ID":2854729,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14836","arxiv_id":"2510.14836","title":"QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models","abstract":"Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.","short_abstract":"Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-V...","url_abs":"https://arxiv.org/abs/2510.14836","url_pdf":"https://arxiv.org/pdf/2510.14836v2","authors":"[\"Yixuan Li\",\"Yuhui Chen\",\"Mingcai Zhou\",\"Haoran Li\",\"Zhengtao Zhang\",\"Dongbin Zhao\"]","published":"2025-10-16T16:11:18Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.RO\"]","methods":"[\"Variational Autoencoder\"]","has_code":false}