{"ID":2824982,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.21970","arxiv_id":"2512.21970","title":"StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision","abstract":"Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite their advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin in diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.","short_abstract":"Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite their advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric...","url_abs":"https://arxiv.org/abs/2512.21970","url_pdf":"https://arxiv.org/pdf/2512.21970v1","authors":"[\"Shengliang Deng\",\"Mi Yan\",\"Yixin Zheng\",\"Jiayi Su\",\"Wenhao Zhang\",\"Xiaoguang Zhao\",\"Heming Cui\",\"Zhizheng Zhang\",\"He Wang\"]","published":"2025-12-26T10:34:20Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false}