{"ID":2899406,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.02190","arxiv_id":"2507.02190","title":"cVLA: Towards Efficient Camera-Space VLAs","abstract":"Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.","short_abstract":"Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-eff...","url_abs":"https://arxiv.org/abs/2507.02190","url_pdf":"https://arxiv.org/pdf/2507.02190v2","authors":"[\"Max Argus\",\"Jelena Bratulic\",\"Houman Masnavi\",\"Maxim Velikanov\",\"Nick Heppert\",\"Abhinav Valada\",\"Thomas Brox\"]","published":"2025-07-02T22:56:41Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
