{"ID":2862440,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25681","arxiv_id":"2509.25681","title":"dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought","abstract":"Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding up to around times speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.","short_abstract":"Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understan...","url_abs":"https://arxiv.org/abs/2509.25681","url_pdf":"https://arxiv.org/pdf/2509.25681v1","authors":"[\"Junjie Wen\",\"Minjie Zhu\",\"Jiaming Liu\",\"Zhiyuan Liu\",\"Yicun Yang\",\"Linfeng Zhang\",\"Shanghang Zhang\",\"Yichen Zhu\",\"Yi Xu\"]","published":"2025-09-30T02:36:11Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false}
