{"ID":2873567,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.06932","arxiv_id":"2509.06932","title":"LLaDA-VLA: Vision Language Diffusion Action Models","abstract":"The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLA) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.","short_abstract":"The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLA) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and m...","url_abs":"https://arxiv.org/abs/2509.06932","url_pdf":"https://arxiv.org/pdf/2509.06932v2","authors":"[\"Yuqing Wen\",\"Hebei Li\",\"Kefan Gu\",\"Yucheng Zhao\",\"Tiancai Wang\",\"Xiaoyan Sun\"]","published":"2025-09-08T17:45:40Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false}
