{"ID":2923504,"CreatedAt":"2026-06-02T04:05:25.881865328Z","UpdatedAt":"2026-06-04T17:36:40.748176825Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02544","arxiv_id":"2606.02544","title":"SimSD: Simple Speculative Decoding in Diffusion Language Models","abstract":"Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.","short_abstract":"Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most...","url_abs":"https://arxiv.org/abs/2606.02544","url_pdf":"https://arxiv.org/pdf/2606.02544v1","authors":"[\"Junxia Cui\",\"Haotian Ye\",\"Runchu Tian\",\"Hongcan Guo\",\"Jinya Jiang\",\"Haoru Li\",\"Chaojie Ren\",\"Yiming Huang\",\"Kaijie Zhu\",\"Zhongkai Yu\",\"Kun Zhou\",\"Jingbo Shang\"]","published":"2026-06-01T17:46:46Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
