{"ID":2884549,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09192","arxiv_id":"2508.09192","title":"Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing","abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\\mathbf{2.5\\times}$ inference speed than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\\mathbf{50\\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.","short_abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This p...","url_abs":"https://arxiv.org/abs/2508.09192","url_pdf":"https://arxiv.org/pdf/2508.09192v1","authors":"[\"Xu Wang\",\"Chenkai Xu\",\"Yijie Jin\",\"Jiachun Jin\",\"Hao Zhang\",\"Zhijie Deng\"]","published":"2025-08-08T04:51:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611087,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884549,"paper_url":"https://arxiv.org/abs/2508.09192","paper_title":"Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing","repo_url":"https://github.com/zhijie-group/Discrete-Diffusion-Forcing","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}