{"ID":2858271,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08666","arxiv_id":"2510.08666","title":"dInfer: An Efficient Inference Framework for Diffusion Language Models","abstract":"Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components--model, diffusion iteration manager, decoding strategy, and KV-cache manager--and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model Qwen2.5-3B (which has a comparable number of activated parameters and comparable performance), highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.","short_abstract":"Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and...","url_abs":"https://arxiv.org/abs/2510.08666","url_pdf":"https://arxiv.org/pdf/2510.08666v3","authors":"[\"Yuxin Ma\",\"Lun Du\",\"Lanning Wei\",\"Kun Chen\",\"Qian Xu\",\"Kangyu Wang\",\"Guofeng Feng\",\"Guoshan Lu\",\"Lin Liu\",\"Xiaojing Qi\",\"Xinyuan Zhang\",\"Zhen Tao\",\"Haibo Feng\",\"Ziyun Jiang\",\"Ying Xu\",\"Zenan Huang\",\"Yihong Zhuang\",\"Haokai Xu\",\"Jiaqi Hu\",\"Zhenzhong Lan\",\"Junbo Zhao\",\"Jianguo Li\",\"Da Zheng\"]","published":"2025-10-09T16:19:42Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608529,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2858271,"paper_url":"https://arxiv.org/abs/2510.08666","paper_title":"dInfer: An Efficient Inference Framework for Diffusion Language Models","repo_url":"https://github.com/inclusionAI/dInfer","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}