{"ID":2862880,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26432","arxiv_id":"2509.26432","title":"AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size","abstract":"Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, block-wise semi-autoregressive (semi-AR) approaches are widely adopted due to their support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size setting in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.\nOur code is available at https://github.com/lgxi24/AdaBlock-dLLM.","short_abstract":"Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, block-wise semi-autoregressive (semi-AR) approaches are widely adopted due to their support for KV caching a...","url_abs":"https://arxiv.org/abs/2509.26432","url_pdf":"https://arxiv.org/pdf/2509.26432v3","authors":"[\"Guanxi Lu\",\"Hao Mark Chen\",\"Yuto Karashima\",\"Zhican Wang\",\"Daichi Fujiki\",\"Hongxiang Fan\"]","published":"2025-09-30T15:53:56Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608955,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862880,"paper_url":"https://arxiv.org/abs/2509.26432","paper_title":"AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size","repo_url":"https://github.com/lgxi24/AdaBlock-dLLM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
