{"ID":2831494,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07173","arxiv_id":"2512.07173","title":"Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration","abstract":"We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields a 1.1x to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.","short_abstract":"We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size,...","url_abs":"https://arxiv.org/abs/2512.07173","url_pdf":"https://arxiv.org/pdf/2512.07173v4","authors":"[\"Jucheng Shen\",\"Gaurav Sarkar\",\"Yeonju Ro\",\"Sharath Nittur Sridhar\",\"Zhangyang Wang\",\"Aditya Akella\",\"Souvik Kundu\"]","published":"2025-12-08T05:15:41Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}