{"ID":2827771,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.17077","arxiv_id":"2512.17077","title":"Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving","abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical \"memory footprint crisis\" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound \"Refresh\" phases and bandwidth-bound \"Reuse\" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61$\\times$-1.81$\\times$ on the consumer-grade RTX 4090 and 1.60$\\times$-1.74$\\times$ on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4$\\times$ under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware. \nThe code is available at https://github.com/chosen-ox/dLLM-Serve.","short_abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memo...","url_abs":"https://arxiv.org/abs/2512.17077","url_pdf":"https://arxiv.org/pdf/2512.17077v2","authors":"[\"Jiakun Fan\",\"Yanglin Zhang\",\"Xiangchen Li\",\"Dimitrios S. Nikolopoulos\"]","published":"2025-12-18T21:18:15Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":605829,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2827771,"paper_url":"https://arxiv.org/abs/2512.17077","paper_title":"Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving","repo_url":"https://github.com/chosen-ox/dLLM-Serve","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
