{"ID":2831992,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06776","arxiv_id":"2512.06776","title":"From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs","abstract":"Diffusion Language Models (DLMs) enable fast generation, yet training large DLMs from scratch is costly. As a practical shortcut, adapting off-the-shelf Auto-Regressive (AR) model weights into a DLM could quickly equip the DLM with strong long-context generation capabilities. Prior \"adaptation\" attempts either modify logits or randomly grow attention masks to Full-Sequence diffusion, or simply transplant AR weights into a Block-Diffusion recipe, leaving two key questions unaddressed: where is the final destination of adaptation, and how to adapt better? For manifold benefits, we reframe the whole AR-to-DLM adaptation under the Block-Diffusion paradigm, transitioning from block size 1 to the final Block-Diffusion state. Concretely, the principled pathway of adaptation is designed as follows: we keep a context-causal path where causal attention is kept in the prefix, an efficient parallel adaptation procedure where an AR guidance is maintained, and gradual increment of the generation block size for a smoother transition. Built on these components, the adaptation is proved competitive on various models at different scales. With better adaptation, we propose NBDiff-7B that could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance among the 7B-class DLMs. Codes: https://github.com/YuchuanTian/NBDiff.","short_abstract":"Diffusion Language Models (DLMs) enable fast generation, yet training large DLMs from scratch is costly. As a practical shortcut, adapting off-the-shelf Auto-Regressive (AR) model weights into a DLM could quickly equip the DLM with strong long-context generation capabilities. 
Prior \"adaptation\" attempts either modify log...","url_abs":"https://arxiv.org/abs/2512.06776","url_pdf":"https://arxiv.org/pdf/2512.06776v2","authors":"[\"Yuchuan Tian\",\"Yuchen Liang\",\"Shuo Zhang\",\"Yingte Shu\",\"Guangwen Yang\",\"Wei He\",\"Sibo Fang\",\"Tianyu Guo\",\"Kai Han\",\"Chao Xu\",\"Hanting Chen\",\"Xinghao Chen\",\"Yunhe Wang\"]","published":"2025-12-07T10:28:21Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606187,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831992,"paper_url":"https://arxiv.org/abs/2512.06776","paper_title":"From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs","repo_url":"https://github.com/YuchuanTian/NBDiff","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}