{"ID":2866560,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20060","arxiv_id":"2509.20060","title":"Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens","abstract":"This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronger ASR performance, and faster inference. We provide a comprehensive analysis of applying DDMs to speech reconstruction, examining sampler choices, inference steps, and robustness to length-scale estimation errors. Furthermore, we improve the original TASTE by systematically comparing vector quantization modules, showing that FSQ yields up to a 35% relative WER reduction and +0.14 UT-MOS improvement over RVQ for AR models, while also enhancing DDM performance. Our model generates speech in just 10 denoising steps and even supports single-step generation with only minor quality degradation.","short_abstract":"This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronger ASR performance, and faster inference....","url_abs":"https://arxiv.org/abs/2509.20060","url_pdf":"https://arxiv.org/pdf/2509.20060v1","authors":"[\"Pin-Jui Ku\",\"He Huang\",\"Jean-Marie Lemercier\",\"Subham Sekhar Sahoo\",\"Zhehuai Chen\",\"Ante Jukić\"]","published":"2025-09-24T12:29:31Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Diffusion Model\"]","has_code":false}
