{"ID":2884438,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.07048","arxiv_id":"2508.07048","title":"Whisfusion: Parallel ASR Decoding via a Diffusion Transformer","abstract":"Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (\u003e20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion.","short_abstract":"Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. \nHowever, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) met...","url_abs":"https://arxiv.org/abs/2508.07048","url_pdf":"https://arxiv.org/pdf/2508.07048v1","authors":"[\"Taeyoun Kwon\",\"Junhyuk Ahn\",\"Taegeun Yun\",\"Heeju Jwa\",\"Yoonchae Choi\",\"Siwon Park\",\"Nam-Joon Kim\",\"Jangchan Kim\",\"Hyun Gon Ryu\",\"Hyuk-Jae Lee\"]","published":"2025-08-09T17:20:54Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false,"code_links":[{"ID":611080,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884438,"paper_url":"https://arxiv.org/abs/2508.07048","paper_title":"Whisfusion: Parallel ASR Decoding via a Diffusion Transformer","repo_url":"https://github.com/taeyoun811/Whisfusion","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
