{"ID":2848377,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25150","arxiv_id":"2510.25150","title":"Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR","abstract":"Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noiseinvariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.","short_abstract":"Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disent...","url_abs":"https://arxiv.org/abs/2510.25150","url_pdf":"https://arxiv.org/pdf/2510.25150v1","authors":"[\"Shreyas Gopal\",\"Ashutosh Anshul\",\"Haoyang Li\",\"Yue Heng Yeo\",\"Hexin Liu\",\"Eng Siong Chng\"]","published":"2025-10-29T04:08:19Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
