{"ID":2847098,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.01056","arxiv_id":"2511.01056","title":"WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion","abstract":"Whispered speech lacks vocal-fold excitation, making intelligible conversion challenging. We propose WhisperVC, a three-stage framework for low-resource whisper-to-normal (W2N) conversion that decouples cross-domain alignment from speech generation. Stage 1 uses limited paired whisper-normal data with a content encoder and a Conformer-based variational autoencoder (VAE) with soft-DTW alignment to learn domain-invariant semantic representations. Stage 2, trained only on normal speech, employs a Length-Channel Aligner and a two-stage speaker-conditioned mel generator for timbre and prosody modeling. Stage 3 fine-tunes a HiFi-GAN vocoder for waveform synthesis. Experimental results on AISHELL6-Whisper show competitive quality (DNSMOS 3.07, UTMOS 2.83, CER 16.93%) and WavLM speaker similarity (0.95). The framework also supports privacy-preserving communication as well as non-vocal communication and a rehabilitation tool for post-surgical vocal-fold patients. Samples are available online.","short_abstract":"Whispered speech lacks vocal-fold excitation, making intelligible conversion challenging. We propose WhisperVC, a three-stage framework for low-resource whisper-to-normal (W2N) conversion that decouples cross-domain alignment from speech generation. Stage 1 uses limited paired whisper-normal data with a content encoder...","url_abs":"https://arxiv.org/abs/2511.01056","url_pdf":"https://arxiv.org/pdf/2511.01056v2","authors":"[\"Dong Liu\",\"Juan Liu\",\"Wei Ju\",\"Yao Tian\",\"Ming Li\"]","published":"2025-11-02T19:18:38Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Generative Adversarial Network\",\"Variational Autoencoder\"]","has_code":false}
