{"ID":2865237,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22167","arxiv_id":"2509.22167","title":"Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis","abstract":"While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.","short_abstract":"While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamenta...","url_abs":"https://arxiv.org/abs/2509.22167","url_pdf":"https://arxiv.org/pdf/2509.22167v2","authors":"[\"Zhikang Niu\",\"Shujie Hu\",\"Jeongsoo Choi\",\"Yushen Chen\",\"Peining Chen\",\"Pengcheng Zhu\",\"Yunting Yang\",\"Bowen Zhang\",\"Jian Zhao\",\"Chunhui Wang\",\"Xie Chen\"]","published":"2025-09-26T10:27:58Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Variational Autoencoder\"]","has_code":false}