{"ID":2832959,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04720","arxiv_id":"2512.04720","title":"M3-TTS: Multi-modal DiT Alignment \u0026 Mel-latent for Zero-shot High-fidelity Speech Synthesis","abstract":"Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-vae codec that provides 3× training acceleration. Experimental results on Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.","short_abstract":"Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose...","url_abs":"https://arxiv.org/abs/2512.04720","url_pdf":"https://arxiv.org/pdf/2512.04720v1","authors":"[\"Xiaopeng Wang\",\"Chunyu Qiang\",\"Ruibo Fu\",\"Zhengqi Wen\",\"Xuefei Liu\",\"Yukun Liu\",\"Yuzhe Liang\",\"Kang Yin\",\"Yuankun Xie\",\"Heng Xie\",\"Chenxing Li\",\"Chen Zhang\",\"Changsheng Li\"]","published":"2025-12-04T12:04:02Z","proceeding":"cs.SD","tasks":"[\"cs.SD\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Variational Autoencoder\"]","has_code":false}
