{"ID":2869244,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14784","arxiv_id":"2509.14784","title":"MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis","abstract":"This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous feature generation approaches in TTS, offering a compelling alternative to discrete-token-based paradigms.","short_abstract":"This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. \nBy autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines...","url_abs":"https://arxiv.org/abs/2509.14784","url_pdf":"https://arxiv.org/pdf/2509.14784v2","authors":"[\"Keyu An\",\"Zhiyu Zhang\",\"Changfeng Gao\",\"Yabin Li\",\"Zhendong Peng\",\"Haoxu Wang\",\"Zhihao Du\",\"Han Zhao\",\"Zhifu Gao\",\"Xiangang Li\"]","published":"2025-09-18T09:35:15Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}
