{"ID":2830661,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09504","arxiv_id":"2512.09504","title":"DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance","abstract":"Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.","short_abstract":"Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal p...","url_abs":"https://arxiv.org/abs/2512.09504","url_pdf":"https://arxiv.org/pdf/2512.09504v1","authors":"[\"Kang Yin\",\"Chunyu Qiang\",\"Sirui Zhao\",\"Xiaopeng Wang\",\"Yuzhe Liang\",\"Pengfei Cai\",\"Tong Xu\",\"Chen Zhang\",\"Enhong Chen\"]","published":"2025-12-10T10:28:18Z","proceeding":"cs.SD","tasks":"[\"cs.SD\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}
