{"ID":2842085,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.09995","arxiv_id":"2511.09995","title":"Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS","abstract":"Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model architectures, including decoder-only language model (LM)-based and free TTS systems. A demo is provided.","short_abstract":"Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. \nTo this...","url_abs":"https://arxiv.org/abs/2511.09995","url_pdf":"https://arxiv.org/pdf/2511.09995v3","authors":"[\"Haoyu Li\",\"Mingyang Han\",\"Yu Xi\",\"Dongxiao Wang\",\"Hankun Wang\",\"Haoxiang Shi\",\"Boyu Li\",\"Jun Song\",\"Bo Zheng\",\"Shuai Wang\",\"Kai Yu\"]","published":"2025-11-13T05:59:04Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}