{"ID":2863515,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24629","arxiv_id":"2509.24629","title":"Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis","abstract":"While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.","short_abstract":"While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion tran...","url_abs":"https://arxiv.org/abs/2509.24629","url_pdf":"https://arxiv.org/pdf/2509.24629v2","authors":"[\"Tianrui Wang\",\"Haoyu Wang\",\"Meng Ge\",\"Cheng Gong\",\"Chunyu Qiang\",\"Ziyang Ma\",\"Zikang Huang\",\"Guanrou Yang\",\"Xiaobao Wang\",\"Eng Siong Chng\",\"Xie Chen\",\"Longbiao Wang\",\"Jianwu Dang\"]","published":"2025-09-29T11:37:39Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[]","has_code":false}
