{"ID":3084640,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T19:15:30.205453645Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05367","arxiv_id":"2606.05367","title":"Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech","abstract":"We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B - model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone - we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $τ= \\mathbb{E}_i[x(s_i,\\text{emo})] -\\mathbb{E}_i[x(s_i,\\text{neutral})]$ applied to an unseen target speaker as $x_{\\text{new}} = x(\\text{target},\\text{neutral}) + α\\cdotτ$. Using ESD (English) as the $τ$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\\gtrsim 0.88$ for the multi-speaker $τ$ variant) and intelligibility (WER $\\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.","short_abstract":"We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower opera...","url_abs":"https://arxiv.org/abs/2606.05367","url_pdf":"https://arxiv.org/pdf/2606.05367v1","authors":"[\"Daniel Oliveira de Brito\",\"Arnaldo Candido Junior\"]","published":"2026-06-03T19:15:28Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"LoRA\"]","has_code":false}
