{"ID":2866420,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19852","arxiv_id":"2509.19852","title":"Eliminating stability hallucinations in llm-based tts models via attention guidance","abstract":"This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at https://wsmzzz.github.io/llm_attn.","short_abstract":"This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optim...","url_abs":"https://arxiv.org/abs/2509.19852","url_pdf":"https://arxiv.org/pdf/2509.19852v2","authors":"[\"ShiMing Wang\",\"ZhiHao Du\",\"Yang Xiang\",\"TianYu Zhao\",\"Han Zhao\",\"Qian Chen\",\"XianGang Li\",\"HanJie Guo\",\"ZhenHua Ling\"]","published":"2025-09-24T07:47:52Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
