{"ID":2854613,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14628","arxiv_id":"2510.14628","title":"RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis","abstract":"Recent advances in Text-To-Speech (TTS) synthesis have achieved near-human speech quality in neutral speaking styles. However, most existing approaches either depend on costly emotion annotations or optimize surrogate objectives that fail to adequately capture perceptual emotional quality. As a result, the generated speech, while semantically accurate, often lacks expressive and emotionally rich characteristics. To address these limitations, we propose RLAIF-SPA, a novel framework that integrates Reinforcement Learning from AI Feedback (RLAIF) to directly optimize both emotional expressiveness and intelligibility without human supervision. Specifically, RLAIF-SPA incorporates Automatic Speech Recognition (ASR) to provide semantic accuracy feedback, while leveraging structured reward modeling to evaluate prosodic-emotional consistency. RLAIF-SPA enables more precise and nuanced control over expressive speech generation along four structured evaluation dimensions: Structure, Emotion, Speed, and Tone. Extensive experiments on Libri-Speech, MELD, and Mandarin ESD datasets demonstrate consistent gains across clean read speech, conversational dialogue, and emotional speech. On Libri-Speech, RLAIF-SPA consistently outperforms Chat-TTS, achieving a 26.1% reduction in word error rate, a 9.1% improvement in SIM-O, and over 10% gains in human subjective evaluations.","short_abstract":"Recent advances in Text-To-Speech (TTS) synthesis have achieved near-human speech quality in neutral speaking styles. However, most existing approaches either depend on costly emotion annotations or optimize surrogate objectives that fail to adequately capture perceptual emotional quality. As a result, the generated sp...","url_abs":"https://arxiv.org/abs/2510.14628","url_pdf":"https://arxiv.org/pdf/2510.14628v2","authors":"[\"Qing Yang\",\"Zhenghao Liu\",\"Yangfan Du\",\"Pengcheng Huang\",\"Tong Xiao\"]","published":"2025-10-16T12:40:37Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
