{"ID":2847908,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26190","arxiv_id":"2510.26190","title":"SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level","abstract":"The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.","short_abstract":"The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answe...","url_abs":"https://arxiv.org/abs/2510.26190","url_pdf":"https://arxiv.org/pdf/2510.26190v1","authors":"[\"Hitomi Jin Ling Tee\",\"Chaoren Wang\",\"Zijie Zhang\",\"Zhizheng Wu\"]","published":"2025-10-30T06:57:07Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\",\"eess.AS\"]","methods":"[]","has_code":false}