{"ID":2886655,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02013","arxiv_id":"2508.02013","title":"SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents","abstract":"Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation. We introduce SpeechRole, a unified framework for developing and assessing SRPAs. SpeechRole-Data contains 98 roles and 111k speech-to-speech conversations with rich timbre and prosodic variation, providing large-scale resources for training SRPAs. SpeechRole-Eval offers a multidimensional benchmark that directly evaluates generated speech, preserving paralinguistic cues and measuring interaction ability, speech expressiveness, and role-playing fidelity. Experiments show that end-to-end SRPAs such as GPT-4o Audio achieve strong fluency and naturalness, but remain limited in prosody consistency and emotion appropriateness. In contrast, current open-source end-to-end models exhibit substantial performance gaps across multiple evaluation dimensions. Cascaded and end-to-end systems achieve comparable results in interaction ability and role-playing fidelity, suggesting that these aspects are still largely influenced by the underlying text-based language models. We release all data, code, and evaluation tools at https://github.com/yuhui1038/SpeechRole.","short_abstract":"Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation. We introduce SpeechRole, a unified framework for developing and assessing SRPAs. SpeechRole-Data contains 98 roles...","url_abs":"https://arxiv.org/abs/2508.02013","url_pdf":"https://arxiv.org/pdf/2508.02013v7","authors":"[\"Changhao Jiang\",\"Jiajun Sun\",\"Yifei Cao\",\"Jiabao Zhuang\",\"Xinmeng Che\",\"Hui Li\",\"Xiaoran Fan\",\"Ming Zhang\",\"Junjie Ye\",\"Shihan Dou\",\"Zhiheng Xi\",\"Jingqi Tong\",\"Yilong Wu\",\"Baoyu Fan\",\"Tao Ji\",\"Tao Gui\",\"Qi Zhang\",\"Xuanjing Huang\"]","published":"2025-08-04T03:18:36Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611328,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886655,"paper_url":"https://arxiv.org/abs/2508.02013","paper_title":"SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents","repo_url":"https://github.com/yuhui1038/SpeechRole","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}