{"ID":2849988,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.22588","arxiv_id":"2510.22588","title":"UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models","abstract":"Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control of multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks designed in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.","short_abstract":"Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale...","url_abs":"https://arxiv.org/abs/2510.22588","url_pdf":"https://arxiv.org/pdf/2510.22588v1","authors":"[\"Wenming Tu\",\"Guanrou Yang\",\"Ruiqi Yan\",\"Wenxi Chen\",\"Ziyang Ma\",\"Yipeng Kang\",\"Kai Yu\",\"Xie Chen\",\"Zilong Zheng\"]","published":"2025-10-26T09:06:55Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CL\"]","methods":"[]","has_code":false,"code_links":[{"ID":607754,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2849988,"paper_url":"https://arxiv.org/abs/2510.22588","paper_title":"UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models","repo_url":"https://github.com/bigai-nlco/UltraVoice","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
