{"ID":2826116,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19090","arxiv_id":"2512.19090","title":"JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis","abstract":"Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that uses autoregressive hidden representations directly as diffusion inputs, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low bitrate of 12.5 Hz, which integrates multitask semantic and MMSE losses to effectively model both semantic and acoustic information. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval Benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization. It also significantly improves prosodic continuity in long-form speech, rhythmic richness in multi-speaker conversations, and paralinguistic naturalness, in addition to delivering superior intelligibility. 
We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice","short_abstract":"Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundatio...","url_abs":"https://arxiv.org/abs/2512.19090","url_pdf":"https://arxiv.org/pdf/2512.19090v1","authors":"[\"Fan Yu\",\"Tao Wang\",\"You Wu\",\"Lin Zhu\",\"Wei Deng\",\"Weisheng Han\",\"Wenchao Wang\",\"Lin Hu\",\"Xiangyu Liang\",\"Xiaodong He\",\"Yankun Huang\",\"Yu Gu\",\"Yuan Liu\",\"Yuxuan Wang\",\"Zhangyu Xiao\",\"Ziteng Wang\",\"Boya Dong\",\"Feng Dang\",\"Jinming Chen\",\"Jingdong Li\",\"Jun Wang\",\"Yechen Jin\",\"Yuan Zhang\",\"Zhengyan Sheng\",\"Xin Wang\"]","published":"2025-12-22T07:00:05Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}
