{"ID":2841409,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12347","arxiv_id":"2511.12347","title":"VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing","abstract":"We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.","short_abstract":"We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large...","url_abs":"https://arxiv.org/abs/2511.12347","url_pdf":"https://arxiv.org/pdf/2511.12347v1","authors":"[\"Zhisheng Zheng\",\"Puyuan Peng\",\"Anuj Diwan\",\"Cong Phuoc Huynh\",\"Xiaohang Sun\",\"Zhu Liu\",\"Vimal Bhat\",\"David Harwath\"]","published":"2025-11-15T20:27:25Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CL\",\"cs.SD\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://zhishengzheng.com/voicecraft-x/\"]","has_code":false,"code_links":[{"ID":607053,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2841409,"paper_url":"https://arxiv.org/abs/2511.12347","paper_title":"VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing","repo_url":"https://github.com/zszheng147/VoiceCraft-X","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}