{"ID":2838056,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05126","arxiv_id":"2512.05126","title":"SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model","abstract":"Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.","short_abstract":"Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-au...","url_abs":"https://arxiv.org/abs/2512.05126","url_pdf":"https://arxiv.org/pdf/2512.05126v1","authors":"[\"Kaidi Wang\",\"Yi He\",\"Wenhao Guan\",\"Weijie Wu\",\"Hongwu Ding\",\"Xiong Zhang\",\"Di Wu\",\"Meng Meng\",\"Jian Luan\",\"Lin Li\",\"Qingyang Hong\"]","published":"2025-11-23T16:51:05Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.CL\",\"cs.CV\",\"cs.MM\",\"cs.SD\"]","methods":"[]","has_code":false}
