{"ID":2894780,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.10109","arxiv_id":"2507.10109","title":"DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis","abstract":"While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.","short_abstract":"While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech wit...","url_abs":"https://arxiv.org/abs/2507.10109","url_pdf":"https://arxiv.org/pdf/2507.10109v1","authors":"[\"Wenjie Tian\",\"Xinfa Zhu\",\"Haohe Liu\",\"Zhixian Zhao\",\"Zihao Chen\",\"Chaofan Ding\",\"Xinhan Di\",\"Junjie Zheng\",\"Lei Xie\"]","published":"2025-07-14T09:50:53Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}