{"ID":2822906,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.01568","arxiv_id":"2601.01568","title":"MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning","abstract":"Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.","short_abstract":"Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. \nExisting approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perfor...","url_abs":"https://arxiv.org/abs/2601.01568","url_pdf":"https://arxiv.org/pdf/2601.01568v2","authors":"[\"Chunyu Qiang\",\"Jun Wang\",\"Xiaopeng Wang\",\"Kang Yin\",\"Yuxin Guo\"]","published":"2026-01-04T15:26:15Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.CV\",\"cs.MM\",\"eess.AS\"]","methods":"[]","has_code":false}