{"ID":2848935,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24103","arxiv_id":"2510.24103","title":"Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation","abstract":"We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio","short_abstract":"We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. \nUnlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a d...","url_abs":"https://arxiv.org/abs/2510.24103","url_pdf":"https://arxiv.org/pdf/2510.24103v1","authors":"[\"Kang Zhang\",\"Trung X. Pham\",\"Suyeon Lee\",\"Axi Niu\",\"Arda Senocak\",\"Joon Son Chung\"]","published":"2025-10-28T06:16:47Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.MM\",\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":607658,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2848935,"paper_url":"https://arxiv.org/abs/2510.24103","paper_title":"Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation","repo_url":"https://github.com/pantheon5100/mgaudio","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
