{"ID":2879083,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16930","arxiv_id":"2508.16930","title":"HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation","abstract":"Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.","short_abstract":"Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVi...","url_abs":"https://arxiv.org/abs/2508.16930","url_pdf":"https://arxiv.org/pdf/2508.16930v1","authors":"[\"Sizhe Shan\",\"Qiulin Li\",\"Yutao Cui\",\"Miles Yang\",\"Yuehai Wang\",\"Qun Yang\",\"Jin Zhou\",\"Zhao Zhong\"]","published":"2025-08-23T07:30:18Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CV\",\"cs.SD\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}