{"ID":2829478,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.12196","arxiv_id":"2512.12196","title":"AutoMV: An Automatic Multi-Agent System for Music Video Generation","abstract":"Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for following agents. The screenwriter Agent and director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for \"story\" or \"singer\" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent longform MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.","short_abstract":"Music-to-Video (M2V) generation for full-length songs faces significant challenges. 
Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly...","url_abs":"https://arxiv.org/abs/2512.12196","url_pdf":"https://arxiv.org/pdf/2512.12196v1","authors":"[\"Xiaoxuan Tang\",\"Xinping Lei\",\"Chaoran Zhu\",\"Shiyun Chen\",\"Ruibin Yuan\",\"Yizhi Li\",\"Changjae Oh\",\"Ge Zhang\",\"Wenhao Huang\",\"Emmanouil Benetos\",\"Yang Liu\",\"Jiaheng Liu\",\"Yinghao Ma\"]","published":"2025-12-13T05:53:50Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.CV\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
