{"ID":2886606,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03955","arxiv_id":"2508.03955","title":"Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm","abstract":"Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage pretrained text-to-video generator and audio encoders, introducing only 1.9\\% additional trainable parameters to learn audio-conditioning capability without compromising the generator's prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3$\\times$ more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10$\\times$, while generalizing to many open classes.","short_abstract":"Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world...","url_abs":"https://arxiv.org/abs/2508.03955","url_pdf":"https://arxiv.org/pdf/2508.03955v1","authors":"[\"Lin Zhang\",\"Zefan Cai\",\"Yufan Zhou\",\"Shentong Mo\",\"Jinhong Lin\",\"Cheng-En Wu\",\"Yibing Wei\",\"Yijing Zhang\",\"Ruiyi Zhang\",\"Wen Xiao\",\"Tong Sun\",\"Junjie Hu\",\"Pedro Morgado\"]","published":"2025-08-05T22:44:36Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}