{"ID":2882541,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.11074","arxiv_id":"2508.11074","title":"LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters","abstract":"Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models and it incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\\text{passt}}$ 450.00 $\\rightarrow$ 327.29 (+27.27%), $FD_{\\text{panns}}$ 34.88 $\\rightarrow$ 22.68 (+34.98%), $FD_{\\text{vgg}}$ 3.75 $\\rightarrow$ 1.28 (+65.87%), $KL_{\\text{panns}}$ 2.49 $\\rightarrow$ 2.07 (+16.87%), $KL_{\\text{passt}}$ 1.78 $\\rightarrow$ 1.53 (+14.04%), $IS_{\\text{panns}}$ 4.17 $\\rightarrow$ 4.30 (+3.12%), $IB_{\\text{score}}$ 0.25 $\\rightarrow$ 0.28 (+12.00%), $Energy\\Delta10\\text{ms}$ 0.3013 $\\rightarrow$ 0.1349 (+55.23%), $Energy\\Delta10\\text{ms(vs.GT)}$ 0.0531 $\\rightarrow$ 0.0288 (+45.76%), and $Sem.\\,Rel.$ 2.73 $\\rightarrow$ 3.28 (+20.15%). 
Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.","short_abstract":"Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely...","url_abs":"https://arxiv.org/abs/2508.11074","url_pdf":"https://arxiv.org/pdf/2508.11074v1","authors":"[\"Haomin Zhang\",\"Kristin Qi\",\"Shuxin Yang\",\"Zihao Chen\",\"Chaofan Ding\",\"Xinhan Di\"]","published":"2025-08-14T21:11:57Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.CV\",\"eess.AS\"]","methods":"[]","has_code":false,"code_links":[{"ID":610901,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2882541,"paper_url":"https://arxiv.org/abs/2508.11074","paper_title":"LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters","repo_url":"https://github.com/deepreasonings/long-form-video2audio","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
