{"ID":3004807,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03672","arxiv_id":"2606.03672","title":"Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation","abstract":"Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.","short_abstract":"Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video...","url_abs":"https://arxiv.org/abs/2606.03672","url_pdf":"https://arxiv.org/pdf/2606.03672v1","authors":"[\"Ye Tao\",\"Lupeng Liu\",\"Xuenan Xu\",\"Jiasun Feng\",\"Jiarui Wang\",\"Ying Qin\",\"Shuiyang Mao\",\"Wei Liu\",\"Shuai Wang\"]","published":"2026-06-02T13:56:31Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.MM\"]","methods":"[]","has_code":false}