{"ID":2833959,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02650","arxiv_id":"2512.02650","title":"Hear What Matters! Text-conditioned Selective Video-to-Audio Generation","abstract":"This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. We propose SELVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector to distinctly extract prompt-relevant sound-source visual features from the video encoder. To suppress text-irrelevant activations with efficient video encoder finetuning, the proposed supplementary tokens promote cross-attention to yield robust semantic and temporal grounding. SELVA further employs an autonomous video-mixing scheme in a self-supervised manner to overcome the lack of mono audio track supervision. We evaluate SELVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization.","short_abstract":"This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixin...","url_abs":"https://arxiv.org/abs/2512.02650","url_pdf":"https://arxiv.org/pdf/2512.02650v2","authors":"[\"Junwon Lee\",\"Juhan Nam\",\"Jiyoung Lee\"]","published":"2025-12-02T11:12:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.MM\",\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
