{"ID":2867781,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17901","arxiv_id":"2509.17901","title":"Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy","abstract":"Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.","short_abstract":"Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. \nWe audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~76% o...","url_abs":"https://arxiv.org/abs/2509.17901","url_pdf":"https://arxiv.org/pdf/2509.17901v3","authors":"[\"Geewook Kim\",\"Minjoon Seo\"]","published":"2025-09-22T15:28:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\",\"cs.SD\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":609511,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867781,"paper_url":"https://arxiv.org/abs/2509.17901","paper_title":"Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy","repo_url":"https://github.com/naver-ai/LLaVA-AV-SSM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}