{"ID":2858632,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08618","arxiv_id":"2510.08618","title":"VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models","abstract":"Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models' inference process to follow the human-like \"Look-then-Listen\" inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a \u003cthink\u003e block to serve as semantic anchors, then generates the transcription in an \u003canswer\u003e block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. We conduct extensive evaluations demonstrating that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.","short_abstract":"Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. 
However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, cau...","url_abs":"https://arxiv.org/abs/2510.08618","url_pdf":"https://arxiv.org/pdf/2510.08618v2","authors":"[\"Rui Hu\",\"Delai Qiu\",\"Yining Wang\",\"Shengping Liu\",\"Jitao Sang\"]","published":"2025-10-08T08:18:47Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CV\",\"cs.SD\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}