{"ID":2864989,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22744","arxiv_id":"2509.22744","title":"Index-MSR: A high-efficiency multimodal fusion framework for speech recognition","abstract":"Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significantly. In this work, we present Index-MSR, an efficient multimodal speech recognition framework. At its core is a novel Multimodal Fusion Decoder (MFD), which incorporates text-related information from videos (e.g., subtitles and presentation slides) into speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results show that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, with strong potential in applications requiring strict audio-text synchronization, such as audio translation.","short_abstract":"Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. 
However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significant...","url_abs":"https://arxiv.org/abs/2509.22744","url_pdf":"https://arxiv.org/pdf/2509.22744v1","authors":"[\"Jinming Chen\",\"Lu Wang\",\"Zheshu Song\",\"Wei Deng\"]","published":"2025-09-26T03:47:15Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.MM\",\"cs.SD\"]","methods":"[\"Large Language Model\"]","has_code":false}
