{"ID":2842137,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10059","arxiv_id":"2511.10059","title":"When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?","abstract":"Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an ``Audio-Visual Confusion'' scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs ``Is there a/an muted-object sound''. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30\\% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.","short_abstract":"Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? 
To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an ``Audio-Visual Confusion'' scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding obje...","url_abs":"https://arxiv.org/abs/2511.10059","url_pdf":"https://arxiv.org/pdf/2511.10059v1","authors":"[\"Qilang Ye\",\"Wei Zeng\",\"Meng Liu\",\"Jie Zhang\",\"Yupeng Hu\",\"Zitong Yu\",\"Yu Zhou\"]","published":"2025-11-13T07:59:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607106,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2842137,"paper_url":"https://arxiv.org/abs/2511.10059","paper_title":"When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?","repo_url":"https://github.com/rikeilong/AVConfusion","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
