{"ID":2868668,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08581","arxiv_id":"2510.08581","title":"Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions","abstract":"Hallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. In this paper, we introduce a systematic pipeline that converts existing multimodal hallucination benchmarks into spoken-query versions while preserving the original tasks and labels. We instantiate this pipeline on RePOPE and release RePOPE-Spk, where all queries are provided as spoken audio under diverse input conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3-6% with clean speech and by up to 30% under environmental noise. Furthermore, many-shot prompting and chain-of-thought reasoning provide only partial mitigation. Our findings motivate new directions for building reliable voice interface systems and evaluations.","short_abstract":"Hallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. \nIn this paper, we introduce a systematic pi...","url_abs":"https://arxiv.org/abs/2510.08581","url_pdf":"https://arxiv.org/pdf/2510.08581v2","authors":"[\"Hansol Park\",\"Hoseong Ahn\",\"Junwon Moon\",\"Yejin Lee\",\"Kyuhong Shim\"]","published":"2025-09-19T07:18:45Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Large Language Model\"]","has_code":false}
