{"ID":2872304,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09658","arxiv_id":"2509.09658","title":"Measuring Epistemic Humility in Multimodal Large Language Models","abstract":"Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a ``None of the above'' option. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a ``None of the above'' option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including general-purpose, specialized reasoning, and proprietary models -- on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at \\href{https://github.com/maifoundations/HumbleBench}{https://github.com/maifoundations/HumbleBench}.","short_abstract":"Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition acc...","url_abs":"https://arxiv.org/abs/2509.09658","url_pdf":"https://arxiv.org/pdf/2509.09658v2","authors":"[\"Bingkui Tong\",\"Jiaer Xia\",\"Sifeng Shang\",\"Kaiyang Zhou\"]","published":"2025-09-11T17:54:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609961,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2872304,"paper_url":"https://arxiv.org/abs/2509.09658","paper_title":"Measuring Epistemic Humility in Multimodal Large Language Models","repo_url":"https://github.com/maifoundations/HumbleBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
