{"ID":2855056,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13276","arxiv_id":"2510.13276","title":"MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models","abstract":"The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.","short_abstract":"The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context fa...","url_abs":"https://arxiv.org/abs/2510.13276","url_pdf":"https://arxiv.org/pdf/2510.13276v1","authors":"[\"Keyan Zhou\",\"Zecheng Tang\",\"Lingfeng Ming\",\"Guanghao Zhou\",\"Qiguang Chen\",\"Dan Qiao\",\"Zheming Yang\",\"Libo Qin\",\"Minghui Qiu\",\"Juntao Li\",\"Min Zhang\"]","published":"2025-10-15T08:22:03Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
