{"ID":2845618,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03146","arxiv_id":"2511.03146","title":"MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity","abstract":"As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -\u003e reason -\u003e verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.","short_abstract":"As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric co...","url_abs":"https://arxiv.org/abs/2511.03146","url_pdf":"https://arxiv.org/pdf/2511.03146v2","authors":"[\"Kaiyuan Zhang\",\"Chenghao Yang\",\"Zhoufutu Wen\",\"Sihang Yuan\",\"Qiuyue Wang\",\"Chaoyi Huang\",\"Guosheng Zhu\",\"He Wang\",\"Huawenyu Lu\",\"Jianing Wen\",\"Jianpeng Jiao\",\"Lishu Luo\",\"Longxiang Liu\",\"Sijin Wu\",\"Xiaolei Zhu\",\"Xuanliang Zhang\",\"Yu Liu\",\"Ge Zhang\",\"Yi Lin\",\"Guang Shi\",\"Chaoyou Fu\",\"Wenhao Huang\"]","published":"2025-11-05T03:09:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Generative Adversarial Network\"]","has_code":false}
