{"ID":2863896,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25373","arxiv_id":"2509.25373","title":"From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models","abstract":"Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: \"From Perception to Cognition.\" We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. 
This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.","short_abstract":"Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasonin...","url_abs":"https://arxiv.org/abs/2509.25373","url_pdf":"https://arxiv.org/pdf/2509.25373v4","authors":"[\"Chenyue Zhou\",\"Mingxuan Wang\",\"Yanbiao Ma\",\"Chenxu Wu\",\"Wanyi Chen\",\"Zhe Qian\",\"Xinyu Liu\",\"Yiwei Zhang\",\"Junhao Wang\",\"Hengbo Xu\",\"Fei Luo\",\"Xiaohua Chen\",\"Xiaoshuai Hao\",\"Hehan Li\",\"Andi Zhang\",\"Wenxuan Wang\",\"Kaiyan Zhang\",\"Guoli Jia\",\"Lingling Li\",\"Zhiwu Lu\",\"Yang Lu\",\"Yike Guo\"]","published":"2025-09-29T18:25:40Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}