{"ID":2826554,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18735","arxiv_id":"2512.18735","title":"$M^3-Verse$: A \"Spot the Difference\" Challenge for Large Multimodal Models","abstract":"Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations, remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from https://github.com/Wal-K-aWay/M3-Verse_pipeline and full benchmark data from https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.","short_abstract":"Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations, remains largely unexplored. This ab...","url_abs":"https://arxiv.org/abs/2512.18735","url_pdf":"https://arxiv.org/pdf/2512.18735v2","authors":"[\"Kewei Wei\",\"Bocheng Hu\",\"Jie Cao\",\"Xiaohan Chen\",\"Zhengxi Lu\",\"Wubing Xia\",\"Weili Xu\",\"Jiaao Wu\",\"Junchen He\",\"Mingyu Jia\",\"Ciyun Zhao\",\"Ye Sun\",\"Yizhi Li\",\"Zhonghan Zhao\",\"Jian Zhang\",\"Gaoang Wang\"]","published":"2025-12-21T13:50:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","project_urls":"[\"https://www.modelscope.cn/datasets/WalKaWay/M3-Verse\"]","has_code":false,"code_links":[{"ID":605751,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2826554,"paper_url":"https://arxiv.org/abs/2512.18735","paper_title":"$M^3-Verse$: A \"Spot the Difference\" Challenge for Large Multimodal Models","repo_url":"https://github.com/Wal-K-aWay/M3-Verse_pipeline","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
