{"ID":2877443,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.19542","arxiv_id":"2508.19542","title":"CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning","abstract":"While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to analyze and integrate spatiotemporal patterns from dynamic visual streams. Extensive evaluation of 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot or chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 63.5% accuracy on causal reasoning tasks, compared to the 91.3% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLMs architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for advancing pattern recognition methodologies in multi-video scenarios, providing architectural insights for next-generation models. The data and evaluation code are available at: https://github.com/Hokhim2/CVBench.","short_abstract":"While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. However, this capability is essential for real-world appl...","url_abs":"https://arxiv.org/abs/2508.19542","url_pdf":"https://arxiv.org/pdf/2508.19542v4","authors":"[\"Nannan Zhu\",\"Yonghao Dong\",\"Teng Wang\",\"Xueqian Li\",\"Shengjun Deng\",\"Yijia Wang\",\"Zheng Hong\",\"Tiantian Geng\",\"Guo Niu\",\"Hanyan Huang\",\"Xiongfei Yao\",\"Shuaiwei Jiao\"]","published":"2025-08-27T03:29:35Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610378,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877443,"paper_url":"https://arxiv.org/abs/2508.19542","paper_title":"CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning","repo_url":"https://github.com/Hokhim2/CVBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
