{"ID":2864683,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23251","arxiv_id":"2509.23251","title":"XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System","abstract":"In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of 2,685 question-answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user-generated content (UGC), and 377 AI-generated content (AIGC). These videos cover 10 major domains and 53 fine-grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension. This requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks.","short_abstract":"In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, th...","url_abs":"https://arxiv.org/abs/2509.23251","url_pdf":"https://arxiv.org/pdf/2509.23251v1","authors":"[\"Yuqin Cao\",\"Xiongkuo Min\",\"Yixuan Gao\",\"Wei Sun\",\"Zicheng Zhang\",\"Jinliang Han\",\"Guangtao Zhai\"]","published":"2025-09-27T11:01:48Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.SD\"]","methods":"[\"Large Language Model\"]","has_code":false}
