{"ID":2840031,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14446","arxiv_id":"2511.14446","title":"Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding","abstract":"Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.","short_abstract":"Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-ba...","url_abs":"https://arxiv.org/abs/2511.14446","url_pdf":"https://arxiv.org/pdf/2511.14446v1","authors":"[\"Hong Gao\",\"Yiming Bao\",\"Xuezhen Tu\",\"Yutong Xu\",\"Yue Jin\",\"Yiyang Mu\",\"Bin Zhong\",\"Linan Yue\",\"Min-Ling Zhang\"]","published":"2025-11-18T12:43:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\",\"Generative Adversarial Network\"]","has_code":false}
