{"ID":2837158,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.20785","arxiv_id":"2511.20785","title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","abstract":"Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables \"Thinking with Long Videos\" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. 
Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .","short_abstract":"Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming...","url_abs":"https://arxiv.org/abs/2511.20785","url_pdf":"https://arxiv.org/pdf/2511.20785v3","authors":"[\"Zuhao Yang\",\"Sudong Wang\",\"Kaichen Zhang\",\"Keming Wu\",\"Sicong Leng\",\"Yifan Zhang\",\"Bo Li\",\"Chengwei Qin\",\"Shijian Lu\",\"Xingxuan Li\",\"Lidong Bing\"]","published":"2025-11-25T19:22:48Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\"]","has_code":false,"code_links":[{"ID":606661,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2837158,"paper_url":"https://arxiv.org/abs/2511.20785","paper_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","repo_url":"https://github.com/EvolvingLMMs-Lab/LongVT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
