{"ID":2892934,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.13609","arxiv_id":"2507.13609","title":"CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks","abstract":"Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs, often lacking structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions of existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly enhance inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4, with large boosts in causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured CoT-style supervision framework for improving compositional video reasoning.","short_abstract":"Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level...","url_abs":"https://arxiv.org/abs/2507.13609","url_pdf":"https://arxiv.org/pdf/2507.13609v1","authors":"[\"Yanan Wang\",\"Julio Vizcarra\",\"Zhi Li\",\"Hao Niu\",\"Mori Kurokawa\"]","published":"2025-07-18T02:29:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
