{"ID":3084871,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:49:02.101151534Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05736","arxiv_id":"2606.05736","title":"VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning","abstract":"Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.","short_abstract":"Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlook...","url_abs":"https://arxiv.org/abs/2606.05736","url_pdf":"https://arxiv.org/pdf/2606.05736v1","authors":"[\"Shufan Zhang\",\"Ziyue Lin\",\"Bairun Wang\",\"Lei Jin\",\"Xuanding Ding\",\"Xinzhu Ma\",\"Kunlin Yang\"]","published":"2026-06-04T05:55:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
