{"ID":2845377,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.04460","arxiv_id":"2511.04460","title":"V-Thinker: Interactive Thinking with Images","abstract":"Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising \"Thinking with Images\" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.","short_abstract":"Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. \nRecent advances in vision-centric reasoning explore a promising \"Thinking with Images\" paradigm for LMMs, marking a shift from image-assisted reasonin...","url_abs":"https://arxiv.org/abs/2511.04460","url_pdf":"https://arxiv.org/pdf/2511.04460v2","authors":"[\"Runqi Qiao\",\"Qiuna Tan\",\"Minghan Yang\",\"Guanting Dong\",\"Peiqing Yang\",\"Shiqiang Lang\",\"Enhui Wan\",\"Xiaowan Wang\",\"Yida Xu\",\"Lan Yang\",\"Chong Sun\",\"Chen Li\",\"Jing Lyu\",\"Honggang Zhang\"]","published":"2025-11-06T15:32:29Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
