{"ID":2838531,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.17106","arxiv_id":"2511.17106","title":"ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better","abstract":"Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Finally, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\\%$ improvement on MathVista with MIMO-VL-RL, while reducing inference latency by $51.4\\%$ and shortening output token length by $24.5\\%$.","short_abstract":"Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visua...","url_abs":"https://arxiv.org/abs/2511.17106","url_pdf":"https://arxiv.org/pdf/2511.17106v1","authors":"[\"Yuan Zhang\",\"Ming Lu\",\"Junwen Pan\",\"Tao Huang\",\"Kuan Cheng\",\"Qi She\",\"Shanghang Zhang\"]","published":"2025-11-21T10:11:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}
