{"ID":2884723,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06259","arxiv_id":"2508.06259","title":"SIFThinker: Spatially-Aware Image Focus for Visual Reasoning","abstract":"Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware \"think-with-images\" framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.","short_abstract":"Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their f...","url_abs":"https://arxiv.org/abs/2508.06259","url_pdf":"https://arxiv.org/pdf/2508.06259v5","authors":"[\"Zhangquan Chen\",\"Ruihui Zhao\",\"Chuwei Luo\",\"Mingze Sun\",\"Xinlei Yu\",\"Yangyang Kang\",\"Ruqi Huang\"]","published":"2025-08-08T12:26:20Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611109,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884723,"paper_url":"https://arxiv.org/abs/2508.06259","paper_title":"SIFThinker: Spatially-Aware Image Focus for Visual Reasoning","repo_url":"https://github.com/zhangquanchen/SIFThinker","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
