{"ID":2831165,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08511","arxiv_id":"2512.08511","title":"Thinking with Images via Self-Calling Agent","abstract":"Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\\%$ with $\\sim 75\\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.","short_abstract":"Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reaso...","url_abs":"https://arxiv.org/abs/2512.08511","url_pdf":"https://arxiv.org/pdf/2512.08511v2","authors":"[\"Wenxi Yang\",\"Yuzhong Zhao\",\"Fang Wan\",\"Qixiang Ye\"]","published":"2025-12-09T11:53:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\"]","has_code":false,"code_links":[{"ID":606099,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831165,"paper_url":"https://arxiv.org/abs/2512.08511","paper_title":"Thinking with Images via Self-Calling Agent","repo_url":"https://github.com/YWenxi/think-with-images-through-self-calling","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
