{"ID":2826559,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.18745","arxiv_id":"2512.18745","title":"InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search","abstract":"The ability for AI agents to \"think with images\" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .","short_abstract":"The ability for AI agents to \"think with images\" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap,...","url_abs":"https://arxiv.org/abs/2512.18745","url_pdf":"https://arxiv.org/pdf/2512.18745v1","authors":"[\"Kaican Li\",\"Lewei Yao\",\"Jiannan Wu\",\"Tiezheng Yu\",\"Jierun Chen\",\"Haoli Bai\",\"Lu Hou\",\"Lanqing Hong\",\"Wei Zhang\",\"Nevin L. Zhang\"]","published":"2025-12-21T14:23:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\"]","has_code":false,"code_links":[{"ID":605752,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2826559,"paper_url":"https://arxiv.org/abs/2512.18745","paper_title":"InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search","repo_url":"https://github.com/m-Just/InSight-o3","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
