{"ID":2836720,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.19900","arxiv_id":"2511.19900","title":"Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning","abstract":"Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. 
Our code is available at https://github.com/aiming-lab/Agent0.","short_abstract":"Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or rewar...","url_abs":"https://arxiv.org/abs/2511.19900","url_pdf":"https://arxiv.org/pdf/2511.19900v2","authors":"[\"Jiaqi Liu\",\"Kaiwen Xiong\",\"Peng Xia\",\"Yiyang Zhou\",\"Haonian Ji\",\"Lu Feng\",\"Siwei Han\",\"Mingyu Ding\",\"Huaxiu Yao\"]","published":"2025-11-25T04:15:14Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false,"code_links":[{"ID":606625,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2836720,"paper_url":"https://arxiv.org/abs/2511.19900","paper_title":"Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning","repo_url":"https://github.com/aiming-lab/Agent0","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}