{"ID":2889610,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20766","arxiv_id":"2507.20766","title":"Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback","abstract":"Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, ``Reasoning-Rendering-Visual-Feedback'' (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the ``Asymmetry of Verification'' principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training. Code is available at https://github.com/L-O-I/RRVF.","short_abstract":"Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning i...","url_abs":"https://arxiv.org/abs/2507.20766","url_pdf":"https://arxiv.org/pdf/2507.20766v4","authors":"[\"Yang Chen\",\"Yufan Shen\",\"Wenxuan Huang\",\"Sheng Zhou\",\"Qunshu Lin\",\"Xinyu Cai\",\"Zhi Yu\",\"Jiajun Bu\",\"Botian Shi\",\"Yu Qiao\"]","published":"2025-07-28T12:21:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611666,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2889610,"paper_url":"https://arxiv.org/abs/2507.20766","paper_title":"Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback","repo_url":"https://github.com/L-O-I/RRVF","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
