{"ID":2864959,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21788","arxiv_id":"2509.21788","title":"MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning","abstract":"Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.","short_abstract":"Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. \nTo address...","url_abs":"https://arxiv.org/abs/2509.21788","url_pdf":"https://arxiv.org/pdf/2509.21788v1","authors":"[\"Lihao Zheng\",\"Jiawei Chen\",\"Xintian Shen\",\"Hao Ma\",\"Tao Wei\"]","published":"2025-09-26T02:43:22Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609218,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864959,"paper_url":"https://arxiv.org/abs/2509.21788","paper_title":"MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning","repo_url":"https://github.com/ZEUS2035/MIRG-RL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
