{"ID":2850572,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21311","arxiv_id":"2510.21311","title":"FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning","abstract":"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images -- particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose \\textsc{FineRS}, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. \\textsc{FineRS} adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textural response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. % Additionally, we present \\textsc{FineRS}-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets in complex high-resolution scenes. Experimental results on \\textsc{FineRS}-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.","short_abstract":"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images -- particularly when dealing w...","url_abs":"https://arxiv.org/abs/2510.21311","url_pdf":"https://arxiv.org/pdf/2510.21311v1","authors":"[\"Lu Zhang\",\"Jiazuo Yu\",\"Haomiao Xiong\",\"Ping Hu\",\"Yunzhi Zhuge\",\"Huchuan Lu\",\"You He\"]","published":"2025-10-24T10:14:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}