{"ID":2826228,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19302","arxiv_id":"2512.19302","title":"Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing","abstract":"Large Vision--Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on the EarthReason dataset by reaching a test cIoU of 75.60% and gIoU of 73.36%, yielding absolute improvements of 6.47% and 2.40% over the strongest baseline, respectively. Zero-shot evaluations across three referring segmentation benchmarks reveal a fundamental distinction in task inductive bias, exposing a distinct divide between semantic-level grounding -- which aggregates all regions matching a conceptual intent -- and instance-level tasks that demand discrete object separation. We further found that compact segmenters outperform larger ones under semantic-level supervision by mitigating textural over-segmentation, and that unconstrained negative prompting is unstable in heterogeneous aerial backgrounds. Together, these findings demonstrate that optimizing LVLMs through direct segmentation feedback offers a scalable framework for complex geospatial reasoning, effectively bridging the gap between abstract language understanding and precise pixel-level execution.","short_abstract":"Large Vision--Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across ta...","url_abs":"https://arxiv.org/abs/2512.19302","url_pdf":"https://arxiv.org/pdf/2512.19302v2","authors":"[\"Xu Zhang\",\"Junyao Ge\",\"Yang Zheng\",\"Kaitai Guo\",\"Jimin Liang\"]","published":"2025-12-22T11:46:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
