{"ID":2880820,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.14153","arxiv_id":"2508.14153","title":"LENS: Learning to Segment Anything with Unified Reinforced Reasoning","abstract":"Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.","short_abstract":"Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize...","url_abs":"https://arxiv.org/abs/2508.14153","url_pdf":"https://arxiv.org/pdf/2508.14153v2","authors":"[\"Lianghui Zhu\",\"Bin Ouyang\",\"Yuxuan Zhang\",\"Tianheng Cheng\",\"Rui Hu\",\"Haocheng Shen\",\"Longjin Ran\",\"Xiaoxin Chen\",\"Li Yu\",\"Wenyu Liu\",\"Xinggang Wang\"]","published":"2025-08-19T17:59:53Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":610732,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2880820,"paper_url":"https://arxiv.org/abs/2508.14153","paper_title":"LENS: Learning to Segment Anything with Unified Reinforced Reasoning","repo_url":"https://github.com/hustvl/LENS","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}