{"ID":2881496,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.13229","arxiv_id":"2508.13229","title":"RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning","abstract":"Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven \"annotation-reasoning-annotation\" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.Code and resources are available at: https://github.com/HSH55/RISE.","short_abstract":"Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fi...","url_abs":"https://arxiv.org/abs/2508.13229","url_pdf":"https://arxiv.org/pdf/2508.13229v3","authors":"[\"Suhang Hu\",\"Wei Hu\",\"Yuhang Su\",\"Fan Zhang\"]","published":"2025-08-17T17:24:35Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610815,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2881496,"paper_url":"https://arxiv.org/abs/2508.13229","paper_title":"RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning","repo_url":"https://github.com/HSH55/RISE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}