{"ID":2837919,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18305","arxiv_id":"2511.18305","title":"DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition","abstract":"Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose $\\textbf{DiVE-k}$, $\\textbf{Di}$fferential $\\textbf{V}$isual r$\\textbf{E}$asoning using top-$\\textbf{k}$ generations, a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios. Our code is available $\\href{https://github.com/raja-kumar/DiVE-k}{here}$.","short_abstract":"Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often b...","url_abs":"https://arxiv.org/abs/2511.18305","url_pdf":"https://arxiv.org/pdf/2511.18305v2","authors":"[\"Raja Kumar\",\"Arka Sadhu\",\"Ram Nevatia\"]","published":"2025-11-23T06:04:50Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606722,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2837919,"paper_url":"https://arxiv.org/abs/2511.18305","paper_title":"DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition","repo_url":"https://github.com/raja-kumar/DiVE-k","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
