{"ID":2830155,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10384","arxiv_id":"2512.10384","title":"Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies","abstract":"Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: \\textit{data construction} and \\textit{training process}, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1\\% and open-world data boosts FROW benchmark accuracy by 10\\%-20\\% and content accuracy by 6\\%-12\\%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10\\%. The benchmark will be available at https://github.com/pc-inno/FROW.","short_abstract":"Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. \nTo address t...","url_abs":"https://arxiv.org/abs/2512.10384","url_pdf":"https://arxiv.org/pdf/2512.10384v1","authors":"[\"Cong Pang\",\"Hongtao Yu\",\"Zixuan Chen\",\"Lewei Lu\",\"Xin Lou\"]","published":"2025-12-11T07:48:34Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606004,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2830155,"paper_url":"https://arxiv.org/abs/2512.10384","paper_title":"Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies","repo_url":"https://github.com/pc-inno/FROW","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}