{"ID":2836213,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21025","arxiv_id":"2511.21025","title":"CaptionQA: Is Your Caption as Useful as the Image Itself?","abstract":"Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.","short_abstract":"Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchma...","url_abs":"https://arxiv.org/abs/2511.21025","url_pdf":"https://arxiv.org/pdf/2511.21025v2","authors":"[\"Shijia Yang\",\"Yunong Liu\",\"Bohan Zhai\",\"Ximeng Sun\",\"Zicheng Liu\",\"Emad Barsoum\",\"Manling Li\",\"Chenfeng Xu\"]","published":"2025-11-26T03:43:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":606578,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2836213,"paper_url":"https://arxiv.org/abs/2511.21025","paper_title":"CaptionQA: Is Your Caption as Useful as the Image Itself?","repo_url":"https://github.com/bronyayang/CaptionQA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
