{"ID":2868338,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16538","arxiv_id":"2509.16538","title":"VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis","abstract":"We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. Project page is available at https://dipta007.github.io/VC-Inspector","short_abstract":"We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a re...","url_abs":"https://arxiv.org/abs/2509.16538","url_pdf":"https://arxiv.org/pdf/2509.16538v3","authors":"[\"Shubhashis Roy Dipta\",\"Tz-Ying Wu\",\"Subarna Tripathi\"]","published":"2025-09-20T05:04:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[]","has_code":false}