{"ID":2825183,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.21582","arxiv_id":"2512.21582","title":"LLM-Free Image Captioning Evaluation in Reference-Flexible Settings","abstract":"We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image--caption and caption--caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics, that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.","short_abstract":"We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always dem...","url_abs":"https://arxiv.org/abs/2512.21582","url_pdf":"https://arxiv.org/pdf/2512.21582v1","authors":"[\"Shinnosuke Hirano\",\"Yuiga Wada\",\"Kazuki Matsuda\",\"Seitaro Otsuki\",\"Komei Sugiura\"]","published":"2025-12-25T08:59:57Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://pearl.kinsta.page/\"]","has_code":false,"code_links":[{"ID":605646,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2825183,"paper_url":"https://arxiv.org/abs/2512.21582","paper_title":"LLM-Free Image Captioning Evaluation in Reference-Flexible Settings","repo_url":"https://github.com/hiranohachiman/Pearl","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":605647,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2825183,"paper_url":"https://arxiv.org/abs/2512.21582","paper_title":"LLM-Free Image Captioning Evaluation in Reference-Flexible Settings","repo_url":"https://github.com/nerfies/nerfies.github.io","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}