{"ID":2875813,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.01259","arxiv_id":"2509.01259","title":"ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization","abstract":"Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. \nThe code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.","short_abstract":"Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from...","url_abs":"https://arxiv.org/abs/2509.01259","url_pdf":"https://arxiv.org/pdf/2509.01259v1","authors":"[\"Thinh-Phuc Nguyen\",\"Thanh-Hai Nguyen\",\"Gia-Huy Dinh\",\"Lam-Huy Nguyen\",\"Minh-Triet Tran\",\"Trung-Nghia Le\"]","published":"2025-09-01T08:48:33Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":610245,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2875813,"paper_url":"https://arxiv.org/abs/2509.01259","paper_title":"ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization","repo_url":"https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
