{"ID":2876190,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00751","arxiv_id":"2509.00751","title":"EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions","abstract":"Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.","short_abstract":"Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit ca...","url_abs":"https://arxiv.org/abs/2509.00751","url_pdf":"https://arxiv.org/pdf/2509.00751v1","authors":"[\"Dinh-Khoi Vo\",\"Van-Loc Nguyen\",\"Minh-Triet Tran\",\"Trung-Nghia Le\"]","published":"2025-08-31T09:03:25Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":610273,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2876190,"paper_url":"https://arxiv.org/abs/2509.00751","paper_title":"EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions","repo_url":"https://github.com/vdkhoi20/EVENT-Retriever","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}