{"ID":2889059,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.21520","arxiv_id":"2507.21520","title":"Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering","abstract":"Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the BlackPearl team's solutions to all tasks in Meta KDD Cup'25. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.","short_abstract":"Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context compreh...","url_abs":"https://arxiv.org/abs/2507.21520","url_pdf":"https://arxiv.org/pdf/2507.21520v1","authors":"[\"Zijian Zhang\",\"Xiaocheng Zhang\",\"Yang Zhou\",\"Zhimin Lin\",\"Peng Yan\"]","published":"2025-07-29T06:07:59Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
