{"ID":2847008,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00903","arxiv_id":"2511.00903","title":"ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval","abstract":"Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61% improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.","short_abstract":"Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compu...","url_abs":"https://arxiv.org/abs/2511.00903","url_pdf":"https://arxiv.org/pdf/2511.00903v1","authors":"[\"Ahmed Masry\",\"Megh Thakkar\",\"Patrice Bechard\",\"Sathwik Tejaswi Madhusudhan\",\"Rabiul Awal\",\"Shambhavi Mishra\",\"Akshay Kalkunte Suresh\",\"Srivatsava Daruru\",\"Enamul Hoque\",\"Spandana Gella\",\"Torsten Scholak\",\"Sai Rajeswar\"]","published":"2025-11-02T11:51:20Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"RAG\"]","has_code":false}