{"ID":2872739,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.08897","arxiv_id":"2509.08897","title":"Recurrence Meets Transformers for Universal Multimodal Retrieval","abstract":"With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2","short_abstract":"With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. \nIn this paper, we pro...","url_abs":"https://arxiv.org/abs/2509.08897","url_pdf":"https://arxiv.org/pdf/2509.08897v1","authors":"[\"Davide Caffagni\",\"Sara Sarto\",\"Marcella Cornia\",\"Lorenzo Baraldi\",\"Rita Cucchiara\"]","published":"2025-09-10T18:00:29Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\",\"cs.MM\"]","methods":"[\"RAG\",\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609995,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2872739,"paper_url":"https://arxiv.org/abs/2509.08897","paper_title":"Recurrence Meets Transformers for Universal Multimodal Retrieval","repo_url":"https://github.com/aimagelab/ReT-2","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
