{"ID":2876874,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00177","arxiv_id":"2509.00177","title":"Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders","abstract":"This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir","short_abstract":"This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this...","url_abs":"https://arxiv.org/abs/2509.00177","url_pdf":"https://arxiv.org/pdf/2509.00177v1","authors":"[\"Faizan Farooq Khan\",\"Vladan Stojnić\",\"Zakaria Laskar\",\"Mohamed Elhoseiny\",\"Giorgos Tolias\"]","published":"2025-08-29T18:24:38Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610323,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2876874,"paper_url":"https://arxiv.org/abs/2509.00177","paper_title":"Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders","repo_url":"https://github.com/faixan-khan/cletir","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
