{"ID":2830247,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10596","arxiv_id":"2512.10596","title":"Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval","abstract":"Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the \"semantic gap\", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. 
This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.","short_abstract":"Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the \"semantic gap\", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods o...","url_abs":"https://arxiv.org/abs/2512.10596","url_pdf":"https://arxiv.org/pdf/2512.10596v1","authors":"[\"J. Xiao\",\"Y. Guo\",\"X. Zi\",\"K. Thiyagarajan\",\"C. Moreira\",\"M. Prasad\"]","published":"2025-12-11T12:43:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}