{"ID":2891057,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18743","arxiv_id":"2507.18743","title":"SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks","abstract":"Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.","short_abstract":"Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understandi...","url_abs":"https://arxiv.org/abs/2507.18743","url_pdf":"https://arxiv.org/pdf/2507.18743v3","authors":"[\"Yiguo He\",\"Xinjun Cheng\",\"Junjie Zhu\",\"Chunping Qiu\",\"Jun Wang\",\"Xichuan Zhang\",\"Qiangjuan Huang\",\"Ke Yang\"]","published":"2025-07-24T18:45:30Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611844,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2891057,"paper_url":"https://arxiv.org/abs/2507.18743","paper_title":"SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks","repo_url":"https://github.com/YiguoHe/SAR-TEXT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}