{"ID":2886155,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03127","arxiv_id":"2508.03127","title":"Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery","abstract":"Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing $196,262$ image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.","short_abstract":"Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imager...","url_abs":"https://arxiv.org/abs/2508.03127","url_pdf":"https://arxiv.org/pdf/2508.03127v3","authors":"[\"Sai Ma\",\"Zhuang Li\",\"John A Taylor\"]","published":"2025-08-05T06:16:46Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611279,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886155,"paper_url":"https://arxiv.org/abs/2508.03127","paper_title":"Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery","repo_url":"https://github.com/papersubmit1/landsat30-au","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}