{"ID":2856022,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10880","arxiv_id":"2510.10880","title":"Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales","abstract":"Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.","short_abstract":"Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual...","url_abs":"https://arxiv.org/abs/2510.10880","url_pdf":"https://arxiv.org/pdf/2510.10880v1","authors":"[\"Zhaofang Qian\",\"Hardy Chen\",\"Zeyu Wang\",\"Li Zhang\",\"Zijun Wang\",\"Xiaoke Huang\",\"Hui Liu\",\"Xianfeng Tang\",\"Zeyu Zheng\",\"Haoqin Tu\",\"Cihang Xie\",\"Yuyin Zhou\"]","published":"2025-10-13T01:12:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608303,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2856022,"paper_url":"https://arxiv.org/abs/2510.10880","paper_title":"Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales","repo_url":"https://github.com/UCSC-VLAA/EarthWhere","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
