{"ID":2855332,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13774","arxiv_id":"2510.13774","title":"UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations","abstract":"Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.","short_abstract":"Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusi...","url_abs":"https://arxiv.org/abs/2510.13774","url_pdf":"https://arxiv.org/pdf/2510.13774v1","authors":"[\"Dominik J. Mühlematter\",\"Lin Che\",\"Ye Hong\",\"Martin Raubal\",\"Nina Wiedemann\"]","published":"2025-10-15T17:26:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":608245,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2855332,"paper_url":"https://arxiv.org/abs/2510.13774","paper_title":"UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations","repo_url":"https://github.com/DominikM198/UrbanFusion","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}