{"ID":2847425,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27155","arxiv_id":"2510.27155","title":"Hierarchical Fusion of Local and Global Visual Features with Mixture-of-Experts for Remote Sensing Image Scene Classification","abstract":"Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Although CNN-based methods excel at extracting local inductive biases, and Mamba-based approaches demonstrate impressive capabilities in efficiently capturing global sequential context, relying on a single paradigm restricts the model's ability to simultaneously characterize fine-grained textures and complex spatial structures. To tackle this, we propose a parallel heterogeneous encoder, a hierarchical fusion module designed to achieve effective local-global co-representation. It consists of two parallel pathways: a local visual encoder for extracting multi-scale local visual features, and a global visual encoder for capturing efficient global visual features. The core innovation lies in its hierarchical fusion module, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a mixture-of-experts classifier head, which dynamically dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that our model achieves 93.72%, 95.54%, and 96.92% accuracy, surpassing SOTA methods with an optimal balance of performance and efficiency. \nCode is available at https://anonymous.4open.science/r/classification-41DF.","short_abstract":"Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Although CNN-based methods excel at extracting local inductive biases, and Mamba-based approaches demonstrate impressive capabilities in efficiently ca...","url_abs":"https://arxiv.org/abs/2510.27155","url_pdf":"https://arxiv.org/pdf/2510.27155v2","authors":"[\"Yuanhao Tang\",\"Xuechao Zou\",\"Zhengpei Hu\",\"Junliang Xing\",\"Chengkun Zhang\",\"Jianqiang Huang\"]","published":"2025-10-31T03:55:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Convolutional Neural Network\"]","project_urls":"[\"https://anonymous.4open.science/r/classification-41DF\"]","has_code":false}