{"ID":2893379,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.12845","arxiv_id":"2507.12845","title":"SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning","abstract":"Image captioning has emerged as a crucial task in the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques, namely Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, which demonstrates its potential for application in real-life remote sensing image systems.","short_abstract":"Image captioning has emerged as a crucial task in the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, ai...","url_abs":"https://arxiv.org/abs/2507.12845","url_pdf":"https://arxiv.org/pdf/2507.12845v1","authors":"[\"Khang Truong\",\"Lam Pham\",\"Hieu Tang\",\"Jasmin Lampert\",\"Martin Boyer\",\"Son Phan\",\"Truong Nguyen\"]","published":"2025-07-17T07:11:01Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Transformer\"]","has_code":false}