{"ID":2894268,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.11040","arxiv_id":"2507.11040","title":"Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery","abstract":"We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.","short_abstract":"We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our appr...","url_abs":"https://arxiv.org/abs/2507.11040","url_pdf":"https://arxiv.org/pdf/2507.11040v1","authors":"[\"Nicolas Drapier\",\"Aladine Chetouani\",\"Aurélien Chateigner\"]","published":"2025-07-15T07:10:34Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false}
