{"ID":2888129,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01064","arxiv_id":"2508.01064","title":"Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation","abstract":"In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models-primarily optimized for natural images-tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. 
Code is available at https://github.com/FengheTan9/Mobile-U-ViT.","short_abstract":"In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models-primarily optimized for natural images-tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combi...","url_abs":"https://arxiv.org/abs/2508.01064","url_pdf":"https://arxiv.org/pdf/2508.01064v1","authors":"[\"Fenghe Tang\",\"Bingkun Nian\",\"Jianrui Ding\",\"Wenxin Ma\",\"Quan Quan\",\"Chengqi Dong\",\"Jie Yang\",\"Wei Liu\",\"S. Kevin Zhou\"]","published":"2025-08-01T20:45:42Z","proceeding":"eess.IV","tasks":"[\"eess.IV\",\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\",\"Convolutional Neural Network\"]","has_code":false,"code_links":[{"ID":611505,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2888129,"paper_url":"https://arxiv.org/abs/2508.01064","paper_title":"Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation","repo_url":"https://github.com/FengheTan9/Mobile-U-ViT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
