{"ID":2833872,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02497","arxiv_id":"2512.02497","title":"A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation","abstract":"Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source code and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.","short_abstract":"Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptati...","url_abs":"https://arxiv.org/abs/2512.02497","url_pdf":"https://arxiv.org/pdf/2512.02497v1","authors":"[\"Wenjing Yu\",\"Shuo Jiang\",\"Yifei Chen\",\"Shuo Chang\",\"Yuanhan Wang\",\"Beining Wu\",\"Jie Dong\",\"Mingxuan Liu\",\"Shenghao Zhu\",\"Feiwei Qin\",\"Changmiao Wang\",\"Qiyuan Tian\"]","published":"2025-12-02T07:40:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":606360,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2833872,"paper_url":"https://arxiv.org/abs/2512.02497","paper_title":"A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation","repo_url":"https://github.com/wenjing-gg/MedSeg-TTA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}