{"ID":2890550,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19360","arxiv_id":"2507.19360","title":"EA-ViT: Efficient Adaptation for Elastic Vision Transformer","abstract":"Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.","short_abstract":"Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensiv...","url_abs":"https://arxiv.org/abs/2507.19360","url_pdf":"https://arxiv.org/pdf/2507.19360v1","authors":"[\"Chen Zhu\",\"Wangbo Zhao\",\"Huiwen Zhang\",\"Samir Khaki\",\"Yuhao Zhou\",\"Weidong Tang\",\"Shuo Wang\",\"Zhihang Yuan\",\"Yuzhang Shang\",\"Xiaojiang Peng\",\"Kai Wang\",\"Dawei Yang\"]","published":"2025-07-25T15:11:09Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false,"code_links":[{"ID":611800,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890550,"paper_url":"https://arxiv.org/abs/2507.19360","paper_title":"EA-ViT: Efficient Adaptation for Elastic Vision Transformer","repo_url":"https://github.com/zcxcf/EA-ViT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
