{"ID":2838724,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.17448","arxiv_id":"2511.17448","title":"MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models","abstract":"Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.","short_abstract":"Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited...","url_abs":"https://arxiv.org/abs/2511.17448","url_pdf":"https://arxiv.org/pdf/2511.17448v1","authors":"[\"Yuqi Li\",\"Junhao Dong\",\"Chuanguang Yang\",\"Shiping Wen\",\"Piotr Koniusz\",\"Tingwen Huang\",\"Yingli Tian\",\"Yew-Soon Ong\"]","published":"2025-11-21T17:46:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606803,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2838724,"paper_url":"https://arxiv.org/abs/2511.17448","paper_title":"MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models","repo_url":"https://github.com/itsnotacie/MMT-ARD","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}