{"ID":2885664,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04227","arxiv_id":"2508.04227","title":"Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting","abstract":"Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique \"alignment tax,\" where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.","short_abstract":"Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross...","url_abs":"https://arxiv.org/abs/2508.04227","url_pdf":"https://arxiv.org/pdf/2508.04227v2","authors":"[\"Yuyang Liu\",\"Qiuhe Hong\",\"Linlan Huang\",\"Alexandra Gomez-Villa\",\"Dipam Goswami\",\"Xialei Liu\",\"Joost van de Weijer\",\"Yonghong Tian\"]","published":"2025-08-06T09:03:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611226,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2885664,"paper_url":"https://arxiv.org/abs/2508.04227","paper_title":"Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting","repo_url":"https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
