{"ID":2853983,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15430","arxiv_id":"2510.15430","title":"Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models","abstract":"Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.","short_abstract":"Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, w...","url_abs":"https://arxiv.org/abs/2510.15430","url_pdf":"https://arxiv.org/pdf/2510.15430v2","authors":"[\"Shuang Liang\",\"Zhihao Xu\",\"Jialing Tao\",\"Hui Xue\",\"Xiting Wang\"]","published":"2025-10-17T08:37:45Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://anonymous.4open.science/r/Learning-to-Detect-51CB\"]","has_code":false}
