{"ID":2884821,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09201","arxiv_id":"2508.09201","title":"Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models","abstract":"Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. To mitigate these risks, existing detection methods are essential, yet they face two major challenges: generalization and accuracy. While learning-based methods trained on specific attacks fail to generalize to unseen attacks, learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD operates by first extracting layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vectors classifiers, and then converting the high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder. Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.","short_abstract":"Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. To mitigate these risks, existing detection methods are essential, yet they face two major challenges: generalization and accuracy. \nWhile learning-based methods trained on specific attacks fail to generaliz...","url_abs":"https://arxiv.org/abs/2508.09201","url_pdf":"https://arxiv.org/pdf/2508.09201v4","authors":"[\"Shuang Liang\",\"Zhihao Xu\",\"Jiaqi Weng\",\"Jialing Tao\",\"Hui Xue\",\"Xiting Wang\"]","published":"2025-08-08T16:13:28Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://anonymous.4open.science/r/Learning-to-Detect-51CB\"]","has_code":false}
