{"ID":2839824,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14106","arxiv_id":"2511.14106","title":"Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT","abstract":"Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed \\textbf{Stealth Fine-Tuning}. Our method elicits harmful reasoning traces through \\textbf{segment-level interference} and reuses the self-generated outputs as supervised fine-tuning data. To facilitate this, we introduce a \\textbf{turn-based weighted} loss that minimizes distribution shift. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.66\\% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. \\textcolor{red}{\\textbf{Disclaimer: This paper contains content that may be disturbing or offensive.}}","short_abstract":"Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed \\textbf{Stealth Fine-...","url_abs":"https://arxiv.org/abs/2511.14106","url_pdf":"https://arxiv.org/pdf/2511.14106v2","authors":"[\"Le Yu\",\"Zhengyue Zhao\",\"Yawen Zheng\",\"Yunhao Liu\"]","published":"2025-11-18T03:45:09Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\",\"LoRA\"]","has_code":false}
