{"ID":2859413,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06036","arxiv_id":"2510.06036","title":"Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?","abstract":"Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed as \\textbf{refusal cliff}: many poorly-aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3\\% of these heads can reduce attack success rates below 10\\%. Building on these mechanistic insights, we propose \\textbf{Cliff-as-a-Judge}, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7\\% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.","short_abstract":"Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. \nIn this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens....","url_abs":"https://arxiv.org/abs/2510.06036","url_pdf":"https://arxiv.org/pdf/2510.06036v1","authors":"[\"Qingyu Yin\",\"Chak Tou Leong\",\"Linyi Yang\",\"Wenxuan Huang\",\"Wenjie Li\",\"Xiting Wang\",\"Jaehong Yoon\",\"YunXing\",\"XingYu\",\"Jinjin Gu\"]","published":"2025-10-07T15:32:59Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CR\"]","methods":"[]","has_code":false}
