{"ID":2850557,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21285","arxiv_id":"2510.21285","title":"When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models","abstract":"Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.","short_abstract":"Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while faili...","url_abs":"https://arxiv.org/abs/2510.21285","url_pdf":"https://arxiv.org/pdf/2510.21285v4","authors":"[\"Yingzhi Mao\",\"Chunkang Zhang\",\"Junxiang Wang\",\"Xinyan Guan\",\"Boxi Cao\",\"Yaojie Lu\",\"Hongyu Lin\",\"Xianpei Han\",\"Le Sun\"]","published":"2025-10-24T09:32:25Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[]","has_code":false}
