{"ID":2848016,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.26418","arxiv_id":"2510.26418","title":"Chain-of-Thought Hijacking","abstract":"Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.","short_abstract":"Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavi...","url_abs":"https://arxiv.org/abs/2510.26418","url_pdf":"https://arxiv.org/pdf/2510.26418v4","authors":"[\"Jianli Zhao\",\"Tingchen Fu\",\"Rylan Schaeffer\",\"Mrinank Sharma\",\"Fazl Barez\"]","published":"2025-10-30T12:10:03Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[]","has_code":false}
