{"ID":2850929,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20129","arxiv_id":"2510.20129","title":"SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models","abstract":"Large Language Models (LLMs) remain vulnerable to jailbreak attacks, where adversarially crafted prompts induce policy-violating responses despite safety alignment. Existing defenses typically improve safety through external filtering, auxiliary guardrails, or decoding-time control. However, these interventions often reduce practical deployability because they may require additional model access, introduce extra inference cost, or affect benign-task utility. In this paper, we propose Safety-Aware Intent Defense (SAID), a training-free jailbreak defense framework based on intent-level safety probing. SAID first distills potentially obfuscated user inputs into concise core intents using the target model itself. It then applies a validated safety prefix to probe each distilled intent and elicit the model's safety-aware response. Finally, a conservative aggregation rule rejects the original request if any distilled intent is identified as unsafe. This design enables black-box-compatible defense without updating model parameters or modifying the decoding process. Experiments on four open-source LLMs under six representative jailbreak attacks show that SAID achieves state-of-the-art defense performance in reducing harmful responses while maintaining competitive utility on benign tasks. Further analyses on prefix variants, hierarchical distillation, and inference efficiency demonstrate that SAID provides a practical safety-utility trade-off for securing LLMs against jailbreak threats.","short_abstract":"Large Language Models (LLMs) remain vulnerable to jailbreak attacks, where adversarially crafted prompts induce policy-violating responses despite safety alignment. \nExisting defenses typically improve safety through external filtering, auxiliary guardrails, or decoding-time control. However, these interventions often r...","url_abs":"https://arxiv.org/abs/2510.20129","url_pdf":"https://arxiv.org/pdf/2510.20129v2","authors":"[\"Yulong Chen\",\"Qi Zhang\",\"Jiawen Zhang\",\"Yadong Liu\",\"Mu Li\",\"Jie Wen\",\"Yong Xu\"]","published":"2025-10-23T02:07:54Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
