{"ID":2837454,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18933","arxiv_id":"2511.18933","title":"Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations","abstract":"Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project","short_abstract":"Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strateg...","url_abs":"https://arxiv.org/abs/2511.18933","url_pdf":"https://arxiv.org/pdf/2511.18933v1","authors":"[\"Ryan Wong\",\"Hosea David Yu Fei Ng\",\"Dhananjai Sharma\",\"Glenn Jun Jie Ng\",\"Kavishvaran Srinivasan\"]","published":"2025-11-24T09:38:11Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606692,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2837454,"paper_url":"https://arxiv.org/abs/2511.18933","paper_title":"Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations","repo_url":"https://github.com/Kuro0911/CS5446-Project","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}