{"ID":2868011,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16861","arxiv_id":"2509.16861","title":"AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software","abstract":"Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply--to as low as 12%--when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post deployment. \nWe release our AdaptiveGuard and studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.","short_abstract":"Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through...","url_abs":"https://arxiv.org/abs/2509.16861","url_pdf":"https://arxiv.org/pdf/2509.16861v1","authors":"[\"Rui Yang\",\"Michael Fu\",\"Chakkrit Tantithamthavorn\",\"Chetan Arora\",\"Gunel Gulmammadova\",\"Joey Chua\"]","published":"2025-09-21T01:22:42Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.AI\",\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609532,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2868011,"paper_url":"https://arxiv.org/abs/2509.16861","paper_title":"AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software","repo_url":"https://github.com/awsm-research/AdaptiveGuard","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
