{"ID":2844188,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.07645","arxiv_id":"2511.07645","title":"A Self-Improving Architecture for Dynamic Safety in Large Language Models","abstract":"Context: Large Language Models (LLMs) rely on static, pre-deployment safety mechanisms that cannot adapt to adversarial threats discovered after release. Objective: To design a software architecture enabling LLM-based systems to autonomously detect safety failures and synthesize defense policies at runtime, without retraining or manual intervention. Method: We propose the Self-Improving Safety Framework (SISF), grounded in the MAPE-K reference model. The framework couples a target LLM with a feedback loop: an Adjudicator detects breaches, a Policy Synthesis Module generates dual-mechanism defense policies (heuristic and semantic), and a Warden enforces them. We conducted seven experiments (10,061 evaluations) across four model families. Results: Across five reproducibility trials, SISF achieved a mean Attack Success Rate (ASR) of 0.27% (+/-0.15%), autonomously generating 240 policies per trial. Cross-model evaluation confirmed deployment portability. A held-out test showed a 68.5% proactive interception rate on unseen attacks. Stacked behind Llama Guard 4, the combined defense reduced residual ASR from 7.88% to 0.00%. Ablation confirmed both heuristic and semantic policy types are architecturally required. Conclusion: Self-adaptive architecture is a viable approach to LLM safety. SISF achieves sub-1% ASR through synchronous output monitoring, progressively shifting enforcement to fast, local Warden policies via the MAPE-K loop, offering a new pattern for building resilient AI systems.","short_abstract":"Context: Large Language Models (LLMs) rely on static, pre-deployment safety mechanisms that cannot adapt to adversarial threats discovered after release. Objective: To design a software architecture enabling LLM-based systems to autonomously detect safety failures and synthesize defense policies at runtime, without ret...","url_abs":"https://arxiv.org/abs/2511.07645","url_pdf":"https://arxiv.org/pdf/2511.07645v2","authors":"[\"Tyler Slater\"]","published":"2025-11-10T21:39:40Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"cs.CR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
