{"ID":2837298,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18721","arxiv_id":"2511.18721","title":"Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM","abstract":"The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict \"k-unstable\" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, \"(k, $\\varepsilon$)-unstable,\" to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.","short_abstract":"The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict \"k-unstable\" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more...","url_abs":"https://arxiv.org/abs/2511.18721","url_pdf":"https://arxiv.org/pdf/2511.18721v3","authors":"[\"Adarsh Kumarappan\",\"Ayushi Mehrotra\"]","published":"2025-11-24T03:25:16Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
