{"ID":2861197,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03567","arxiv_id":"2510.03567","title":"Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs","abstract":"With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \\emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \\emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.","short_abstract":"With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization f...","url_abs":"https://arxiv.org/abs/2510.03567","url_pdf":"https://arxiv.org/pdf/2510.03567v3","authors":"[\"Fatmazohra Rezkellah\",\"Ramzi Dakhmouche\"]","published":"2025-10-03T23:32:21Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.CR\",\"cs.CY\",\"math.OC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
