{"ID":2874499,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.12221","arxiv_id":"2509.12221","title":"MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors","abstract":"Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier \"refusal-direction\" edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLMs deployment in security-sensitive domains.","short_abstract":"Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier \"refusal-direction\" edits can bypass those layers, but they rely on a single vector that indiscriminately...","url_abs":"https://arxiv.org/abs/2509.12221","url_pdf":"https://arxiv.org/pdf/2509.12221v1","authors":"[\"Xin Tong\",\"Zhi Lin\",\"Jingya Wang\",\"Meng Han\",\"Bo Jin\"]","published":"2025-09-04T07:16:06Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"cs.CR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
