{"ID":3083943,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05958","arxiv_id":"2606.05958","title":"Steering Vectors are an Adversarial Attack Surface","abstract":"Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a \\emph{stealth data poisoning attack} silently compromises this pipeline. By substituting $4{-}6\\%$ of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of $20{-}55\\%$, $+19\\%$ to $+51\\%$ over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover ${\\approx}82\\%$ of the ASR gap without harming benign behavior.","short_abstract":"Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a \\emph{stealth data poisoning attack} silently compromises this pipeline....","url_abs":"https://arxiv.org/abs/2606.05958","url_pdf":"https://arxiv.org/pdf/2606.05958v1","authors":"[\"Abzal Aidakhmetov\",\"Donato Crisostomi\",\"Tommaso Mencattini\",\"Adrian Robert Minut\",\"Iacopo Masi\",\"Emanuele Rodolà\"]","published":"2026-06-04T09:56:48Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
