{"ID":2825286,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22293","arxiv_id":"2512.22293","title":"Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against","abstract":"Warning-framed content in training data (e.g., \"DO NOT USE - this code is vulnerable\") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: \"describing X\" and \"performing X\" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call \"stealth slip\", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.","short_abstract":"Warning-framed content in training data (e.g., \"DO NOT USE - this code is vulnerable\") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models giv...","url_abs":"https://arxiv.org/abs/2512.22293","url_pdf":"https://arxiv.org/pdf/2512.22293v1","authors":"[\"Tsogt-Ochir Enkhbayar\"]","published":"2025-12-25T20:07:57Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\",\"cs.CR\"]","methods":"[\"Language Model\"]","has_code":false}
