{"ID":2860819,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02768","arxiv_id":"2510.02768","title":"A Granular Study of Safety Pretraining under Model Abliteration","abstract":"Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.","short_abstract":"Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive...","url_abs":"https://arxiv.org/abs/2510.02768","url_pdf":"https://arxiv.org/pdf/2510.02768v1","authors":"[\"Shashank Agnihotri\",\"Jonas Jakubassa\",\"Priyam Dey\",\"Sachin Goyal\",\"Bernt Schiele\",\"Venkatesh Babu Radhakrishnan\",\"Margret Keuper\"]","published":"2025-10-03T07:01:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608759,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2860819,"paper_url":"https://arxiv.org/abs/2510.02768","paper_title":"A Granular Study of Safety Pretraining under Model Abliteration","repo_url":"https://github.com/shashankskagnihotri/safety_pretraining","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
