{"ID":2851383,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21885","arxiv_id":"2510.21885","title":"Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning","abstract":"Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.","short_abstract":"Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling...","url_abs":"https://arxiv.org/abs/2510.21885","url_pdf":"https://arxiv.org/pdf/2510.21885v1","authors":"[\"Anh Pham\",\"Mihir Thalanki\",\"Michael Sun\",\"Aditya Chaloo\",\"Ankita Gupta\",\"Tian Xia\",\"Aditya Mate\",\"Ehimwenma Nosakhare\",\"Soundararajan Srinivasan\"]","published":"2025-10-23T20:34:52Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}
