{"ID":2891036,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18631","arxiv_id":"2507.18631","title":"Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment","abstract":"With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.","short_abstract":"With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuni...","url_abs":"https://arxiv.org/abs/2507.18631","url_pdf":"https://arxiv.org/pdf/2507.18631v2","authors":"[\"Hao Li\",\"Lijun Li\",\"Zhenghao Lu\",\"Xianyi Wei\",\"Rui Li\",\"Jing Shao\",\"Lei Sha\"]","published":"2025-07-24T17:59:24Z","proceeding":"cs.CR","tasks":"[\"cs.CR\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":611840,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2891036,"paper_url":"https://arxiv.org/abs/2507.18631","paper_title":"Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment","repo_url":"https://github.com/LLLeoLi/LARF","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
