{"ID":2832062,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06899","arxiv_id":"2512.06899","title":"Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models","abstract":"Transferable backdoors pose a severe threat to the Pre-trained Language Models (PLMs) supply chain, yet defensive research remains nascent, primarily relying on detecting anomalies in the output feature space. We identify a critical flaw that fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defense ineffective. To address this, we propose Patronus, a novel framework that uses input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\\geq98.7\\%$ backdoor detection recall and reduces attack success rates to clean settings, significantly outperforming all state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.","short_abstract":"Transferable backdoors pose a severe threat to the Pre-trained Language Models (PLMs) supply chain, yet defensive research remains nascent, primarily relying on detecting anomalies in the output feature space. 
We identify a critical flaw that fine-tuning on downstream tasks inevitably modifies model parameters, shiftin...","url_abs":"https://arxiv.org/abs/2512.06899","url_pdf":"https://arxiv.org/pdf/2512.06899v1","authors":"[\"Tianhang Zhao\",\"Wei Du\",\"Haodong Zhao\",\"Sufeng Duan\",\"Gongshen Liu\"]","published":"2025-12-07T15:51:56Z","proceeding":"cs.CR","tasks":"[\"cs.CR\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606197,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832062,"paper_url":"https://arxiv.org/abs/2512.06899","paper_title":"Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models","repo_url":"https://github.com/zth855/Patronus","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
