{"ID":2866378,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19775","arxiv_id":"2509.19775","title":"bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs","abstract":"With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)--each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (\u003e99\\% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.","short_abstract":"With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning f...","url_abs":"https://arxiv.org/abs/2509.19775","url_pdf":"https://arxiv.org/pdf/2509.19775v1","authors":"[\"Wence Ji\",\"Jiancan Wu\",\"Aiying Li\",\"Shuyi Zhang\",\"Junkang Wu\",\"An Zhang\",\"Xiang Wang\",\"Xiangnan He\"]","published":"2025-09-24T05:56:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CR\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"RLHF\"]","has_code":false}
