{"ID":2922146,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T16:37:57.843543731Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00813","arxiv_id":"2606.00813","title":"Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs","abstract":"Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but only 14-18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.","short_abstract":"Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seed...","url_abs":"https://arxiv.org/abs/2606.00813","url_pdf":"https://arxiv.org/pdf/2606.00813v1","authors":"[\"Subhadip Mitra\"]","published":"2026-05-30T17:07:40Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.CL\",\"cs.ET\",\"cs.LG\",\"cs.NE\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":612644,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T02:42:49.606572591Z","DeletedAt":null,"paper_id":2922146,"paper_url":"https://arxiv.org/abs/2606.00813","paper_title":"Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs","repo_url":"https://github.com/bassrehab/red-queen","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
