{"ID":2826607,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20677","arxiv_id":"2512.20677","title":"Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models","abstract":"The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories, including reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9$\\times$ higher discovery rate with 89\\% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.","short_abstract":"The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. \nExisting red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in h...","url_abs":"https://arxiv.org/abs/2512.20677","url_pdf":"https://arxiv.org/pdf/2512.20677v4","authors":"[\"Zhang Wei\",\"Hanxuan Chen\",\"Peilu Hu\",\"Zhenyuan Wei\",\"Chenwei Liang\",\"Jing Luo\",\"Ziyi Ni\",\"Hao Yan\",\"Li Mei\",\"Shengning Lang\",\"Kuan Lu\",\"Xi Xiao\",\"Zhimo Han\",\"Yijin Wang\",\"Yichao Zhang\",\"Chen Yang\",\"Junfeng Hao\",\"Jiayi Gu\",\"Riyang Bao\",\"Mu-Jiang-Shan Wang\"]","published":"2025-12-21T19:12:44Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
