{"ID":2875185,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.03728","arxiv_id":"2509.03728","title":"PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming","abstract":"Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either \"red-teaming expert\" personas or \"regular AI user\" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the \"mutation distance\" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.","short_abstract":"Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are...","url_abs":"https://arxiv.org/abs/2509.03728","url_pdf":"https://arxiv.org/pdf/2509.03728v3","authors":"[\"Wesley Hanwen Deng\",\"Sunnie S. Y. Kim\",\"Akshita Jha\",\"Ken Holstein\",\"Motahhare Eslami\",\"Lauren Wilcox\",\"Leon A Gatys\"]","published":"2025-09-03T21:20:38Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.HC\"]","methods":"[\"LoRA\"]","has_code":false}
