{"ID":2828552,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.08837","arxiv_id":"2601.08837","title":"From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda","abstract":"Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.","short_abstract":"Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. \nWe introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morpholog...","url_abs":"https://arxiv.org/abs/2601.08837","url_pdf":"https://arxiv.org/pdf/2601.08837v2","authors":"[\"Piercosma Bisconti\",\"Marcello Galisai\",\"Matteo Prandi\",\"Federico Pierucci\",\"Olga Sorokoletova\",\"Francesco Giarrusso\",\"Vincenzo Suriani\",\"Marcantonio Bracale Syrnikov\",\"Daniele Nardi\"]","published":"2025-12-16T14:55:58Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CY\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
