{"ID":2834356,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.01353","arxiv_id":"2512.01353","title":"The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search","abstract":"Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Existing approaches overwhelmingly operate within the prompt-optimization paradigm: whether through traditional algorithmic search or recent agent-based workflows, the resulting prompts typically retain malicious semantic signals that modern guardrails are primed to detect. In contrast, we identify a deeper, largely overlooked vulnerability stemming from the highly interconnected nature of an LLM's internal knowledge. This structure allows harmful objectives to be realized by weaving together sequences of benign sub-queries, each of which individually evades detection. To exploit this loophole, we introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base. The CKA-Agent issues locally innocuous queries, uses model responses to guide exploration across multiple paths, and ultimately assembles the aggregated information to achieve the original harmful objective. Evaluated across state-of-the-art commercial LLMs (Gemini2.5-Flash/Pro, GPT-oss-120B, Claude-Haiku-4.5), CKA-Agent consistently achieves over 95% success rates even against strong guardrails, underscoring the severity of this vulnerability and the urgent need for defenses against such knowledge-decomposition attacks. Our codes are available at https://github.com/Graph-COM/CKA-Agent.","short_abstract":"Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Existing approaches overwhelmingly operate within the prompt-optimization paradigm: whether through traditional algorithmic search or recent agent-based workflows, the resulting prompts typically...","url_abs":"https://arxiv.org/abs/2512.01353","url_pdf":"https://arxiv.org/pdf/2512.01353v3","authors":"[\"Rongzhe Wei\",\"Peizhi Niu\",\"Xinjie Shen\",\"Tony Tu\",\"Yifan Li\",\"Ruihan Wu\",\"Eli Chien\",\"Pin-Yu Chen\",\"Olgica Milenkovic\",\"Pan Li\"]","published":"2025-12-01T07:05:23Z","proceeding":"cs.CR","tasks":"[\"cs.CR\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":606404,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2834356,"paper_url":"https://arxiv.org/abs/2512.01353","paper_title":"The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search","repo_url":"https://github.com/Graph-COM/CKA-Agent","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
