{"ID":2874104,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.04802","arxiv_id":"2509.04802","title":"Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs","abstract":"As large language models are increasingly deployed into agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce \\textbf{AgentSeer}, an observability tool that decomposes agentic executions into granular \\emph{action-component} graphs; we use this decomposition to rigorously quantify the gap between model-level and agent-level jailbreaking risk via cross-model validation on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks; and we leverage action-graph risk signals to automate iterative prompt hardening against direct and iterative jailbreak attacks. Stark differences are revealed between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47\\% ASR) versus Gemini-2.0-flash (50.00\\% ASR), with both models showing susceptibility to social engineering. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover \"agentic-only\" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60\\% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, with agent transfer operations as the highest-risk tools, revealing semantic rather than syntactic vulnerability mechanisms. 
Direct attack transfer from model-level to agentic contexts shows degraded performance of successful prompts (GPT-OSS-20B: 57\\% human injection ASR; Gemini-2.0-flash: 28\\%), while context-aware iterative attacks successfully compromise objectives that failed at the model level, confirming systematic vulnerability gaps. Action-based prompt improvement substantially reduces action-averaged agentic jailbreak success on GPT-OSS-20B (direct: 45.3\\%","short_abstract":"As large language models are increasingly deployed into agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce \\textbf{AgentSeer}, an observability tool that decomposes agentic executions i...","url_abs":"https://arxiv.org/abs/2509.04802","url_pdf":"https://arxiv.org/pdf/2509.04802v3","authors":"[\"Ilham Wicaksono\",\"Zekun Wu\",\"Rahul Patel\",\"Theo King\",\"Adriano Koshiyama\",\"Philip Treleaven\"]","published":"2025-09-05T04:36:17Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
