{"ID":2830478,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11143","arxiv_id":"2512.11143","title":"Automated Penetration Testing with LLM Agents and Classical Planning","abstract":"While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the \"Planner-Executor-Perceptor (PEP)\" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that the out-of-the-box Claude Code and Sonnet 4.5 exhibit superior penetration capabilities observed to date, substantially outperforming all prior systems. However, a detailed analysis of their testing processes reveals specific strengths and limitations; notably, LLM agents struggle with maintaining coherent long-horizon plans, performing complex reasoning, and effectively utilizing specialized tools. These limitations significantly constrain its overall capability, efficiency, and stability. To address these limitations, we propose CHECKMATE, a framework that integrates enhanced classical planning with LLM agents, providing an external, structured \"brain\" that mitigates the inherent weaknesses of LLM agents. Our evaluation shows that CHECKMATE outperforms the state-of-the-art system (Claude Code) in penetration capability, improving benchmark success rates by over 20%. In addition, it delivers substantially greater stability, cutting both time and monetary costs by more than 50%.","short_abstract":"While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the \"Planner-Executor-Perceptor (PEP)\" design paradigm and use it to systematically review existing work and identify the key c...","url_abs":"https://arxiv.org/abs/2512.11143","url_pdf":"https://arxiv.org/pdf/2512.11143v1","authors":"[\"Lingzhi Wang\",\"Xinyi Shi\",\"Ziyu Li\",\"Yi Jiang\",\"Shiyu Tan\",\"Yuhao Jiang\",\"Junjie Cheng\",\"Wenyuan Chen\",\"Xiangmin Shen\",\"Zhenyuan LI\",\"Yan Chen\"]","published":"2025-12-11T22:04:39Z","proceeding":"cs.CR","tasks":"[\"cs.CR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}