{"ID":2868473,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16749","arxiv_id":"2509.16749","title":"Evaluating LLM Generated Detection Rules in Cybersecurity","abstract":"LLMs are increasingly pervasive in the security environment, with limited measures of their effectiveness, which limits trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a holdout set-based methodology to measure the effectiveness of LLM-generated security rules in comparison to a human-generated corpus of rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted evaluation of the effectiveness of an LLM-based security rule generator. This methodology is illustrated using rules from Sublime Security's detection team and those written by Sublime Security's Automated Detection Engineer (ADE), with a thorough analysis of ADE's skills presented in the results section.","short_abstract":"LLMs are increasingly pervasive in the security environment, with limited measures of their effectiveness, which limits trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a hol...","url_abs":"https://arxiv.org/abs/2509.16749","url_pdf":"https://arxiv.org/pdf/2509.16749v1","authors":"[\"Anna Bertiger\",\"Bobby Filar\",\"Aryan Luthra\",\"Stefano Meschiari\",\"Aiden Mitchell\",\"Sam Scholten\",\"Vivek Sharath\"]","published":"2025-09-20T17:21:51Z","proceeding":"cs.CR","tasks":"[\"cs.CR\"]","methods":"[\"Large Language Model\"]","has_code":false}
