{"ID":2888335,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.23453","arxiv_id":"2507.23453","title":"Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems","abstract":"This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.","short_abstract":"This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluati...","url_abs":"https://arxiv.org/abs/2507.23453","url_pdf":"https://arxiv.org/pdf/2507.23453v2","authors":"[\"Lijia Liu\",\"Takumi Kondo\",\"Kyohei Atarashi\",\"Koh Takeuchi\",\"Jiyi Li\",\"Shigeru Saito\",\"Hisashi Kashima\"]","published":"2025-07-31T11:29:42Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}