{"ID":2844379,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06396","arxiv_id":"2511.06396","title":"Efficient LLM Safety Evaluation through Multi-Agent Debate","abstract":"Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured multi-agent debate can improve judge reliability while keeping backbone size and cost modest. To do so, we introduce HAJailBench, a human-annotated jailbreak benchmark with 11,100 labeled interactions spanning diverse attack methods and target models, and we pair it with a Multi-Agent Judge framework in which critic, defender, and judge agents debate under a shared safety rubric. On HAJailBench, the framework improves over matched small-model prompt baselines and prior multi-agent judges, while remaining more economical than GPT-4o under the evaluated pricing snapshot. Ablation results further show that a small number of debate rounds is sufficient to capture most of the gain. Together, these results support structured, value-aligned debate as a practical design for scalable LLM safety evaluation.","short_abstract":"Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured multi-agent debate can improve judge reliability while keeping backbone size and cost modest. To do so, we introduce HAJailBench, a hum...","url_abs":"https://arxiv.org/abs/2511.06396","url_pdf":"https://arxiv.org/pdf/2511.06396v3","authors":"[\"Dachuan Lin\",\"Guobin Shen\",\"Zihao Yang\",\"Tianrong Liu\",\"Dongcheng Zhao\",\"Yi Zeng\"]","published":"2025-11-09T14:06:55Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
