{"ID":2867897,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18293","arxiv_id":"2509.18293","title":"Evaluating Large Language Models for Detecting Antisemitism","abstract":"Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition. We also study how LLMs understand and explain their decisions given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, and find that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy well, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability. Code and resources available at: https://github.com/idramalab/quantify-llm-explanations","short_abstract":"Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specificall...","url_abs":"https://arxiv.org/abs/2509.18293","url_pdf":"https://arxiv.org/pdf/2509.18293v2","authors":"[\"Jay Patel\",\"Hrudayangam Mehta\",\"Jeremy Blackburn\"]","published":"2025-09-22T18:23:21Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CY\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609524,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867897,"paper_url":"https://arxiv.org/abs/2509.18293","paper_title":"Evaluating Large Language Models for Detecting Antisemitism","repo_url":"https://github.com/idramalab/quantify-llm-explanations","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}