{"ID":2855154,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13430","arxiv_id":"2510.13430","title":"Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps","abstract":"This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches: native collection, translation, and synthetic generation, discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.","short_abstract":"This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Targ...","url_abs":"https://arxiv.org/abs/2510.13430","url_pdf":"https://arxiv.org/pdf/2510.13430v2","authors":"[\"Ahmed Alzubaidi\",\"Shaikha Alsuwaidi\",\"Basma El Amel Boussaha\",\"Leen AlQadi\",\"Omar Alkaabi\",\"Mohammed Alyafeai\",\"Hamza Alobeidli\",\"Hakim Hacid\"]","published":"2025-10-15T11:25:33Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false}