{"ID":2857658,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09517","arxiv_id":"2510.09517","title":"StatEval: A Comprehensive Benchmark for Large Language Models in Statistics","abstract":"Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \\textbf{StatEval}, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57\\% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.","short_abstract":"Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. 
To address this gap, we introduce \\textbf{StatEval}, the first comprehensive benchmark dedicated to statistic...","url_abs":"https://arxiv.org/abs/2510.09517","url_pdf":"https://arxiv.org/pdf/2510.09517v1","authors":"[\"Yuchen Lu\",\"Run Yang\",\"Yichen Zhang\",\"Shuguang Yu\",\"Runpeng Dai\",\"Ziwei Wang\",\"Jiayi Xiang\",\"Wenxin E\",\"Siran Gao\",\"Xinyao Ruan\",\"Yirui Huang\",\"Chenjing Xi\",\"Haibo Hu\",\"Yueming Fu\",\"Qinglan Yu\",\"Xiaobing Wei\",\"Jiani Gu\",\"Rui Sun\",\"Jiaxuan Jia\",\"Fan Zhou\"]","published":"2025-10-10T16:28:43Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
