{"ID":2877884,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.18646","arxiv_id":"2508.18646","title":"Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap","abstract":"For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.","short_abstract":"For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. \nThis survey introduces an anthropomorphic evaluation paradigm through the len...","url_abs":"https://arxiv.org/abs/2508.18646","url_pdf":"https://arxiv.org/pdf/2508.18646v2","authors":"[\"Jun Wang\",\"Ninglun Gu\",\"Kailai Zhang\",\"Zijiao Zhang\",\"Yelun Bao\",\"Jin Yang\",\"Xu Yin\",\"Liwei Liu\",\"Yihuan Liu\",\"Pengyong Li\",\"Gary G. Yen\",\"Junchi Yan\"]","published":"2025-08-26T03:43:05Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610423,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877884,"paper_url":"https://arxiv.org/abs/2508.18646","paper_title":"Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap","repo_url":"https://github.com/onejune2018/Awesome-LLM-Eval","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
