{"ID":2852887,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.17725","arxiv_id":"2510.17725","title":"AcademicEval: Live Long-Context LLM Benchmark","abstract":"Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \\textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \\textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \\textit{i.e.}, \\textsc{Title}, \\textsc{Abstract}, \\textsc{Introduction}, and \\textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \\textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \\textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \\textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval","short_abstract":"Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \\textsc{Ac...","url_abs":"https://arxiv.org/abs/2510.17725","url_pdf":"https://arxiv.org/pdf/2510.17725v1","authors":"[\"Haozhen Zhang\",\"Tao Feng\",\"Pengrui Han\",\"Jiaxuan You\"]","published":"2025-10-20T16:42:30Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608039,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852887,"paper_url":"https://arxiv.org/abs/2510.17725","paper_title":"AcademicEval: Live Long-Context LLM Benchmark","repo_url":"https://github.com/ulab-uiuc/AcademicEval","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}