{"ID":2865695,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20645","arxiv_id":"2509.20645","title":"Anticipatory Evaluation of Language Models","abstract":"Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where models estimate performance from task descriptions and experimental configurations alone, without access to dataset instances. To support systematic study, we curate PRECOG, a corpus of description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a test split using papers published after the evaluated models' knowledge cutoff. Experiments show the task is challenging but feasible: reasoning models achieve a non-trivial forecasting skill reaching mean absolute error as low as 9.9 at high-confidence thresholds. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter resource allocation.","short_abstract":"Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where m...","url_abs":"https://arxiv.org/abs/2509.20645","url_pdf":"https://arxiv.org/pdf/2509.20645v3","authors":"[\"Jungsoo Park\",\"Ethan Mendes\",\"Gabriel Stanovsky\",\"Alan Ritter\"]","published":"2025-09-25T01:02:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
