{"ID":2878338,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17580","arxiv_id":"2508.17580","title":"UQ: Assessing Language Models on Unsolved Questions","abstract":"Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.","short_abstract":"Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-...","url_abs":"https://arxiv.org/abs/2508.17580","url_pdf":"https://arxiv.org/pdf/2508.17580v1","authors":"[\"Fan Nie\",\"Ken Ziyu Liu\",\"Zihao Wang\",\"Rui Sun\",\"Wei Liu\",\"Weijia Shi\",\"Huaxiu Yao\",\"Linjun Zhang\",\"Andrew Y. Ng\",\"James Zou\",\"Sanmi Koyejo\",\"Yejin Choi\",\"Percy Liang\",\"Niklas Muennighoff\"]","published":"2025-08-25T01:07:59Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://uq.stanford.edu\"]","has_code":false,"code_links":[{"ID":610473,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2878338,"paper_url":"https://arxiv.org/abs/2508.17580","paper_title":"UQ: Assessing Language Models on Unsolved Questions","repo_url":"https://github.com/uq-project/UQ","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
