{"ID":2852396,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.18941","arxiv_id":"2510.18941","title":"ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge","abstract":"Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench","short_abstract":"Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizi...","url_abs":"https://arxiv.org/abs/2510.18941","url_pdf":"https://arxiv.org/pdf/2510.18941v2","authors":"[\"Zhilin Wang\",\"Jaehun Jung\",\"Ximing Lu\",\"Shizhe Diao\",\"Ellie Evans\",\"Jiaqi Zeng\",\"Pavlo Molchanov\",\"Yejin Choi\",\"Jan Kautz\",\"Yi Dong\"]","published":"2025-10-21T17:59:44Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607997,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852396,"paper_url":"https://arxiv.org/abs/2510.18941","paper_title":"ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge","repo_url":"https://github.com/NVlabs/ProfBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
