{"ID":2867659,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17677","arxiv_id":"2509.17677","title":"EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving","abstract":"Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reasoning and therefore fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show clear performance stratification across difficulty levels: model accuracy declines with task complexity, degrades under minor perturbations, and remains substantially below human performance on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/AI4Engi/EngiBench.","short_abstract":"Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reason...","url_abs":"https://arxiv.org/abs/2509.17677","url_pdf":"https://arxiv.org/pdf/2509.17677v2","authors":"[\"Xiyuan Zhou\",\"Xinlei Wang\",\"Yirui He\",\"Yang Wu\",\"Ruixi Zou\",\"Yuheng Cheng\",\"Yulu Xie\",\"Wenxuan Liu\",\"Huan Zhao\",\"Yan Xu\",\"Jinjin Gu\",\"Junhua Zhao\"]","published":"2025-09-22T12:20:27Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609502,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867659,"paper_url":"https://arxiv.org/abs/2509.17677","paper_title":"EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving","repo_url":"https://github.com/AI4Engi/EngiBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}