{"ID":2828695,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14917","arxiv_id":"2512.14917","title":"Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings","abstract":"Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls, deeply nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, we construct a dataset of 1200 reasoning problems from two sources: existing code reasoning benchmarks and popular GitHub Python repositories. Our pipeline leverages static and dynamic program analysis to automatically serialize/deserialize compound, complex, and custom types galore in real-world code, going far beyond only primitive types used in prior studies. A key feature of our dataset is categorizing each reasoning problem as Lower Complexity (LC) or Higher Complexity (HC) via a principled majority-vote mechanism over nine diverse and interpretable code-complexity metrics, yielding two well-separated, semantically meaningful categories of problem difficulty suitable for precise calibration of LLM reasoning ability. This categorization shows that the problems used in existing code-reasoning evaluation mostly belong to the LC category, failing to represent real-world complexity.","short_abstract":"Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls, deeply nested constructs, and non-primitive complex types. Evaluating LLMs under...","url_abs":"https://arxiv.org/abs/2512.14917","url_pdf":"https://arxiv.org/pdf/2512.14917v3","authors":"[\"Changshu Liu\",\"Alireza Ghazanfari\",\"Yang Chen\",\"Reyhaneh Jabbarvand\"]","published":"2025-12-16T21:12:53Z","proceeding":"cs.SE","tasks":"[\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}