{"ID":2886518,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08292","arxiv_id":"2508.08292","title":"Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs","abstract":"Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving \u003e 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement \"boxed\" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.","short_abstract":"Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving \u003e 90% accuracy, and are increasingly compromised by training-set contamination. \nWe introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious Willia...","url_abs":"https://arxiv.org/abs/2508.08292","url_pdf":"https://arxiv.org/pdf/2508.08292v2","authors":"[\"Aryan Gulati\",\"Brando Miranda\",\"Eric Chen\",\"Emily Xia\",\"Kai Fronsdal\",\"Bruno Dumont\",\"Elyas Obbad\",\"Sanmi Koyejo\"]","published":"2025-08-05T17:57:50Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\",\"cs.LO\",\"cs.NE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611320,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886518,"paper_url":"https://arxiv.org/abs/2508.08292","paper_title":"Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs","repo_url":"https://github.com/brando90/putnam-axiom","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
