{"ID":2823705,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.24796","arxiv_id":"2512.24796","title":"LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)","abstract":"While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of 24.0% -- doubling the performance of the best static baseline. These results empirically demonstrate that iterative refinement and dynamic library retrieval are not merely optimizations but strict necessities for neuro-symbolic reasoning in abstract domains. LeanCat offers a compact, reusable testbed for tracking progress toward reliable, research-level formalization.","short_abstract":"While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introd...","url_abs":"https://arxiv.org/abs/2512.24796","url_pdf":"https://arxiv.org/pdf/2512.24796v2","authors":"[\"Rongge Xu\",\"Hui Dai\",\"Yiming Fu\",\"Jiedong Jiang\",\"Tianjiao Nie\",\"Junkai Wang\",\"Holiverse Yang\",\"Zhi-Hao Zhang\"]","published":"2025-12-31T11:33:29Z","proceeding":"cs.LO","tasks":"[\"cs.LO\",\"cs.AI\",\"cs.FL\",\"cs.LG\",\"math.CT\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
