{"ID":2847506,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27267","arxiv_id":"2510.27267","title":"MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models","abstract":"As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.","short_abstract":"As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational s...","url_abs":"https://arxiv.org/abs/2510.27267","url_pdf":"https://arxiv.org/pdf/2510.27267v1","authors":"[\"Kangkun Mao\",\"Jinru Ding\",\"Jiayuan Chen\",\"Mouxiao Bian\",\"Ruiyao Chen\",\"Xinwei Peng\",\"Sijie Ren\",\"Linyang Li\",\"Jie Xu\"]","published":"2025-10-31T08:07:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607535,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2847506,"paper_url":"https://arxiv.org/abs/2510.27267","paper_title":"MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models","repo_url":"https://github.com/maokangkun/MedCalc-Eval","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
