{"ID":2826393,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19691","arxiv_id":"2512.19691","title":"Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight","abstract":"Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset validated by physicians, our recomputed labels agree with physician ground truth 74% of the time (95% CI, 60-84%) versus 20% for the originals (95% CI, 11-33%). Using original labels to evaluate frontier LLMs underestimates accuracy by 16-23 percentage points. In a controlled reinforcement-learning experiment, a model trained on recomputed labels outperforms one trained on originals by 13.5 percentage points (95% CI, 10.6-16.6%) on physician-labeled instances, and this advantage extends to related medical tasks. LLM-assisted benchmarks can propagate systematic errors into both evaluation and post-training unless actively stewarded.","short_abstract":"Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop st...","url_abs":"https://arxiv.org/abs/2512.19691","url_pdf":"https://arxiv.org/pdf/2512.19691v3","authors":"[\"Junze Ye\",\"Daniel Tawfik\",\"Alex J. Goodell\",\"Nikhil V. Kotha\",\"Mark K. Buyyounouski\",\"Mohsen Bayati\"]","published":"2025-12-22T18:59:34Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"stat.AP\"]","methods":"[\"Large Language Model\"]","has_code":false}