{"ID":3084700,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T20:54:36.964885582Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05463","arxiv_id":"2606.05463","title":"PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage","abstract":"Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.","short_abstract":"Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded...","url_abs":"https://arxiv.org/abs/2606.05463","url_pdf":"https://arxiv.org/pdf/2606.05463v1","authors":"[\"Keqi Han\",\"Ryan Young\",\"Annabel Strauss\",\"Lindsey Hughes\",\"Katharine M. Nesbitt\",\"Nicole Schueler\",\"Che Ngufor\",\"Carl Yang\",\"Yuan Xue\",\"Zhijun Yin\"]","published":"2026-06-03T21:41:39Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
