{"ID":2887450,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01956","arxiv_id":"2508.01956","title":"Scaling Clinician-Grade Feature Generation from Clinical Notes with Multi-Agent Language Models","abstract":"Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual, unscalable clinical abstraction. In this study, we first established a rigorous patient-level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. As a high-fidelity ground truth, this labor-intensive process provided the blueprint for SNOW (Scalable Note-to-Outcome Workflow), a transparent multi-agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. On 5-year cancer recurrence prediction, SNOW (AUC-ROC 0.767) achieved performance comparable to manual CFG (0.762) and outperformed structured baselines, clinician-guided LLM extraction, and six representational feature generation (RFG) approaches. Once configured, SNOW produced the full patient-level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort by approximately 48-fold versus manual CFG. To test scalability where manual CFG is infeasible, we deployed SNOW on an external heart failure with preserved ejection fraction (HFpEF) cohort from MIMIC-IV (n=2,084); without task-specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30-day (SNOW: 0.851) and 1-year (SNOW: 0.763) mortality prediction. These results demonstrate that a modular LLM agent-based system can scale expert-level feature generation from clinical notes, while enabling interpretable use of unstructured EHR text in outcome prediction and preserving generalizability across a variety of settings and conditions.","short_abstract":"Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual, unscalable clinical abstraction. In this study, we first established a rigorous patient-level Clinician Feature Gen...","url_abs":"https://arxiv.org/abs/2508.01956","url_pdf":"https://arxiv.org/pdf/2508.01956v2","authors":"[\"Jiayi Wang\",\"Jacqueline Jil Vallon\",\"Nikhil V. Kotha\",\"Neil Panjwani\",\"Xi Ling\",\"Margaret Redfield\",\"Sushmita Vij\",\"Sandy Srinivas\",\"John Leppert\",\"Mark K. Buyyounouski\",\"Mohsen Bayati\"]","published":"2025-08-03T23:45:18Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.LG\",\"cs.MA\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}