{"ID":2851410,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21027","arxiv_id":"2510.21027","title":"Customizing Open Source LLMs for Quantitative Medication Attribute Extraction across Heterogeneous EHR Systems","abstract":"Harmonizing medication data across Electronic Health Record (EHR) systems is a persistent barrier to monitoring medications for opioid use disorder (MOUD). In heterogeneous EHR systems, key prescription attributes are scattered across differently formatted fields and freetext notes. We present a practical framework that customizes open source large language models (LLMs), including Llama, Qwen, Gemma, and MedGemma, to extract a unified set of MOUD prescription attributes (prescription date, drug name, duration, total quantity, daily quantity, and refills) from heterogeneous, site specific data and compute a standardized metric of medication coverage, MOUD days, per patient. Our pipeline processes records directly in a fixed JSON schema, followed by lightweight normalization and cross-field consistency checks. We evaluate the system on prescription level EHR data from five clinics in a national OUD study (25,605 records from 1,257 patients), using a previously annotated benchmark of 10,369 records (776 patients) as the ground truth. Performance is reported as coverage (share of records with a valid, matchable output) and record-level exact-match accuracy. Larger models perform best overall: Qwen2.5-32B achieves 93.4% coverage with 93.0% exact-match accuracy across clinics, and MedGemma-27B attains 93.1%/92.2%.\nA brief error review highlights three common issues and fixes: imputing missing dosage fields using within-drug norms, handling monthly/weekly injectables (e.g., Vivitrol) by setting duration from the documented schedule, and adding unit checks to prevent mass units (e.g., \"250 g\") from being misread as daily counts. By removing brittle, site-specific ETL and supporting local, privacy-preserving deployment, this approach enables consistent cross-site analyses of MOUD exposure, adherence, and retention in real-world settings.","short_abstract":"Harmonizing medication data across Electronic Health Record (EHR) systems is a persistent barrier to monitoring medications for opioid use disorder (MOUD). In heterogeneous EHR systems, key prescription attributes are scattered across differently formatted fields and freetext notes. We present a practical framework tha...","url_abs":"https://arxiv.org/abs/2510.21027","url_pdf":"https://arxiv.org/pdf/2510.21027v1","authors":"[\"Zhe Fei\",\"Mehmet Yigit Turali\",\"Shreyas Rajesh\",\"Xinyang Dai\",\"Huyen Pham\",\"Pavan Holur\",\"Yuhui Zhu\",\"Larissa Mooney\",\"Yih-Ing Hser\",\"Vwani Roychowdhury\"]","published":"2025-10-23T22:27:10Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
