{"ID":2846895,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.02102","arxiv_id":"2511.02102","title":"Enhancing Phenotype Discovery in Electronic Health Records through Prior Knowledge-Guided Unsupervised Learning","abstract":"Objectives: Unsupervised learning with electronic health record (EHR) data has shown promise for phenotype discovery, but approaches typically disregard existing clinical information, limiting interpretability. We operationalize a Bayesian latent class framework for phenotyping that incorporates domain-specific knowledge to improve clinical meaningfulness of EHR-derived phenotypes and illustrate its utility by identifying an asthma sub-phenotype informed by features of Type 2 (T2) inflammation. Materials and methods: We illustrate a framework for incorporating clinical knowledge into a Bayesian latent class model via informative priors to guide unsupervised clustering toward clinically relevant subgroups. This approach models missingness, accounting for potential missing-not-at-random patterns, and provides patient-level probabilities for phenotype assignment with uncertainty. Using reusable and flexible code, we applied the model to a large asthma EHR cohort, specifying informative priors for T2 inflammation-related features and weakly informative priors for other clinical variables, allowing the data to inform posterior distributions. Results and Conclusion: Using encounter data from January 2017 to February 2024 for 44,642 adult asthma patients, we found a bimodal posterior distribution of phenotype assignment, indicating clear class separation. The T2 inflammation-informed class (38.7%) was characterized by elevated eosinophil levels and allergy markers, plus high healthcare utilization and medication use, despite weakly informative priors on the latter variables. These patterns suggest an \"uncontrolled T2-high\" sub-phenotype. This demonstrates how our Bayesian latent class modeling approach supports hypothesis generation and cohort identification in EHR-based studies of heterogeneous diseases without well-established phenotype definitions.","short_abstract":"Objectives: Unsupervised learning with electronic health record (EHR) data has shown promise for phenotype discovery, but approaches typically disregard existing clinical information, limiting interpretability. We operationalize a Bayesian latent class framework for phenotyping that incorporates domain-specific knowled...","url_abs":"https://arxiv.org/abs/2511.02102","url_pdf":"https://arxiv.org/pdf/2511.02102v1","authors":"[\"Melanie Mayer\",\"Kimberly Lactaoen\",\"Gary E. Weissman\",\"Blanca E. Himes\",\"Rebecca A. Hubbard\"]","published":"2025-11-03T22:25:07Z","proceeding":"stat.AP","tasks":"[\"stat.AP\",\"stat.ML\"]","methods":"[]","has_code":false}
