{"ID":2834767,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20633","arxiv_id":"2512.20633","title":"Enhancing Lung Cancer Treatment Outcome Prediction through Semantic Feature Engineering Using Large Language Models","abstract":"Accurate prediction of treatment outcomes in lung cancer remains challenging due to the sparsity, heterogeneity, and contextual overload of real-world electronic health data. Traditional models often fail to capture semantic information across multimodal streams, while large-scale fine-tuning approaches are impractical in clinical workflows. We introduce a framework that uses Large Language Models (LLMs) as Goal-oriented Knowledge Curators (GKC) to convert laboratory, genomic, and medication data into high-fidelity, task-aligned features. Unlike generic embeddings, GKC produces representations tailored to the prediction objective and operates as an offline preprocessing step that integrates naturally into hospital informatics pipelines. Using a lung cancer cohort (N=184), we benchmarked GKC against expert-engineered features, direct text embeddings, and an end-to-end transformer. Our approach achieved a mean AUROC of 0.803 (95% CI: 0.799-0.807) and outperformed all baselines. An ablation study further confirmed the complementary value of combining all three modalities. These results show that the quality of semantic representation is a key determinant of predictive accuracy in sparse clinical data settings. By reframing LLMs as knowledge curation engines rather than black-box predictors, this work demonstrates a scalable, interpretable, and workflow-compatible pathway for advancing AI-driven decision support in oncology.","short_abstract":"Accurate prediction of treatment outcomes in lung cancer remains challenging due to the sparsity, heterogeneity, and contextual overload of real-world electronic health data. Traditional models often fail to capture semantic information across multimodal streams, while large-scale fine-tuning approaches are impractical...","url_abs":"https://arxiv.org/abs/2512.20633","url_pdf":"https://arxiv.org/pdf/2512.20633v1","authors":"[\"MunHwan Lee\",\"Shaika Chowdhury\",\"Xiaodi Li\",\"Sivaraman Rajaganapathy\",\"Eric W Klee\",\"Ping Yang\",\"Terence Sio\",\"Liewei Wang\",\"James Cerhan\",\"Nansu NA Zong\"]","published":"2025-12-01T23:56:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}