{"ID":2890193,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19973","arxiv_id":"2507.19973","title":"Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization","abstract":"Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss' Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss' Kappa = 0.893) or GPT-CoT (Fleiss' Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.","short_abstract":"Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categ...","url_abs":"https://arxiv.org/abs/2507.19973","url_pdf":"https://arxiv.org/pdf/2507.19973v1","authors":"[\"Ebrahim Rasromani\",\"Stella K. Kang\",\"Yanqi Xu\",\"Beisong Liu\",\"Garvit Luhadia\",\"Wan Fung Chui\",\"Felicia L. Pasadyn\",\"Yu Chih Hung\",\"Julie Y. An\",\"Edwin Mathieu\",\"Zehui Gu\",\"Carlos Fernandez-Granda\",\"Ammar A. Javed\",\"Greg D. Sacks\",\"Tamas Gonda\",\"Chenchan Huang\",\"Yiqiu Shen\"]","published":"2025-07-26T15:02:32Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}