{"ID":2879633,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15149","arxiv_id":"2508.15149","title":"A Robust BERT-Based Deep Learning Model for Automated Cancer Type Extraction from Unstructured Pathology Reports","abstract":"The accurate extraction of clinical information from electronic medical records is particularly critical to clinical research but requires considerable trained expertise and manual labor. In this study we developed a robust system for automated extraction of the specific cancer types from pathology reports using a fine-tuned RoBERTa model, for the purpose of supporting precision oncology research. This model significantly outperformed the baseline model and a large language model, Mistral 7B, achieving an F1_Bertscore of 0.98 and an overall exact match of 80.61%. This fine-tuning approach demonstrates the potential for a scalable system that can integrate seamlessly into the molecular tumour board process. Fine-tuning domain-specific models for precision tasks in oncology may pave the way for more efficient and accurate clinical information extraction.","short_abstract":"The accurate extraction of clinical information from electronic medical records is particularly critical to clinical research but requires considerable trained expertise and manual labor. In this study we developed a robust system for automated extraction of the specific cancer types for the purpose of supporting precision onco...","url_abs":"https://arxiv.org/abs/2508.15149","url_pdf":"https://arxiv.org/pdf/2508.15149v1","authors":"[\"Minh Tran\",\"Jeffery C. Chan\",\"Min Li Huang\",\"Maya Kansara\",\"John P. Grady\",\"Christine E. Napier\",\"Subotheni Thavaneswaran\",\"Mandy L. Ballinger\",\"David M. Thomas\",\"Frank P. Lin\"]","published":"2025-08-21T01:12:39Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
