{"ID":2870470,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19344","arxiv_id":"2509.19344","title":"Performance of Large Language Models in Answering Critical Care Medicine Questions","abstract":"Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.","short_abstract":"Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accura...","url_abs":"https://arxiv.org/abs/2509.19344","url_pdf":"https://arxiv.org/pdf/2509.19344v1","authors":"[\"Mahmoud Alwakeel\",\"Aditya Nagori\",\"An-Kwok Ian Wong\",\"Neal Chaisson\",\"Vijay Krishnamoorthy\",\"Rishikesan Kamaleswaran\"]","published":"2025-09-16T14:46:34Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
