{"ID":2862727,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26136","arxiv_id":"2509.26136","title":"CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models","abstract":"With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.","short_abstract":"With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-s...","url_abs":"https://arxiv.org/abs/2509.26136","url_pdf":"https://arxiv.org/pdf/2509.26136v2","authors":"[\"Paul Grundmann\",\"Dennis Fast\",\"Jan Frick\",\"Thomas Steffek\",\"Felix Gers\",\"Wolfgang Nejdl\",\"Alexander Löser\"]","published":"2025-09-30T11:56:53Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}