{"ID":2851136,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20460","arxiv_id":"2510.20460","title":"Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models","abstract":"Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.","short_abstract":"Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample...","url_abs":"https://arxiv.org/abs/2510.20460","url_pdf":"https://arxiv.org/pdf/2510.20460v1","authors":"[\"Christian Hobelsberger\",\"Theresa Winner\",\"Andreas Nawroth\",\"Oliver Mitevski\",\"Anna-Carolina Haensch\"]","published":"2025-10-23T11:50:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"stat.AP\",\"stat.ME\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}