{"ID":2892245,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15581","arxiv_id":"2507.15581","title":"Metric assessment protocol in the context of answer fluctuation on MCQ tasks","abstract":"Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and the answer changing, even when computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association on the protocol.","short_abstract":"Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce differ...","url_abs":"https://arxiv.org/abs/2507.15581","url_pdf":"https://arxiv.org/pdf/2507.15581v1","authors":"[\"Ekaterina Goliakova\",\"Xavier Renard\",\"Marie-Jeanne Lesot\",\"Thibault Laugel\",\"Christophe Marsala\",\"Marcin Detyniecki\"]","published":"2025-07-21T13:01:46Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}