{"ID":2864466,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24086","arxiv_id":"2509.24086","title":"Do Repetitions Matter? Strengthening Reliability in LLM Evaluations","abstract":"LLM leaderboards often rely on single stochastic runs, but how many repetitions are required for reliable conclusions remains unclear. We re-evaluate eight state-of-the-art models on the AI4Math Benchmark with three independent runs per setting. Using mixed-effects logistic regression, domain-level marginal means, rank-instability analysis, and run-to-run reliability, we assess the value of additional repetitions. Our findings show that single-run leaderboards are brittle: 10/12 slices (83\\%) invert at least one pairwise rank relative to the three-run majority, despite a zero sign-flip rate for pairwise significance and moderate overall interclass correlation. Averaging runs yields modest SE shrinkage ($\\sim$5\\% from one to three) but large ranking gains; two runs remove $\\sim$83\\% of single-run inversions. We provide cost-aware guidance for practitioners: treat evaluation as an experiment, report uncertainty, and use $\\geq 2$ repetitions under stochastic decoding. These practices improve robustness while remaining feasible for small teams and help align model comparisons with real-world reliability.","short_abstract":"LLM leaderboards often rely on single stochastic runs, but how many repetitions are required for reliable conclusions remains unclear. We re-evaluate eight state-of-the-art models on the AI4Math Benchmark with three independent runs per setting. 
Using mixed-effects logistic regression, domain-level marginal means, rank...","url_abs":"https://arxiv.org/abs/2509.24086","url_pdf":"https://arxiv.org/pdf/2509.24086v1","authors":"[\"Miguel Angel Alvarado Gonzalez\",\"Michelle Bruno Hernandez\",\"Miguel Angel Peñaloza Perez\",\"Bruno Lopez Orozco\",\"Jesus Tadeo Cruz Soto\",\"Sandra Malagon\"]","published":"2025-09-28T21:45:20Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
