{"ID":2879227,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17162","arxiv_id":"2508.17162","title":"Quantifying Language Disparities in Multilingual Large Language Models","abstract":"Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential--enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.","short_abstract":"Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realis...","url_abs":"https://arxiv.org/abs/2508.17162","url_pdf":"https://arxiv.org/pdf/2508.17162v1","authors":"[\"Songbo Hu\",\"Ivan Vulić\",\"Anna Korhonen\"]","published":"2025-08-23T23:25:38Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}