{"ID":2828964,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13330","arxiv_id":"2512.13330","title":"FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models","abstract":"We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.","short_abstract":"We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice an...","url_abs":"https://arxiv.org/abs/2512.13330","url_pdf":"https://arxiv.org/pdf/2512.13330v1","authors":"[\"Joona Kytöniemi\",\"Jousia Piha\",\"Akseli Reunamo\",\"Fedor Vitiugin\",\"Farrokh Mehryary\",\"Sampo Pyysalo\"]","published":"2025-12-15T13:41:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":605917,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828964,"paper_url":"https://arxiv.org/abs/2512.13330","paper_title":"FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models","repo_url":"https://github.com/LumiOpen/lm-evaluation-harness","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":605918,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828964,"paper_url":"https://arxiv.org/abs/2512.13330","paper_title":"FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models","repo_url":"https://github.com/TurkuNLP/FIN-bench-v2","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}