{"ID":2857529,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09295","arxiv_id":"2510.09295","title":"MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics","abstract":"Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: \\textit{Parameter Instability} from training stochasticity and \\textit{Evaluation Instability} from noisy measurement protocols. To counteract both sources of noise, we introduce \\textbf{MaP}, a dual-pronged framework that synergistically integrates checkpoint \\underline{M}erging \\underline{a}nd the \\underline{P}ass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.","short_abstract":"Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: \\textit{Parame...","url_abs":"https://arxiv.org/abs/2510.09295","url_pdf":"https://arxiv.org/pdf/2510.09295v2","authors":"[\"Jiapeng Wang\",\"Changxin Tian\",\"Kunlong Chen\",\"Ziqi Liu\",\"Jiaxin Mao\",\"Wayne Xin Zhao\",\"Zhiqiang Zhang\",\"Jun Zhou\"]","published":"2025-10-10T11:40:27Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
