{"ID":2854018,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16074","arxiv_id":"2510.16074","title":"Early-stopping for Transformer model training","abstract":"This work, based on Random Matrix Theory (RMT), introduces a novel early-stopping strategy for Transformer training dynamics. Utilizing the Power Law (PL) fit to tansformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. Empirically, we observe that the spectral density of the shallow self-attention matrix $V$ consistently evolves into a heavy-tailed distribution. Crucially, we propose two consistent and validation-set-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.","short_abstract":"This work, based on Random Matrix Theory (RMT), introduces a novel early-stopping strategy for Transformer training dynamics. Utilizing the Power Law (PL) fit to tansformer attention matrices as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergen...","url_abs":"https://arxiv.org/abs/2510.16074","url_pdf":"https://arxiv.org/pdf/2510.16074v2","authors":"[\"Jing He\",\"Hua Jiang\",\"Cheng Li\",\"Siqian Xin\",\"Shuzhen Yang\"]","published":"2025-10-17T09:55:46Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Transformer\",\"LoRA\"]","has_code":false}