{"ID":2860315,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04212","arxiv_id":"2510.04212","title":"Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention","abstract":"The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.","short_abstract":"The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training wit...","url_abs":"https://arxiv.org/abs/2510.04212","url_pdf":"https://arxiv.org/pdf/2510.04212v3","authors":"[\"Haiquan Qiu\",\"Quanming Yao\"]","published":"2025-10-05T14:01:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":608714,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2860315,"paper_url":"https://arxiv.org/abs/2510.04212","paper_title":"Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention","repo_url":"https://github.com/ucker/why-low-precision-training-fails","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}