{"ID":2874784,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.04653","arxiv_id":"2509.04653","title":"Deriving Transformer Architectures as Implicit Multinomial Regression","abstract":"While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields solutions that align with the dynamics induced on features by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.","short_abstract":"While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent f...","url_abs":"https://arxiv.org/abs/2509.04653","url_pdf":"https://arxiv.org/pdf/2509.04653v2","authors":"[\"Jonas A. Actor\",\"Anthony Gruber\",\"Eric C. Cyr\"]","published":"2025-09-04T20:40:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"math.NA\"]","methods":"[\"Transformer\"]","has_code":false}