{"ID":2856528,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11789","arxiv_id":"2510.11789","title":"Minimax Rates for Learning Pairwise Interactions in Attention-Style Models","abstract":"We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\\frac{2β}{2β+1}}$, where $M$ is the sample size and $β$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \\le (M/\\log M)^{\\frac{1}{2β+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of attention mechanisms and guidance on training.","short_abstract":"We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\\frac{2β}{2β+1}}$, where $M$ is the sample size and $β$ is the Hölder smoothness of the activa...","url_abs":"https://arxiv.org/abs/2510.11789","url_pdf":"https://arxiv.org/pdf/2510.11789v2","authors":"[\"Shai Zucker\",\"Xiong Wang\",\"Fei Lu\",\"Inbar Seroussi\"]","published":"2025-10-13T18:00:04Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\",\"math.PR\",\"math.ST\"]","methods":"[]","has_code":false}