{"ID":2825920,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20562","arxiv_id":"2512.20562","title":"Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention","abstract":"We study the problem of learning a low-degree spherical polynomial of degree $\\ell_0 = Θ(1) \\ge 1$ defined on the unit sphere in $\\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\\eps \\in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \\asymp Θ(d^{\\ell_0}/\\eps)$ with high probability, in contrast with the representative sample complexity $Θ\\pth{d^{\\ell_0} \\max\\set{\\eps^{-2},\\log d}}$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $Θ(d^{\\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $Θ(d^{\\ell_0})$ is $Θ(d^{\\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\\ell_0$, from $L \\ge \\ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. \nTo the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.","short_abstract":"We study the problem of learning a low-degree spherical polynomial of degree $\\ell_0 = Θ(1) \\ge 1$ defined on the unit sphere in $\\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such...","url_abs":"https://arxiv.org/abs/2512.20562","url_pdf":"https://arxiv.org/pdf/2512.20562v2","authors":"[\"Yingzhen Yang\"]","published":"2025-12-23T18:05:55Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\",\"math.OC\"]","methods":"[]","has_code":false}