{"ID":2878859,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17256","arxiv_id":"2508.17256","title":"Provable Generalization in Overparameterized Neural Nets","abstract":"Deep neural networks often contain far more parameters than training examples, yet they still manage to generalize well in practice. Classical complexity measures such as VC-dimension or PAC-Bayes bounds usually become vacuous in this overparameterized regime, offering little explanation for the empirical success of models like Transformers. In this work, I explore an alternative notion of capacity for attention-based models, based on the effective rank of their attention matrices. The intuition is that, although the parameter count is enormous, the functional dimensionality of attention is often much lower. I show that this quantity leads to a generalization bound whose dependence on sample size matches empirical scaling laws observed in large language models, up to logarithmic factors. While the analysis is not a complete theory of overparameterized learning, it provides evidence that spectral properties of attention, rather than raw parameter counts, may be the right lens for understanding why these models generalize.","short_abstract":"Deep neural networks often contain far more parameters than training examples, yet they still manage to generalize well in practice. Classical complexity measures such as VC-dimension or PAC-Bayes bounds usually become vacuous in this overparameterized regime, offering little explanation for the empirical success of mo...","url_abs":"https://arxiv.org/abs/2508.17256","url_pdf":"https://arxiv.org/pdf/2508.17256v1","authors":"[\"Aviral Dhingra\"]","published":"2025-08-24T08:46:31Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ML\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
