{"ID":2869065,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14498","arxiv_id":"2509.14498","title":"Data coarse graining can improve model performance","abstract":"Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining.' Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A 'high-pass' scheme--which filters out less relevant, lower-signal features--can help models generalize better. By contrast, a 'low-pass' scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.","short_abstract":"Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining...","url_abs":"https://arxiv.org/abs/2509.14498","url_pdf":"https://arxiv.org/pdf/2509.14498v1","authors":"[\"Alex Nguyen\",\"David J. Schwab\",\"Vudtiwat Ngampruetikorn\"]","published":"2025-09-18T00:17:01Z","proceeding":"cond-mat.stat-mech","tasks":"[\"cond-mat.stat-mech\",\"cond-mat.dis-nn\",\"cs.LG\",\"q-bio.NC\",\"stat.ML\"]","methods":"[]","has_code":false}
