{"ID":2864424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24012","arxiv_id":"2509.24012","title":"Pretraining Scaling Laws for Generative Evaluations of Language Models","abstract":"Neural scaling laws have driven the field's ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models. Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute and parameters+tokens laws stabilize for only the last $\\mathord{\\sim}1.5\\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\\mathord{\\sim}5$ orders. Third, in terms of predictive performance, we find all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.","short_abstract":"Neural scaling laws have driven the field's ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We p...","url_abs":"https://arxiv.org/abs/2509.24012","url_pdf":"https://arxiv.org/pdf/2509.24012v2","authors":"[\"Rylan Schaeffer\",\"Noam Levi\",\"Brando Miranda\",\"Sanmi Koyejo\"]","published":"2025-09-28T18:04:18Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}