Pretraining Scaling Laws for Generative Evaluations of Language Models
Abstract
Neural scaling laws have driven the field's exponential growth in parameters, data, and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving and software engineering remain under-explored. We propose and evaluate three pretraining scaling laws for fitting pass-at-k on generative evaluations and for predicting the pass-at-k of the most expensive model from cheaper models. The three laws differ in their covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions.
Summary
Three scaling laws for predicting generative evaluation performance. Key finding: the gold-reference-likelihood law remains stable when fit on models up to ~5 orders of magnitude cheaper than the target.
Scaling laws for pretraining loss and discriminative benchmarks are well established. But what about generative tasks like math problem-solving? We propose and rigorously evaluate three scaling laws for predicting pass-at-\(k\) on generative benchmarks (GSM8K, MATH).
1/8: Compute Scaling Law
Our first scaling law fits pass-at-\(k\) as a function of pretraining compute:
\[-\log(\mathrm{pass}@k) = E_0(k) + C_0(k) \cdot C^{-\alpha(k)}\]
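To make this concrete, here is a minimal fitting sketch in Python; the synthetic data, the unit choice (compute in multiples of 1e21 FLOPs), and the use of SciPy's curve_fit are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

def neg_log_pass_at_k(compute, e0, c0, alpha):
    """Compute law: -log(pass@k) = E0(k) + C0(k) * C^(-alpha(k))."""
    return e0 + c0 * compute ** (-alpha)

# Illustrative synthetic data: compute measured in units of 1e21 FLOPs,
# generated from arbitrary "true" coefficients plus a little noise.
rng = np.random.default_rng(0)
compute = np.logspace(-2, 2, 12)                       # ~1e19 .. 1e23 FLOPs
y_obs = neg_log_pass_at_k(compute, 0.3, 1.5, 0.25)
y_obs += rng.normal(scale=0.02, size=y_obs.shape)

(e0, c0, alpha), _ = curve_fit(neg_log_pass_at_k, compute, y_obs, p0=(0.1, 1.0, 0.2))
print(f"E0={e0:.2f}  C0={c0:.2f}  alpha={alpha:.2f}")  # recovers ~0.3, ~1.5, ~0.25
```
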
2/8: k as a Control Lever
A key discovery: the number of attempts \(k\) isn’t just a metric parameter—it’s a control lever that shapes the entire scaling law.
As \(k\) increases, the irreducible error vanishes and the scaling exponent steepens from ~0.12 to ~0.38.
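For context on how pass-at-\(k\) is measured in the first place: the standard unbiased estimator averages, over problems, the chance that at least one of \(k\) attempts drawn from \(n \geq k\) samples is correct. A minimal sketch, assuming this standard estimator (the paper's exact evaluation code may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts drawn
    without replacement from n samples is correct, given c correct samples."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 14 correct; sweep k to trace out the curves
for k in (1, 8, 64):
    print(k, round(pass_at_k(n=200, c=14, k=k), 3))
```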

3/8: Parameters + Tokens Law
Our second scaling law decomposes compute into parameters \(N\) and tokens \(D\):
\[-\log(\mathrm{pass}@k) = \mathcal{E}_0(k) + N_0(k) \cdot N^{-\beta(k)} + D_0(k) \cdot D^{-\gamma(k)}\]
This yields tighter in-range fits but similar predictive performance.
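A minimal fitting sketch analogous to the compute-law example above; the synthetic \(N\) (parameters, billions), \(D\) (tokens, billions), and coefficients are arbitrary placeholders:

```python
import numpy as np
from scipy.optimize import curve_fit

def neg_log_pass_at_k(nd, e0, n0, beta, d0, gamma):
    """Params+tokens law: -log(pass@k) = E0 + N0*N^(-beta) + D0*D^(-gamma)."""
    n, d = nd
    return e0 + n0 * n ** (-beta) + d0 * d ** (-gamma)

# Illustrative synthetic data: N in billions of parameters, D in billions of
# tokens, generated from arbitrary "true" coefficients plus a little noise.
rng = np.random.default_rng(0)
n = np.logspace(-1, 2, 16)                              # 0.1B .. 100B params
d = np.logspace(1.5, 3.5, 16) * rng.uniform(0.8, 1.25, size=16)
y_obs = neg_log_pass_at_k((n, d), 0.2, 1.2, 0.35, 2.0, 0.25)
y_obs += rng.normal(scale=0.02, size=y_obs.shape)

popt, _ = curve_fit(neg_log_pass_at_k, (n, d), y_obs,
                    p0=(0.1, 1.0, 0.3, 1.0, 0.3), maxfev=20_000)
print(dict(zip(["E0", "N0", "beta", "D0", "gamma"], np.round(popt, 3))))
```
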

4/8: Gold Reference Likelihood Law
Our third scaling law uses gold reference likelihoods—how likely the model thinks the ground-truth solution is:
\[-\log(\mathrm{pass}@k) = \xi_0(k) + K_0(k) \cdot [-\log(\mathrm{GoldProb})]^{\kappa(k)}\]
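As an illustration of the covariate itself, one way to compute a gold solution's summed negative log-likelihood with Hugging Face transformers is sketched below; the checkpoint name and prompt format are placeholders, and the paper's exact scoring setup may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def gold_neg_log_likelihood(problem: str, gold_solution: str) -> float:
    """Total -log p(gold_solution | problem) in nats, summed over solution tokens.
    Assumes the prompt's tokenization is a prefix of the full tokenization."""
    prompt_ids = tok(problem, return_tensors="pt").input_ids
    full_ids = tok(problem + gold_solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift targets by one position).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.numel()), targets]
    # Keep only the tokens belonging to the gold solution.
    return float(-token_lp[prompt_ids.shape[1] - 1:].sum())

nll = gold_neg_log_likelihood("Q: What is 2 + 2? A:", " 4")
# The law then models -log(pass@k) as xi0(k) + K0(k) * nll ** kappa(k).
```
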
5/8: Backtesting Methodology
How predictive are these laws? We backtest: fit on cheap models, predict expensive models.
The compute and params+tokens laws require fitting on models within ~2 orders of magnitude of the target's compute.
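A minimal sketch of this backtesting protocol under the compute law, on synthetic data with illustrative compute cutoffs (not the paper's model suite or thresholds):

```python
import numpy as np
from scipy.optimize import curve_fit

def compute_law(c, e0, c0, alpha):
    """-log(pass@k) = E0 + C0 * C^(-alpha), with C in units of 1e21 FLOPs."""
    return e0 + c0 * c ** (-alpha)

def backtest(compute, neg_log_pass, max_fit_compute):
    """Fit only on models cheaper than max_fit_compute, predict the priciest one."""
    cheap = compute <= max_fit_compute
    popt, _ = curve_fit(compute_law, compute[cheap], neg_log_pass[cheap],
                        p0=(0.1, 1.0, 0.2), maxfev=20_000)
    target = compute.argmax()
    return compute_law(compute[target], *popt), neg_log_pass[target]

# Illustrative synthetic data spanning five orders of magnitude of compute.
rng = np.random.default_rng(0)
compute = np.logspace(-3, 2, 20)
y = compute_law(compute, 0.3, 1.5, 0.25) + rng.normal(scale=0.02, size=20)

# Shrink the fitting window and watch the extrapolation error grow.
for cutoff in (10.0, 1.0, 0.1):
    pred, actual = backtest(compute, y, cutoff)
    print(f"fit on C <= {cutoff:5.1f}: predicted {pred:.2f}, actual {actual:.2f}")
```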

6/8: Gold Reference Stability
The gold reference law is remarkably stable—parameters converge using models up to ~5 orders of magnitude cheaper than the target.
This suggests gold reference likelihoods provide a robust signal for long-range forecasting.

7/8: Theoretical Connection
We prove a theoretical connection: the compute law is the compute-optimal envelope of the params+tokens law.
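As a sketch of why this holds (in the paper's notation, under the standard accounting assumption \(C \propto N \cdot D\), which is our addition here): minimizing the params+tokens law over how a fixed budget \(C\) is split between \(N\) and \(D\) gives
\[N^{*} \propto C^{\gamma(k)/(\beta(k)+\gamma(k))}, \qquad D^{*} \propto C^{\beta(k)/(\beta(k)+\gamma(k))},\]
and substituting back recovers the compute law on the optimal frontier,
\[-\log(\mathrm{pass}@k)\big|_{N^{*},D^{*}} = \mathcal{E}_0(k) + \tilde{C}_0(k) \cdot C^{-\alpha(k)}, \qquad \alpha(k) = \frac{\beta(k)\,\gamma(k)}{\beta(k)+\gamma(k)},\]
where \(\tilde{C}_0(k)\) collects the constants.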
Deviating from optimal allocation introduces a multiplicative penalty \(\Phi(r) \geq 1\). This quantifies exactly how much “effective compute” you lose by over/undertraining:
\[\frac{\text{Effective Compute}}{\text{Actual Compute}} = \Phi(r; \beta(k), \gamma(k))^{-1/\alpha(k)}\]

8/8: Key Takeaways
- Pass-at-\(k\) follows predictable scaling laws for generative evaluations
- \(k\) is a powerful lever that eliminates irreducible error and steepens scaling
- Gold reference likelihoods are uniquely stable across 5 orders of magnitude
- Compute scaling emerges as the optimal envelope of params+tokens scaling
- The misallocation penalty precisely quantifies the cost of non-optimal training
See the full research page for more details.
