Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo

Accepted at the International Conference on Learning Representations (ICLR)

January 2026

Abstract

Neural scaling laws have driven the field's exponential growth in parameters, data, and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, scaling on generative benchmarks such as mathematical problem-solving or software engineering remains under-explored. We propose and evaluate three different pretraining scaling laws for fitting pass-at-\(k\) on generative evaluations and for predicting the pass-at-\(k\) of the most expensive model using cheaper models. The three scaling laws differ in their covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions.

Summary

Three scaling laws for predicting generative evaluation performance. Key finding: the gold-reference-likelihood law stays stable when fit on models up to ~5 orders of magnitude cheaper than the target.

Scaling laws for pretraining loss and discriminative benchmarks are well established. But what about generative tasks like math problem-solving? We propose and rigorously evaluate three scaling laws for predicting pass-at-\(k\) on generative benchmarks (GSM8K, MATH).

1/8: Compute Scaling Law

Our first scaling law fits pass-at-\(k\) as a function of pretraining compute:

\[-\log(\mathrm{pass}@k) = E_0(k) + C_0(k) \cdot C^{-\alpha(k)}\]

[Figure: compute scaling law fits]
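
To make the fitting procedure concrete, here is a minimal sketch of fitting this functional form with scipy for a single fixed \(k\). The data, normalization constant, and initial guesses are illustrative placeholders, not values from the paper.

```python
# Minimal sketch (not the paper's code): fit -log(pass@k) = E0 + C0 * C^(-alpha)
# for one fixed k. All data below is synthetic and purely illustrative.
import numpy as np
from scipy.optimize import curve_fit

C_REF = 1e21  # normalize compute for numerical stability

def neg_log_pass_at_k(compute, e0, c0, alpha):
    """Compute scaling law: E0 + C0 * (C / C_REF)^(-alpha)."""
    return e0 + c0 * (compute / C_REF) ** (-alpha)

# Hypothetical (pretraining FLOPs, measured pass@k) pairs for one value of k.
compute = np.array([1e19, 3e19, 1e20, 1e21, 1e22, 1e23])
pass_at_k = np.array([0.002, 0.009, 0.031, 0.135, 0.310, 0.500])

y = -np.log(pass_at_k)
(e0, c0, alpha), _ = curve_fit(
    neg_log_pass_at_k, compute, y,
    p0=[0.1, 2.0, 0.2],  # rough initial guesses
    bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 2.0]),
    maxfev=10_000,
)
print(f"E0(k)={e0:.3f}  C0(k)={c0:.3f}  alpha(k)={alpha:.3f}")
```

Refitting at each value of \(k\) traces out how \(E_0(k)\), \(C_0(k)\), and \(\alpha(k)\) vary with the number of attempts, which is the subject of the next point.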

2/8: k as a Control Lever

A key discovery: the number of attempts \(k\) isn’t just a metric parameter—it’s a control lever that shapes the entire scaling law.

As \(k\) increases, the irreducible error \(E_0(k)\) vanishes and the scaling exponent \(\alpha(k)\) steepens from ~0.12 to ~0.38.

[Figure: fitted scaling-law parameters vs. \(k\)]
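
Because \(k\) only enters through the evaluation metric, the same \(n\) samples per problem can be re-scored at many values of \(k\) and the law refit at each one. Below is a generic sketch of the standard unbiased pass-at-\(k\) estimator commonly used for code and math benchmarks; it is not code from the paper.

```python
# Generic sketch of the standard unbiased pass@k estimator: given n samples per
# problem of which c are correct, estimate the probability that at least one of
# k randomly chosen samples is correct. Not code from the paper.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct (requires n >= k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative: one problem, 100 samples, 7 correct, scored at several values of k.
for k in (1, 4, 16, 64):
    print(f"pass@{k} = {pass_at_k(n=100, c=7, k=k):.3f}")
```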

3/8: Parameters + Tokens Law

Our second scaling law decomposes compute into parameters \(N\) and tokens \(D\):

\[-\log(\mathrm{pass}@k) = \mathcal{E}_0(k) + N_0(k) \cdot N^{-\beta(k)} + D_0(k) \cdot D^{-\gamma(k)}\]

This yields tighter in-range fits but similar predictive performance.

[Figure: parameters + tokens scaling law fits]
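
A hedged sketch of how such a two-covariate fit could be set up with scipy, again for a single fixed \(k\); the normalization constants and synthetic data points are placeholders, not values from the paper.

```python
# Illustrative sketch (not the paper's code): fit the two-covariate law
# -log(pass@k) = E0 + N0 * N^(-beta) + D0 * D^(-gamma) for one fixed k.
import numpy as np
from scipy.optimize import curve_fit

N_REF, D_REF = 1e9, 2e11  # normalization constants for numerical stability

def law(X, e0, n0, beta, d0, gamma):
    n, d = X
    return e0 + n0 * (n / N_REF) ** (-beta) + d0 * (d / D_REF) ** (-gamma)

# Hypothetical (parameters, tokens, pass@k) triples for one value of k.
params = np.array([1e8, 4e8, 1e9, 4e9, 1e10, 2e10])
tokens = np.array([2e10, 8e10, 2e11, 8e11, 2e12, 4e12])
pass_at_k = np.array([0.007, 0.044, 0.100, 0.232, 0.340, 0.420])

y = -np.log(pass_at_k)
popt, _ = curve_fit(
    law, (params, tokens), y,
    p0=[0.1, 1.0, 0.3, 1.0, 0.3],
    bounds=(0.0, np.inf),
    maxfev=20_000,
)
e0, n0, beta, d0, gamma = popt
print(f"E0={e0:.2f}  N0={n0:.2f}  beta={beta:.2f}  D0={d0:.2f}  gamma={gamma:.2f}")
```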

4/8: Gold Reference Likelihood Law

Our third scaling law uses gold reference likelihoods—how likely the model thinks the ground-truth solution is:

\[-\log(\mathrm{pass}@k) = \xi_0(k) + K_0(k) \cdot [-\log(\mathrm{GoldProb})]^{\kappa(k)}\]

[Figure: gold-reference-likelihood scaling law fits]
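
The covariate here is the model's negative log likelihood of the ground-truth reference solution. A hedged sketch of how such a quantity could be computed with the Hugging Face transformers API follows; the checkpoint name and the simple prompt-plus-solution concatenation are placeholder assumptions, not the paper's exact evaluation pipeline.

```python
# Sketch of scoring a gold reference solution under a causal LM (assumptions:
# Hugging Face transformers, a placeholder checkpoint, and simple prompt||solution
# concatenation; this is not the paper's exact evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-1b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def gold_negative_log_likelihood(problem: str, gold_solution: str) -> float:
    """Total NLL of the gold solution tokens, conditioned on the problem statement."""
    # Assumes tokenizing the problem alone yields a prefix of the full tokenization.
    prompt_len = tokenizer(problem, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(problem + gold_solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, i] predicts token i + 1, so shift to align predictions with targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lls = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only the log likelihoods of the solution tokens.
    return -token_lls[prompt_len - 1:].sum().item()
```

Aggregated over a benchmark, this per-problem NLL plays the role of \(-\log(\mathrm{GoldProb})\) in the law above.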

5/8: Backtesting Methodology

How predictive are these laws? We backtest: fit on cheap models, predict expensive models.

The compute and params+tokens laws require models within ~2 orders of magnitude of the target.

[Figure: backtesting the compute scaling law]
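
A schematic version of this backtest, in the same hedged spirit as the fitting sketches above: fit only on models below a compute cutoff, then compare the extrapolated prediction against the held-out, most expensive model. All numbers are synthetic.

```python
# Schematic backtest (synthetic data, illustrative only): fit the compute law on
# models cheaper than a cutoff and check the extrapolation to the largest model.
import numpy as np
from scipy.optimize import curve_fit

C_REF = 1e21

def law(compute, e0, c0, alpha):
    return e0 + c0 * (compute / C_REF) ** (-alpha)

compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21, 3e21, 1e22, 1e23])
neg_log_pass = np.array([5.20, 4.35, 3.40, 2.70, 2.00, 1.52, 1.17, 0.70])

target_c, target_y = compute[-1], neg_log_pass[-1]
for cutoff in (1e20, 1e21, 1e22):
    mask = compute <= cutoff
    popt, _ = curve_fit(law, compute[mask], neg_log_pass[mask],
                        p0=[0.1, 2.0, 0.2],
                        bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 2.0]),
                        maxfev=10_000)
    gap = np.log10(target_c / cutoff)
    print(f"fit up to 1e{np.log10(cutoff):.0f} FLOPs "
          f"({gap:.0f} orders of magnitude below target): "
          f"predicted {law(target_c, *popt):.2f} vs observed {target_y:.2f}")
```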

6/8: Gold Reference Stability

The gold reference law is remarkably stable—parameters converge using models up to ~5 orders of magnitude cheaper than the target.

This suggests gold reference likelihoods provide a robust signal for long-range forecasting.

[Figure: backtesting the gold-reference-likelihood scaling law]
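
The corresponding check for the gold-reference law is to refit with progressively lower compute cutoffs and watch whether the fitted parameters settle down. A hedged sketch with synthetic numbers:

```python
# Illustrative stability check (synthetic data): refit the gold-reference law using
# only models below a compute cutoff and inspect whether the parameters converge.
import numpy as np
from scipy.optimize import curve_fit

def gold_law(gold_nll, xi0, k0, kappa):
    return xi0 + k0 * gold_nll ** kappa

# Synthetic triples: (gold-reference NLL, -log(pass@k), pretraining compute).
gold_nll = np.array([9.0, 7.5, 6.0, 4.8, 3.5, 2.4, 1.6])
neg_log_pass = np.array([4.6, 3.9, 3.1, 2.5, 1.9, 1.3, 0.9])
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22, 1e23, 1e24])

for cutoff in (1e20, 1e21, 1e22, 1e23):
    mask = compute <= cutoff
    popt, _ = curve_fit(gold_law, gold_nll[mask], neg_log_pass[mask],
                        p0=[0.1, 0.5, 1.0], maxfev=10_000)
    print(f"fit up to 1e{np.log10(cutoff):.0f} FLOPs: "
          f"xi0={popt[0]:.2f}  K0={popt[1]:.2f}  kappa={popt[2]:.2f}")
```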

7/8: Theoretical Connection

We prove a theoretical connection: the compute law is the compute-optimal envelope of the params+tokens law.

Deviating from the compute-optimal allocation by a ratio \(r\) introduces a multiplicative penalty \(\Phi(r) \geq 1\) on the reducible error. This quantifies exactly how much “effective compute” is lost by over- or under-training:

\[\frac{\text{Effective Compute}}{\text{Actual Compute}} = \Phi(r; \beta(k), \gamma(k))^{-1/\alpha(k)}\]
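
One way to see the envelope result, as a sketch under the standard approximation \(C \propto N D\) (an assumption of this sketch, not a statement from the summary above): hold compute fixed and minimize the params+tokens law over the allocation of \(N\) and \(D\). The stationarity condition balances the two reducible terms,

\[\beta(k)\, N_0(k)\, N^{-\beta(k)} = \gamma(k)\, D_0(k)\, D^{-\gamma(k)} \;\Longrightarrow\; N^{*} \propto C^{\frac{\gamma(k)}{\beta(k) + \gamma(k)}}, \quad D^{*} \propto C^{\frac{\beta(k)}{\beta(k) + \gamma(k)}},\]

and substituting the optimal allocation back in collapses both reducible terms into a single power of compute,

\[-\log(\mathrm{pass}@k) = \mathcal{E}_0(k) + \tilde{C}_0(k)\, C^{-\alpha(k)}, \qquad \alpha(k) = \frac{\beta(k)\,\gamma(k)}{\beta(k) + \gamma(k)},\]

which is exactly the form of the compute law in 1/8. Any fixed-compute allocation off this optimum leaves the reducible error larger by the factor \(\Phi(r)\) above.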

8/8: Key Takeaways

  • Pass-at-\(k\) follows predictable scaling laws for generative evaluations
  • \(k\) is a powerful lever that eliminates irreducible error and steepens scaling
  • Gold reference likelihoods are uniquely stable across 5 orders of magnitude
  • Compute scaling emerges as the optimal envelope of params+tokens scaling
  • The misallocation penalty precisely quantifies the cost of non-optimal training

See the full research page for more details.