Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
Abstract
Predictable behavior from scaling AI systems is extremely desirable. While scaling laws are well established for pretraining loss, how particular downstream capabilities scale is significantly muddier. We identify a new factor for widely used multiple-choice QA benchmarks: downstream performance is computed from negative log likelihoods (NLLs) via a sequence of transformations that progressively degrade the statistical relationship between performance and scale.
Summary
Why predicting downstream capabilities from scale has remained elusive: the sequence of transformations to compute accuracy decorrelates performance from scale.
The Problem
Predictable behavior from scaling AI systems is extremely desirable. While scaling laws are well established for pretraining loss, how particular downstream capabilities scale is significantly muddier.
Why?

Our Discovery
We identify a new factor for widely used multiple-choice QA benchmarks (e.g., MMLU):
Downstream performance is computed from negative log likelihoods (NLLs) via a sequence of transformations that progressively degrade the statistical relationship between performance and scale.

The Mechanism
To compute Accuracy for a single benchmark sample, the NLLs are transformed into probabilities, then renormalized over the available choices, then thresholded.
This sequence of transformations decorrelates performance from scale.
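
The chain of transformations for one sample can be sketched as follows. This is a minimal illustration, not the evaluation code behind the results: the function name score_sample, the dictionary keys, and the example NLL values are all hypothetical, and the sketch assumes natural-log NLLs and that "thresholded" means scoring 1 when the correct choice has the highest renormalized probability.

```python
import math

def score_sample(nlls, correct_idx):
    """Walk one multiple-choice sample through the NLL -> Accuracy chain.

    nlls: negative log likelihoods (natural log), one per available choice.
    correct_idx: index of the correct choice.
    All names here are hypothetical, chosen for illustration only.
    """
    # Step 1: NLL -> probability of each choice under the model.
    p_vocab = [math.exp(-nll) for nll in nlls]

    # Step 2: renormalize over only the available choices.
    total = sum(p_vocab)
    p_choices = [p / total for p in p_vocab]

    # Step 3: threshold. Score 1 if the correct choice has the most mass.
    best_idx = max(range(len(p_choices)), key=lambda i: p_choices[i])
    accuracy = 1.0 if best_idx == correct_idx else 0.0

    return {
        "log_p_correct": -nlls[correct_idx],
        "p_correct": p_vocab[correct_idx],
        "p_choices_correct": p_choices[correct_idx],
        "accuracy": accuracy,
    }

# Made-up NLLs for a four-choice question whose correct answer is index 1.
example = score_sample([2.0, 1.0, 3.0, 4.0], correct_idx=1)
```

Each step discards information: renormalizing makes the score depend on the other choices, and thresholding collapses a continuous quantity into 0 or 1.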

Evidence
In log probability space, we find that scores are highly correlated with compute.

But as we transform scores into probabilities, and then mask based on the incorrect choices, the correlations between scores and pretraining compute drop. These correlations fall further for Accuracy.
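
One way such a score-compute correlation could be measured is sketched below. It assumes a Spearman rank correlation computed for a single sample across a family of models, which is an assumption about the setup rather than a statement of the exact statistic used. The helper names and every number are made up for illustration and are not results.

```python
import math

def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy, made-up scores for one sample across five hypothetical models.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]       # pretraining compute
log_p_correct = [-3.0, -2.5, -2.1, -1.6, -1.2]  # improves smoothly
accuracy = [0.0, 1.0, 0.0, 0.0, 1.0]            # flips back and forth

corr_log_p = spearman(compute, log_p_correct)
corr_accuracy = spearman(compute, accuracy)
```

In this toy setup the log probability rises monotonically with compute while Accuracy flips between 0 and 1, so the rank correlation is much weaker for Accuracy.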

The Culprit: Incorrect Choices
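What is the mechanism that causes this degradation? The incorrect choices! Measuring performance on these benchmarks requires comparing the correct choice to the specific incorrect choices, so the score depends on how much probability the model places on each incorrect choice, not only on the correct one. That dependence wrecks predictability!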
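
A small sketch of why the incorrect choices matter: holding the probability of the correct choice fixed, different amounts of mass on the incorrect choices change both the renormalized probability and the thresholded outcome. The function names and all probabilities below are hypothetical, chosen only to illustrate the dependence.

```python
def p_choices_correct(p_correct, p_incorrect):
    """Probability of the correct choice after renormalizing over all choices."""
    return p_correct / (p_correct + sum(p_incorrect))

def is_correct(p_correct, p_incorrect):
    """True if the correct choice has strictly more mass than every incorrect one."""
    return all(p_correct > p for p in p_incorrect)

# Hypothetical: the same mass on the correct choice in both cases.
p_correct = 0.10

# Case A: all incorrect choices receive little mass.
case_a = [0.02, 0.03, 0.01]
# Case B: one incorrect choice receives a lot of mass.
case_b = [0.02, 0.30, 0.01]

result_a = (p_choices_correct(p_correct, case_a), is_correct(p_correct, case_a))
result_b = (p_choices_correct(p_correct, case_b), is_correct(p_correct, case_b))
```

Case A is scored as correct and case B as incorrect even though the model assigns the correct choice identical probability in both, so knowing how the correct-choice probability scales is not enough to predict Accuracy.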

What Models Do With Scale
What do models do for incorrect choices with increasing scale? We find that probability mass increasingly concentrates on both the correct AND incorrect choices, although the variance is quite high (several orders of magnitude!).
This is why predicting downstream capabilities from scale has remained elusive!

See the full research page for more details.
