Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

arXiv preprint, under review

June 2024

Abstract

Predictable behavior from scaling AI systems is extremely desirable. While scaling laws for pretraining loss are well established, how particular downstream capabilities scale is significantly muddier. We identify a new factor for widely used multiple-choice QA benchmarks: downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale.

Summary

Predicting downstream capabilities from scale has remained elusive because the sequence of transformations used to compute accuracy decorrelates performance from scale.

The Problem

Predictable behavior from scaling AI systems is extremely desirable. While scaling laws are well established for pretraining loss, how particular downstream capabilities scale is significantly muddier.

Why?

Scaling vs downstream capabilities

Our Discovery

We identify a new factor for widely used multiple-choice QA benchmarks (e.g., MMLU):

Downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale.

Transformation pipeline

The Mechanism

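To compute Accuracy for a single benchmark sample, negative log likelihoods (NLLs) are first transformed into probabilities, then renormalized over the available choices, and finally thresholded: the sample scores 1 if the correct choice receives the highest probability and 0 otherwise.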

This sequence of transformations decorrelates performance from scale.
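
The three steps in this section can be sketched for a single sample as follows. This is a minimal illustration, not the authors' evaluation code: the function name `score_pipeline`, its arguments, and the example NLL values are all hypothetical, and ties are broken in favor of the earliest choice as a simplifying assumption.

```python
import math


def score_pipeline(nll_per_choice, correct_index):
    """Follow one benchmark sample through the score transformations.

    nll_per_choice: hypothetical negative log likelihoods, one per
        available answer choice.
    correct_index: position of the correct choice in that list.
    """
    # Step 1: NLL -> probability (mass the model puts on each choice).
    p_vocab = [math.exp(-nll) for nll in nll_per_choice]

    # Step 2: renormalize over the available choices only.
    total = sum(p_vocab)
    p_choices = [p / total for p in p_vocab]

    # Step 3: threshold. The sample is correct only if the correct choice
    # has the highest probability (assumption: ties go to the earliest choice).
    best = max(range(len(p_choices)), key=lambda i: p_choices[i])
    accuracy = 1.0 if best == correct_index else 0.0

    return {
        "log_p_vocab_correct": -nll_per_choice[correct_index],
        "p_vocab_correct": p_vocab[correct_index],
        "p_choices_correct": p_choices[correct_index],
        "accuracy": accuracy,
    }


# Made-up NLLs for a four-choice question; the correct choice is index 0.
result = score_pipeline([2.0, 1.5, 3.0, 4.0], correct_index=0)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Each successive score depends more heavily on the incorrect choices: the first two use only the correct choice's NLL, the third divides by the mass on all choices, and the last depends only on which choice ranks first.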

Score transformations

Evidence

In log probability space, we find that scores are highly correlated with compute.

High correlation in log space

But as we transform scores into probabilities, and then mask based on the incorrect choices, the correlation between scores and pretraining compute drops. The correlation falls further for Accuracy.
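
One way to measure this kind of relationship is to correlate a sample's score with pretraining compute across a family of models. The sketch below uses Spearman rank correlation as one reasonable choice (an assumption of this sketch, as are all function names). The compute values and scores are invented purely to show the calculation and are not results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented values for one sample scored by four models of increasing compute.
compute_flops = [1e18, 1e19, 1e20, 1e21]
log_p_correct = [-3.0, -2.5, -2.0, -1.2]   # continuous score
accuracy = [0.0, 1.0, 0.0, 1.0]            # thresholded score

rho_log = spearman(compute_flops, log_p_correct)
rho_acc = spearman(compute_flops, accuracy)
print(f"log-probability vs compute: {rho_log:.3f}")
print(f"accuracy vs compute:        {rho_acc:.3f}")
```

In this toy case the continuous score tracks compute monotonically while the thresholded score does not, so its rank correlation is lower.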

Correlation degradation

The Culprit: Incorrect Choices

What is the mechanism that causes this degradation? The incorrect choices! Measuring performance on these benchmarks requires comparing the correct choice against the specific incorrect choices, which wrecks predictability!
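
A small worked example makes the dependence on incorrect choices concrete. In this sketch the probability on the correct choice is held fixed while the mass on the incorrect choices varies. The helper names and all probabilities are hypothetical values chosen for illustration.

```python
def p_choices_correct(p_correct, p_incorrect):
    """Probability of the correct choice after renormalizing over choices."""
    return p_correct / (p_correct + sum(p_incorrect))


def accuracy(p_correct, p_incorrect):
    """1.0 if the correct choice outranks every incorrect choice, else 0.0."""
    return 1.0 if p_correct > max(p_incorrect) else 0.0


# Same (made-up) probability on the correct choice in both scenarios.
p_correct = 0.10

# Scenario A: little mass on the incorrect choices.
incorrect_low = [0.01, 0.02, 0.03]
# Scenario B: more mass on the incorrect choices.
incorrect_high = [0.05, 0.20, 0.15]

for label, incorrect in [("low", incorrect_low), ("high", incorrect_high)]:
    renormalized = p_choices_correct(p_correct, incorrect)
    acc = accuracy(p_correct, incorrect)
    print(f"{label} incorrect mass: p_choices={renormalized:.3f}, accuracy={acc}")
```

Knowing how the correct choice's probability changes with scale is therefore not enough to predict Accuracy; the mass on each incorrect choice must be predicted as well.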

Incorrect choices analysis

What Models Do With Scale

What do models do for incorrect choices with increasing scale? We find that probability mass increasingly concentrates on both the correct AND incorrect choices, although the variance is quite high (several orders of magnitude!).

This is why predicting downstream capabilities from scale has remained elusive!

Probability concentration

See the full research page for more details.