Rylan Schaeffer


Are Emergent Abilities of Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

Advances in Neural Information Processing Systems (NeurIPS). Accepted, Outstanding Paper.

December 2023

Abstract

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale.

Summary

Emergent abilities in LLMs may be a mirage created by metric choice, not fundamental model behavior changes.

Recent work claims that LLMs display emergent abilities: abilities absent in smaller-scale models but present in larger-scale ones. Two properties make emergent abilities intriguing: (1) sharpness, a seemingly instantaneous transition from not present to present, and (2) unpredictability, appearing at seemingly unforeseeable model scales.

We ask whether emergent abilities might be better explained by the researcher’s choice of metric rather than fundamental changes in model behavior with scale.

Main Figure

Key Insight: When using nonlinear or discontinuous metrics (like exact-match accuracy), smooth, continuous improvements in model performance can appear as sharp, discontinuous “emergent” abilities. By changing to continuous metrics (like token-level accuracy or Brier score), the apparent emergence disappears and is replaced by smooth, predictable improvement.
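The key insight can be illustrated with a small numerical sketch. It assumes, purely for illustration, that per-token accuracy rises smoothly (here a logistic curve in log10 of parameter count) and that tokens are scored independently, so exact-match accuracy on a target of L tokens is roughly the per-token accuracy raised to the power L. The curve shape, the midpoint of 8.0, the target length of 10, and all function names are hypothetical choices, not values from the paper.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth scaling curve: logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 8.0)))


def exact_match_accuracy(token_accuracy: float, sequence_length: int) -> float:
    """Exact match requires every token to be right.

    Assuming independent per-token errors, that probability is p ** L.
    """
    return token_accuracy ** sequence_length


SEQUENCE_LENGTH = 10  # assumed target length in tokens
scales = [7.0, 8.0, 9.0, 10.0, 11.0]  # log10 of parameter count (illustrative)

rows = []
for scale in scales:
    p = per_token_accuracy(scale)
    rows.append((scale, p, exact_match_accuracy(p, SEQUENCE_LENGTH)))

print("log10(params)  token_acc  exact_match")
for scale, p, em in rows:
    print(f"{scale:13.1f}  {p:9.3f}  {em:11.4f}")
```

In this sketch the same underlying model outputs are scored twice. The per-token column changes gradually across scales, while the exact-match column stays near zero for the smaller scales and then rises quickly, which is the pattern that can be read as "emergence" when only the nonlinear metric is plotted.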

Implications:

  1. Claims of emergent abilities should be scrutinized for metric choice
  2. Smooth scaling laws may underlie seemingly unpredictable capabilities
  3. The field should prefer continuous metrics when possible
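
The preference for continuous metrics in the third implication can be sketched on a single two-option multiple-choice question. The example below assumes a hypothetical model whose probability on the correct option rises gradually, and compares a discontinuous grade (1 if the highest-probability option is correct, else 0) with the multi-class Brier score (sum of squared differences between predicted probabilities and the one-hot answer, lower is better). The probability values and function names are illustrative assumptions.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], correct_index: int) -> float:
    """Multi-class Brier score for one question (0 is perfect, lower is better)."""
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def multiple_choice_grade(probs: Sequence[float], correct_index: int) -> float:
    """Discontinuous metric: 1.0 if the argmax option is correct, else 0.0."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return 1.0 if predicted == correct_index else 0.0


# Hypothetical probabilities assigned to the correct option as scale grows.
correct_probs = [0.30, 0.40, 0.45, 0.55, 0.60, 0.70]

results = []
for q in correct_probs:
    probs = [q, 1.0 - q]  # option 0 is the correct answer
    results.append((q, multiple_choice_grade(probs, 0), brier_score(probs, 0)))

print("p(correct)  grade  brier")
for q, grade, brier in results:
    print(f"{q:10.2f}  {grade:5.1f}  {brier:5.3f}")
```

Here the grade flips from 0 to 1 as the probability crosses 0.5, while the Brier score decreases steadily over the whole range, so the continuous metric reflects the gradual improvement that the thresholded metric hides.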