Are Emergent Abilities of Language Models a Mirage?
Abstract
Recent work claims that large language models display emergent abilities: abilities not present in smaller-scale models but present in larger-scale models. What makes emergent abilities intriguing is twofold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale.
Summary
Emergent abilities in LLMs may be a mirage created by metric choice, not fundamental model behavior changes.
Recent work claims that LLMs display emergent abilities: abilities not present in smaller-scale models but present in larger-scale models. What makes emergent abilities intriguing is twofold: (1) their sharpness, transitioning seemingly instantaneously from not present to present, and (2) their unpredictability, appearing at seemingly unforeseeable model scales.
We ask whether emergent abilities might be better explained by the researcher's choice of metric than by fundamental changes in model behavior with scale.

Key Insight: Under nonlinear or discontinuous metrics (like exact-match accuracy), smooth, continuous improvements in model performance can appear as sharp, discontinuous "emergent" abilities. When the same outputs are scored with continuous metrics (like token-level accuracy or Brier score), the apparent emergence disappears and is replaced by smooth, predictable improvement.
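A minimal simulation (not from the paper) illustrates the mechanism. If per-token accuracy improves smoothly with scale, exact match over a sequence of L tokens behaves like accuracy^L, which stays near zero until per-token accuracy is very high and then rises sharply. The sequence length and the accuracy-vs-scale curve below are hypothetical, chosen only to make the shape visible.

```python
# Sketch (hypothetical numbers): a smooth per-token accuracy curve looks
# like sharp "emergence" when scored with exact-match accuracy.

sequence_length = 20  # hypothetical length of the target output

for log10_params in range(7, 13):  # models from 1e7 to 1e12 parameters
    # Hypothetical smooth improvement in per-token accuracy with scale.
    per_token_accuracy = 1 - 10 ** (-(log10_params - 6) / 4)
    # Exact match requires every token to be correct, so it scales as
    # accuracy**L: near zero for most scales, then a sharp rise.
    exact_match = per_token_accuracy ** sequence_length
    print(f"1e{log10_params} params: "
          f"per-token={per_token_accuracy:.3f}, exact-match={exact_match:.4f}")
```

The per-token column climbs gradually at every scale, while the exact-match column is effectively zero until the largest models, at which point it appears to "emerge."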
Implications:
- Claims of emergent abilities should be scrutinized for metric choice
- Smooth scaling laws may underlie seemingly unpredictable capabilities
- The field should prefer continuous metrics when possible
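The metric contrast above can be sketched directly. In this toy two-option setting (my construction, not the paper's code), the Brier score of the correct answer's probability changes smoothly as that probability improves, while a thresholded exact-match score jumps from 0 to 1 the moment the probability crosses 0.5.

```python
# Sketch (hypothetical setup): continuous vs. discontinuous scoring of
# the same smoothly improving model confidence.

def brier_score(prob_correct: float) -> float:
    """Brier score for one binary question where the correct answer has
    probability prob_correct: (p - 1)^2, lower is better."""
    return (prob_correct - 1.0) ** 2

def exact_match(prob_correct: float) -> int:
    """Thresholded score: 1 only if the correct answer is the argmax
    (probability above 0.5 in a two-option setting)."""
    return 1 if prob_correct > 0.5 else 0

# Hypothetical smooth improvement in the correct answer's probability.
for prob in [0.30, 0.40, 0.49, 0.51, 0.60, 0.70]:
    print(f"p={prob:.2f}  brier={brier_score(prob):.3f}  "
          f"exact_match={exact_match(prob)}")
```

The Brier column improves a little at each step, while the exact-match column flips discontinuously between 0.49 and 0.51: the same underlying improvement looks gradual under one metric and "emergent" under the other.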
