Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Abstract
What happens when generative models are trained on their own outputs? Prior work foretold a catastrophic feedback loop. We show that if data accumulate across model-fitting iterations rather than being replaced, model collapse can be avoided.
Summary
Model collapse is avoidable: accumulating synthetic data across iterations prevents degradation, unlike replacement.
What happens when generative models are trained on their own outputs?
Prior work foretold a catastrophic feedback loop, a curse of recursion, a descent into madness as models consume their own outputs. Are we poisoning the very data needed to train future models?

Key Question: Many prior works consider training models solely on data generated by the preceding model (data are replaced at each iteration). Replacing data leads to collapse, but isn’t done in practice. What happens if data instead accumulate across iterations?
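
The two data regimes can be made concrete with a toy sketch. This is not the experimental setup used for the results below: the "generative model" here is assumed to be a one-dimensional Gaussian fitted by its sample mean and standard deviation, and the names `fit_gaussian`, `sample_gaussian`, and `run_iterations` are hypothetical. Only the handling of the training set (replace versus accumulate) mirrors the question being asked.

```python
import numpy as np

def fit_gaussian(data):
    """'Train' a toy generative model: estimate (mean, std) from the data."""
    return data.mean(), data.std(ddof=1)

def sample_gaussian(model, n, rng):
    """Draw n synthetic points from a fitted (mean, std) model."""
    mean, std = model
    return rng.normal(mean, std, size=n)

def run_iterations(real_data, n_iters, accumulate, rng):
    """Fit a sequence of models; return the fitted std at each iteration.

    accumulate=False: each model sees only the previous model's samples.
    accumulate=True:  each model sees the real data plus all synthetic
                      data generated so far.
    """
    n = len(real_data)
    train = real_data.copy()
    stds = []
    for _ in range(n_iters):
        model = fit_gaussian(train)
        stds.append(model[1])
        synthetic = sample_gaussian(model, n, rng)
        if accumulate:
            train = np.concatenate([train, synthetic])
        else:
            train = synthetic
    return np.array(stds)

# Toy demonstration (assumed settings: 20 real points, 1000 iterations).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=20)
replaced = run_iterations(real, 1000, accumulate=False, rng=rng)
accumulated = run_iterations(real, 1000, accumulate=True, rng=rng)
print("final fitted std, replace:   ", replaced[-1])
print("final fitted std, accumulate:", accumulated[-1])
```

In this toy setting the fitted standard deviation under replacement drifts toward zero, since each refit loses a little variance and nothing anchors the estimate, while under accumulation the real data stay in the training set and the estimate remains close to its initial value.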

Results on Language Models:
We pretrain sequences of transformer-based language models on TinyStories:
- If data are replaced at each iteration → models worsen over time (collapse!)
- If data accumulate → collapse is avoided!

Results Generalize:
The same results hold for VAEs trained on images and for diffusion models trained for molecular conformation generation. In both cases, replacing data leads to collapse, while accumulating data avoids it.

Theoretical Analysis:
In an analytically tractable setting (linear models), we prove:
- If data are replaced: test loss grows linearly with the number of iterations
- If data accumulate: test loss is upper bounded by a small constant, independent of the number of iterations

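The contrast between the two regimes can be checked numerically with a small simulation. The sketch below is an illustration under stated assumptions rather than the exact setting of the proof: features are assumed to be isotropic Gaussian, labels are linear plus Gaussian noise, each model is fitted by ordinary least squares, and test loss is measured as the squared distance between the fitted and true weights. The function name `simulate` and the default values of `T`, `d`, and `sigma` are hypothetical choices.

```python
import numpy as np

def simulate(n_iters, accumulate, T=200, d=10, sigma=0.5, seed=0):
    """Iteratively refit linear regression on model-generated labels.

    Returns the excess test loss ||w_hat - w_true||^2 of each fitted model.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)

    # Iteration 1 trains on real data: labels come from the true weights.
    X = rng.normal(size=(T, d))
    y = X @ w_true + sigma * rng.normal(size=T)

    losses = []
    for _ in range(n_iters):
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        losses.append(float(np.sum((w_hat - w_true) ** 2)))

        # The fitted model labels a fresh batch of T synthetic examples.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)

        if accumulate:
            X = np.vstack([X, X_new])
            y = np.concatenate([y, y_new])
        else:
            X, y = X_new, y_new
    return np.array(losses)

# Average over a few seeds to smooth out run-to-run noise.
seeds = range(20)
replace_loss = np.mean([simulate(50, False, seed=s) for s in seeds], axis=0)
accum_loss = np.mean([simulate(50, True, seed=s) for s in seeds], axis=0)
for i in (0, 9, 24, 49):
    print(f"iter {i + 1:2d}  replace={replace_loss[i]:.4f}  accumulate={accum_loss[i]:.4f}")
```

Averaged over seeds, the replace curve should rise roughly in proportion to the iteration count, while the accumulate curve should flatten out near its first-iteration value, which is the qualitative behavior the two bullet points describe.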
Implication: Model collapse may be avoided even in a pessimistic future where synthetic data are uncontrollably dumped on the internet.
