Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Tomasz Korbak, Rajashree Agrawal, Henry Sleight, John Hughes, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

arXiv preprint, under review

April 2024

Abstract

What happens when generative models are trained on their own outputs? Prior work foretold a catastrophic feedback loop. We show that if data accumulate across model-fitting iterations rather than being replaced, model collapse can be avoided.

Summary

Model collapse is avoidable: accumulating synthetic data alongside the original real data across iterations prevents degradation, whereas replacing the data at each iteration does not.

What happens when generative models are trained on their own outputs?

Prior work foretold a catastrophic feedback loop, a curse of recursion in which models descend into madness as they consume their own outputs. Are we poisoning the very data necessary to train future models?

Key Question: Many prior works consider training each model solely on data generated by the preceding model (data are replaced at each iteration). Replacing data leads to collapse, but it isn’t what is done in practice. What happens if data instead accumulate across iterations?
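The two data-handling protocols contrasted above can be written down as a single loop. The following is a minimal sketch, not the authors' code: `run_iterations`, `fit_gaussian`, and `make_sampler` are hypothetical names chosen for illustration, and a 1D Gaussian stands in for the generative model. It only records which training set each iteration sees, which is where the two protocols differ.

```python
import random
import statistics


def run_iterations(real_data, fit, sample, n_iters, accumulate):
    """Return the training set used at each model-fitting iteration.

    fit(train) -> model and sample(model, k) -> list of k points are
    caller-supplied (hypothetical) callables.
    """
    train = list(real_data)
    history = []
    for _ in range(n_iters):
        history.append(list(train))
        model = fit(train)
        synthetic = sample(model, len(real_data))
        if accumulate:
            # Accumulate: keep everything seen so far and add the new samples.
            train = train + synthetic
        else:
            # Replace: discard the old data, keep only the newest samples.
            train = synthetic
    return history


def fit_gaussian(data):
    """Toy 'generative model': the sample mean and standard deviation."""
    return statistics.fmean(data), statistics.stdev(data)


def make_sampler(rng):
    """Build a sample(model, k) function that draws from the fitted Gaussian."""
    def sample(model, k):
        mu, sd = model
        return [rng.gauss(mu, sd) for _ in range(k)]
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    real = [rng.gauss(0.0, 1.0) for _ in range(50)]
    for accumulate in (False, True):
        hist = run_iterations(real, fit_gaussian, make_sampler(rng), 5, accumulate)
        print("accumulate" if accumulate else "replace", [len(t) for t in hist])
```

Under replacement, the real data are gone after the first iteration. Under accumulation, the training set grows at every iteration and the real data remain a part of it.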

Results on Language Models:

We pretrain sequences of transformer-based language models on TinyStories:

If data are replaced at each iteration → models worsen over time (collapse!)
If data accumulate → collapse is avoided!

Results Generalize:

The same results hold for variational autoencoders (VAEs) trained on images and for diffusion models trained for molecular conformation generation. Replacing data leads to collapse, but accumulating data avoids collapse.

Theoretical Analysis:

In an analytically tractable setting (linear models), we prove:

If data are replaced: test loss grows linearly with the number of iterations
If data accumulate: test loss is bounded above by a constant that does not depend on the number of iterations
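The contrast between the two regimes can be checked numerically. The sketch below is an illustrative simulation under assumptions of our own choosing, not the paper's experimental code: ordinary least squares on isotropic Gaussian features, fresh features at every iteration, labels produced by the previous fit plus Gaussian noise, and the squared parameter error used as a proxy for test loss. The function names and all default values (`T`, `d`, `sigma`, number of trials) are hypothetical.

```python
import numpy as np


def simulate(n_iters=20, T=100, d=10, sigma=1.0, accumulate=False, seed=0):
    """Return the squared parameter error of each successive fit."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    # Iteration 1 trains on real data drawn from the true linear model.
    X_train = rng.normal(size=(T, d))
    y_train = X_train @ w_true + sigma * rng.normal(size=T)
    errors = []
    for _ in range(n_iters):
        w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        errors.append(float(np.sum((w_hat - w_true) ** 2)))
        # Synthetic data: fresh features, labels from the fitted model.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)
        if accumulate:
            X_train = np.vstack([X_train, X_new])
            y_train = np.concatenate([y_train, y_new])
        else:
            X_train, y_train = X_new, y_new
    return errors


def mean_errors(accumulate, trials=50, **kwargs):
    """Average the per-iteration error over independent seeds."""
    runs = [simulate(accumulate=accumulate, seed=s, **kwargs) for s in range(trials)]
    return np.mean(runs, axis=0)


if __name__ == "__main__":
    replace_err = mean_errors(accumulate=False)
    accumulate_err = mean_errors(accumulate=True)
    for n in (1, 5, 10, 20):
        print(n, round(replace_err[n - 1], 3), round(accumulate_err[n - 1], 3))
```

In this toy setup, the averaged error under replacement should keep rising with the iteration count, while under accumulation it should level off. The exact values depend on the assumed `T`, `d`, and `sigma`.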

Implication: Model collapse may be avoided even in a pessimistic future where synthetic data are uncontrollably dumped on the internet.