Comments on Shumailov et al. "AI models collapse when trained on recursively generated data." Nature 2024.


Yesterday, I tweeted that model collapse appears when researchers intentionally induce it in ways that don’t match what is done in practice
Let me explain using the Shumailov et al. @Nature 2024 paper’s methodology as an example
Paper: https://nature.com/articles/s41586-024-07566-y
🧵⬇️
1/N
Model collapse arose from asking: what happens when synthetic data from previous generative models enters the pretraining data supply used to train new generative models?
I like Shumailov et al.’s phrasing:
“What happens to GPT generations GPT-{n} as n increases?”
2/N
Let’s identify realistic pretraining conditions for frontier AI models to make sure we study the correct setting
Amount of data: 📈 Llama went from 1.4T tokens to 2T tokens to 15T tokens
Amount of chips: 📈 Llama went from 2k to 4k to 16k GPUs https://x.com/NamanGoyal21/status/1815819622525870223
3/N
Quality of data: 📈 models perform better, users mainly share higher-quality outputs, and pretraining data teams build better filters
Data accumulates: 📈 This may seem obvious, but synthetic data from GPT-4, Claude, Gemini, etc. is added to existing data (‼️)
4/N
To be clear, we’re concerned with what will happen to frontier AI models pretrained on web-scale datasets using industry best practices
Thus, we want to study settings w/:
More data over time
Data for training a new model contains (much of) the original data
5/N
Let’s now turn to the Nature 2024 paper on Model Collapse
The 1st (theoretical) result repeatedly studies:
- fitting a mean and covariance to the data
- sampling new data from Normal(fitted mean, fitted covariance)
Result: iterating this causes the fitted covariance to collapse to 0 (toy sketch below)
6/N
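Here is a minimal toy version of that setup; this is my own illustrative code, not the paper’s, done in 1-D for simplicity, and the sample size and generation count are arbitrary choices:

```python
# Toy sketch (mine, not the paper's code) of the iterative fit-and-resample
# procedure: each generation is a Gaussian fit to the PREVIOUS generation's
# samples only, and all earlier data is discarded.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100        # arbitrary choice; smaller samples drift faster
n_generations = 2_000  # arbitrary choice

# Start from "real" data drawn from a standard normal
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for _ in range(n_generations):
    mu, sigma = data.mean(), data.std()      # fit mean and (co)variance
    data = rng.normal(mu, sigma, n_samples)  # resample, REPLACING all prior data

print(f"variance after {n_generations} generations: {data.var():.3e}")
# Typically a tiny fraction of the true variance (1.0): the fit has collapsed.
```

Larger per-generation samples slow the drift down, but with full replacement the fitted variance still wanders toward 0 over enough generations.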
Note - at each iteration, all prior data is thrown away ‼️
The 2nd and 3rd (empirical) results both similarly assume that (all or most) data is thrown away
7/N
The 2nd (empirical) result:
throws away all data after each iteration (not what people do in practice ❌)
uses a fixed dataset size (again, not what is done in practice ❌)
8/N
The 3rd (empirical) result keeps a random 10% of the original data but replaces the other 90%, and otherwise uses the same unrealistic choices:
most data is still replaced ❌
dataset size is constant ❌
But note: even by keeping a little real data, they already see lower test perplexity!
9/N
What did our COLM 2024 paper do differently? We
KEEP the full original dataset ✅
ADD new synthetic data to the overall accumulating data ✅
If we do this, i.e., data accumulate (right), no model collapse ✅
If we don’t, i.e., data are replaced (left), model collapse ❌
10/N
Our main finding is that if synthetic data is added to real data, then model collapse is mitigated
We show this holds across domains (text, vision, molecular conformation) & models (transformers, VAEs, diffusion)
We also prove this analytically in linear regression (toy simulation below)
11/N
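For intuition, here is a toy accumulate-vs-replace comparison in linear regression. This is my own construction for illustration, not the exact setup or proof from our paper; the dimension, sample sizes, noise level, and generation count are arbitrary:

```python
# Each generation labels fresh inputs with the PREVIOUS generation's model
# plus noise, then refits. "replace" trains only on the newest synthetic
# batch; "accumulate" trains on the real data plus all synthetic batches.
import numpy as np

rng = np.random.default_rng(1)
d, n, noise, n_gens = 10, 200, 0.5, 30   # arbitrary illustrative choices
w_true = rng.normal(size=d)

def fit(X, y):
    # Ordinary least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

def run(accumulate):
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)              # real data
    X_all, y_all = X, y
    w_hat = fit(X_all, y_all)
    for _ in range(n_gens):
        X_new = rng.normal(size=(n, d))
        y_new = X_new @ w_hat + noise * rng.normal(size=n)   # synthetic labels
        if accumulate:
            X_all = np.vstack([X_all, X_new])
            y_all = np.concatenate([y_all, y_new])
        else:
            X_all, y_all = X_new, y_new                      # throw old data away
        w_hat = fit(X_all, y_all)
    return float(np.sum((w_hat - w_true) ** 2))              # squared parameter error

print("replace:   ", run(accumulate=False))
print("accumulate:", run(accumulate=True))
```

Under replacement, each generation’s estimation error compounds on top of the last; under accumulation, the original data keeps anchoring the fit and the error stays close to the single-generation level.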
Let’s return to the theoretical result of Shumailov et al. 2024
What if the new synthetic data is instead added to existing data?
Last night, I ran new simulations in their exact setting
Data replaced -> Collapse ❌
Data accumulated -> Collapse avoided ✅ (sketch below)
12/N
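A minimal sketch of that comparison in the Gaussian setting; this is illustrative code of mine with arbitrary parameters, not the exact simulations referenced above:

```python
# Same Gaussian fit-and-resample loop as before, but now optionally with
# ACCUMULATION: each generation's samples are appended to everything seen
# so far before refitting.
import numpy as np

def final_variance(accumulate, n=100, n_gens=2_000, seed=0):
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, n)             # real data
    for _ in range(n_gens):
        mu, sigma = pool.mean(), pool.std()    # fit the current pool
        new = rng.normal(mu, sigma, n)         # sample a new generation
        pool = np.concatenate([pool, new]) if accumulate else new
    return float(pool.var())

print("replace:   ", final_variance(accumulate=False))  # drifts toward 0
print("accumulate:", final_variance(accumulate=True))   # stays close to 1
```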
People are worried that widespread public usage of generative models might destroy future models, e.g.,
https://scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/
Because of the Nature paper, we were repeatedly asked by journalists: “How can we (humanity) make sure to avoid model collapse?”
13/N
The question is flawed. It presumes model collapse is a real and significant threat under current best practices
Based on the evidence I’ve seen, it isn’t
If there are other methodological questions about this Nature paper, our COLM 2024 paper, or other papers, please ask!
14/N
@YangjunR @moinnadeem @sj_manning @gabemukobi
Hopefully this helps! Thanks for prompting me to put this together 🙏
If there are any remaining questions, please let me know!
15/N