Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle
Abstract
Why does double descent happen? This question has been studied for decades, but we provide the simplest possible explanation with the fewest assumptions. Using only linear regression and SVD, we identify 3 general interpretable factors and show all 3 are necessary for double descent to occur.
Summary
Identifying and ablating the sources of double descent using only linear regression and SVD - the simplest possible explanation.
The Puzzle
Why does double descent happen? This question has been studied for decades, but we wanted the simplest possible explanation with the fewest assumptions.

Our Approach
Using only linear regression and SVD, we identify 3 general interpretable factors and show all 3 are necessary.

Factor 1: Training Feature Variance
How much the training features X vary in each direction; more formally, the inverses of the non-zero singular values of X. A small singular value marks a direction with little training variance, so its inverse is large.
This one is what the literature emphasizes, but it isn’t enough! Two other factors are also necessary.

Factor 2: Test-Train Feature Relationship
How much the test features vary relative to the training features X; more formally: how x_test projects onto X’s right singular vectors V.

Factor 3: Model Class Limitations
How well the best possible model in the model class can correlate the variance in the training features X with the training regression targets Y; more formally: how the residuals E of the best possible model project onto X’s left singular vectors U.
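In symbols, the three factors can be read off a single expression. This is a sketch under notation that this page does not define, so treat the symbols as assumptions: X is the N×D training matrix with SVD X = UΣVᵀ, non-zero singular values σ_r and rank R; β* denotes the parameters of the best possible model in the class; E = Y − Xβ* are its residuals; and the fitted model is the (minimum-norm) least-squares solution β̂ = X⁺Y. Substituting Y = Xβ* + E and X⁺ = Σ_r (1/σ_r) v_r u_rᵀ gives:

```latex
\hat{y}_{\text{test}} - y^{*}_{\text{test}}
  = x_{\text{test}} \cdot \left( X^{+} X - I_D \right) \beta^{*}
  + \sum_{r=1}^{R}
      \underbrace{\frac{1}{\sigma_r}}_{\text{factor 1}}
      \underbrace{\left( x_{\text{test}} \cdot v_r \right)}_{\text{factor 2}}
      \underbrace{\left( u_r \cdot E \right)}_{\text{factor 3}},
\qquad y^{*}_{\text{test}} = x_{\text{test}} \cdot \beta^{*}
```

The first term vanishes whenever X has full column rank (the underparameterized case). The sum is a product of the three factors mode by mode, so a mode contributes nothing if any one of its factors is zero.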

The Mechanism
Consider a singular mode of X that has a small singular value (factor 1) and onto which the residuals E project (factor 3): the fitted parameters along this mode are likely incorrect. When test data with a large projection along the same mode is added (factor 2), the model is forced to extrapolate significantly beyond what it saw in the training data, in an error-prone direction, and the test loss explodes.
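The mechanism can be checked numerically. The sketch below is illustrative only: the sizes, the seed, the parameter vector `beta_star`, and the use of Gaussian residuals as a stand-in for E are all assumptions, not taken from this page. It fits ordinary least squares in the underparameterized regime, computes each mode's contribution as (1/σ_r) · (x_test · v_r) · (u_r · E), and confirms the contributions sum to the prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 20                          # assumed sizes: more samples than features
X = rng.standard_normal((N, D))        # training features, one row per example
beta_star = rng.standard_normal(D)     # hypothetical best-in-class parameters
E = 0.1 * rng.standard_normal(N)       # stand-in residuals of the best-in-class model
Y = X @ beta_star + E

# Ordinary least squares via the pseudoinverse.
beta_hat = np.linalg.pinv(X) @ Y

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

x_test = rng.standard_normal(D)
factor1 = 1.0 / S                      # inverse singular values
factor2 = Vt @ x_test                  # projections x_test . v_r
factor3 = U.T @ E                      # projections u_r . E
per_mode = factor1 * factor2 * factor3

prediction_error = x_test @ (beta_hat - beta_star)
dominant_mode = int(np.argmax(np.abs(per_mode)))

print("sum of per-mode terms:", per_mode.sum())
print("prediction error     :", prediction_error)
print("mode with largest contribution:", dominant_mode)
```

Zeroing any one of the three arrays for a given mode removes that mode's contribution, which is the sense in which all three factors are needed.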
Why Near the Interpolation Threshold?
We provide geometric intuition for why the smallest non-zero singular value tends to be smallest near the interpolation threshold, where the number of training points matches the number of parameters.
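A quick simulation makes this tendency visible. The sketch assumes i.i.d. standard Gaussian features, D = 20 features, and a median over 50 random draws; the helper name `median_smallest_singular_value` is hypothetical. It is an empirical illustration rather than the geometric argument itself.

```python
import numpy as np

def median_smallest_singular_value(n_samples, n_features, n_trials=50, seed=0):
    """Median (over random draws) of the smallest non-zero singular value of X."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        X = rng.standard_normal((n_samples, n_features))
        # Returns min(n_samples, n_features) singular values,
        # all non-zero with probability one for Gaussian X.
        s = np.linalg.svd(X, compute_uv=False)
        values.append(s.min())
    return float(np.median(values))

D = 20
sample_sizes = [5, 10, 15, 20, 25, 40, 80]
medians = {n: median_smallest_singular_value(n, D) for n in sample_sizes}

for n, m in medians.items():
    print(f"N={n:3d}  median smallest non-zero singular value = {m:.3f}")
```

Under these assumptions the printed values should dip at N = D = 20 and grow on either side of it.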

Adversarial Training Data
We use this viewpoint to construct adversarial training data that sharply increase test loss without noticeably affecting training loss. We can also explain adversarial test examples.
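One way to realize such an attack is sketched below. It is an assumption made for illustration and not necessarily the construction used in the full work: the training targets are shifted by a small amount `delta` along the left singular vector of X with the smallest singular value. In the underparameterized regime this leaves the least-squares training residual unchanged, while the fitted parameters move by `delta / sigma_min` along the matching right singular vector. Sizes, seed, noise level, and the helper `fit_and_score` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 25, 20                               # assumed sizes, close to the threshold
X = rng.standard_normal((N, D))
beta_star = rng.standard_normal(D)          # hypothetical ground-truth parameters
Y = X @ beta_star + 0.1 * rng.standard_normal(N)

X_test = rng.standard_normal((2000, D))
Y_test = X_test @ beta_star

def fit_and_score(Y_train):
    """Fit least squares on (X, Y_train); return (train MSE, test MSE)."""
    beta_hat = np.linalg.pinv(X) @ Y_train
    train_mse = np.mean((X @ beta_hat - Y_train) ** 2)
    test_mse = np.mean((X_test @ beta_hat - Y_test) ** 2)
    return train_mse, test_mse

# Singular values come back sorted descending, so the last column of U
# pairs with the smallest singular value.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
u_min = U[:, -1]

delta = 1.0                                  # small relative to the norm of Y
Y_adv = Y + delta * u_min

train_clean, test_clean = fit_and_score(Y)
train_adv, test_adv = fit_and_score(Y_adv)

print(f"train MSE  clean={train_clean:.5f}  adversarial={train_adv:.5f}")
print(f"test  MSE  clean={test_clean:.5f}  adversarial={test_adv:.5f}")
```

Here the training loss is measured against the targets the model was actually fit on, which is what someone inspecting the training run would see.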

Clarifying Misconceptions
- “Memorization” vs “Generalization” isn’t the right dichotomy. Memorizing solutions can generalize, and often do!
- Noise/randomness is NOT necessary for double descent - what’s necessary is errors by the best possible model in the model class.
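The second point can be illustrated with targets that are fully deterministic functions of the inputs. In the sketch below, every name and setting is an assumption chosen for illustration: `well_specified` is exactly linear, so the best linear model has zero residuals, while `misspecified` adds a quadratic term that no linear model can capture. Neither target contains any noise.

```python
import numpy as np

def median_test_mse(n_train, target_fn, n_features=20, n_trials=30, seed=0):
    """Median test MSE of minimum-norm least squares over random feature draws."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_trials):
        X = rng.standard_normal((n_train, n_features))
        X_test = rng.standard_normal((500, n_features))
        beta_hat = np.linalg.pinv(X) @ target_fn(X)
        losses.append(np.mean((X_test @ beta_hat - target_fn(X_test)) ** 2))
    return float(np.median(losses))

def well_specified(X):
    # Exactly linear: the best linear model makes no errors.
    return X.sum(axis=1)

def misspecified(X):
    # Deterministic quadratic term that a linear model cannot represent.
    return X.sum(axis=1) + 0.5 * (X[:, 0] ** 2 - 1.0)

train_sizes = [5, 10, 20, 40, 100]      # 20 is the interpolation threshold here
curve_well = {n: median_test_mse(n, well_specified) for n in train_sizes}
curve_mis = {n: median_test_mse(n, misspecified) for n in train_sizes}

for n in train_sizes:
    print(f"N={n:3d}  well-specified={curve_well[n]:.3e}  misspecified={curve_mis[n]:.3e}")
```

Under these assumptions the misspecified curve should spike at N = 20 despite the absence of noise, while the well-specified curve falls to essentially zero there instead of spiking.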

Impact
This work requires only linear regression and SVD (no random matrix theory, no replica calculations, no kernel methods) but offers simple, general and intuitive insights. We hope this material will be included in undergrad ML curricula as it’s so foundational!
See the full research page for more details.
