Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W Rocks, Ila Rani Fiete, Sanmi Koyejo

arXiv preprint / NeurIPS 2023 Workshops (ATTRIB, M3L) Under Review

March 2023


Abstract

Why does double descent happen? This question has been studied for decades, but we provide the simplest possible explanation with the fewest assumptions. Using only linear regression and SVD, we identify 3 general interpretable factors and show all 3 are necessary for double descent to occur.

Summary

Identifying and ablating the sources of double descent using only linear regression and SVD - the simplest possible explanation.

The Puzzle

Why does double descent happen? This question has been studied for decades, but we wanted the simplest possible explanation with the fewest assumptions.

Double descent examples

Our Approach

Using only linear regression and SVD, we identify three general, interpretable factors and show that all three must be present for double descent to occur.

Three factors

Factor 1: Training Feature Variance

How much the training features X vary in each direction; more formally: the inverses of the non-zero singular values of the training features X.

This one is what the literature emphasizes, but it isn’t enough! Two other factors are also necessary.

Factor 1 ablation

Factor 2: Test-Train Feature Relationship

How much the test features vary relative to the training features X; more formally: how x_test projects onto X’s right singular vectors V.

Factor 2 ablation

Factor 3: Model Class Limitations

How well the best possible model in the model class can correlate the variance in the training features X with the training regression targets Y; more formally: how the residuals E of the best possible model project onto X’s left singular vectors U.

Factor 3 ablation
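
The three factors can be read off a single expression. The following is a sketch under assumptions that are not spelled out above: the training targets are written as Y = X β* + E, where β* denotes the best possible model in the (linear) model class and E its residuals on the training set; X has the compact SVD X = Σ_r σ_r u_r v_rᵀ with rank R; and the fitted model is ordinary least squares when there are more samples than features, or the minimum-norm interpolating solution otherwise (with X of full row rank). The symbols β*, R, σ_r, u_r and v_r are notation introduced here for illustration.

```latex
% Underparameterized (more samples than features): ordinary least squares
\hat{y}_{\text{test}} - y^{*}_{\text{test}}
  = x_{\text{test}} \cdot (X^{\top} X)^{-1} X^{\top} E
  = \sum_{r=1}^{R}
    \underbrace{\frac{1}{\sigma_r}}_{\text{Factor 1}}
    \underbrace{\left( x_{\text{test}} \cdot v_r \right)}_{\text{Factor 2}}
    \underbrace{\left( u_r \cdot E \right)}_{\text{Factor 3}}

% Overparameterized (more features than samples): minimum-norm solution
\hat{y}_{\text{test}} - y^{*}_{\text{test}}
  = x_{\text{test}} \cdot \left( X^{\top} (X X^{\top})^{-1} X - I \right) \beta^{*}
  + \sum_{r=1}^{R}
    \frac{1}{\sigma_r}
    \left( x_{\text{test}} \cdot v_r \right)
    \left( u_r \cdot E \right)
```

Here y*_test = x_test · β* is the prediction of the best possible model. Each summand is a product, so if any one of the three factors is small for every mode, the sum stays small; the extra term in the overparameterized case is a bias from directions the training data never covered.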

The Mechanism

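When factors 1 and 3 coincide along a singular mode, that is, the training features have a small non-zero singular value and the best possible model's residuals project strongly onto the matching left singular vector, the fitted parameters along that mode are likely incorrect. When factor 2 is added, i.e. test data with a large projection along the same mode, the model is forced to extrapolate far beyond what it saw in the training data in an error-prone direction, and the test loss explodes.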
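
The mechanism can be checked numerically. The snippet below is a minimal sketch, assuming a synthetic underparameterized problem with Gaussian features and targets Y = X β* + E; the sizes, the noise scale and the names `beta_star`, `per_mode` and `worst_mode` are illustrative choices, not taken from the paper. It compares the test prediction error of ordinary least squares with the sum, over singular modes, of (inverse singular value) × (projection of x_test onto the right singular vector) × (projection of E onto the left singular vector).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 10  # illustrative sizes: more samples than features

X = rng.normal(size=(N, D))
beta_star = rng.normal(size=D)   # stands in for the best model in the class
E = 0.5 * rng.normal(size=N)     # residuals of that best model on the training set
Y = X @ beta_star + E

x_test = rng.normal(size=D)
y_test_star = x_test @ beta_star  # prediction of the best possible model

# Direct route: fit ordinary least squares, then measure the test prediction error.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
direct_error = x_test @ beta_hat - y_test_star

# SVD route: one term per singular mode, each a product of the three factors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
factor1 = 1.0 / S          # inverse singular values of X
factor2 = Vt @ x_test      # projections of x_test onto right singular vectors
factor3 = U.T @ E          # projections of E onto left singular vectors
per_mode = factor1 * factor2 * factor3
svd_error = per_mode.sum()

# The mode contributing most to the error is where all three factors line up.
worst_mode = int(np.argmax(np.abs(per_mode)))
print(f"direct error = {direct_error:.6f}, SVD sum = {svd_error:.6f}")
print(f"largest contribution comes from mode {worst_mode}")
```

Setting any one of `factor1`, `factor2` or `factor3` to zero for a given mode removes that mode's contribution entirely, which is the sense in which all three factors are needed.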

Why Near the Interpolation Threshold?

We provide geometric intuition for why the smallest non-zero singular value of the training features is most likely to take its lowest value near the interpolation threshold.

Interpolation threshold geometry
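
The claim about the interpolation threshold can also be probed empirically. The following is a rough sketch, assuming i.i.d. standard Gaussian features (an assumption made here for illustration) and a fixed number of features D; it averages the smallest non-zero singular value of X over many random draws for each number of training samples N. The helper name and all sizes are hypothetical.

```python
import numpy as np

def smallest_nonzero_singular_value(N, D, rng):
    """Smallest singular value of a random N x D Gaussian feature matrix.

    For Gaussian features, all min(N, D) singular values are non-zero
    with probability one, so the last one returned is the smallest non-zero one.
    """
    X = rng.normal(size=(N, D))
    s = np.linalg.svd(X, compute_uv=False)  # sorted in descending order
    return s[-1]

D = 20                    # number of features (illustrative)
Ns = np.arange(2, 61)     # numbers of training samples to sweep
n_trials = 200            # random draws per value of N
rng = np.random.default_rng(0)

mean_smallest = np.array([
    np.mean([smallest_nonzero_singular_value(N, D, rng) for _ in range(n_trials)])
    for N in Ns
])

N_at_min = int(Ns[np.argmin(mean_smallest)])
print(f"average smallest singular value is lowest at N = {N_at_min} (D = {D})")
```

In this setup the minimum is expected to land at or right next to N = D, where the feature matrix is square. Since factor 1 is the inverse of this quantity, that is where it is largest.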

Adversarial Training Data

We use this viewpoint to construct adversarial training data that sharply increase the model's test loss without noticeably affecting its training loss. The same viewpoint also explains adversarial test examples.

Adversarial training data
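
One way to see how such training data could arise is sketched below. This is an illustrative construction, not necessarily the authors' exact procedure: it perturbs the training targets along the left singular vector of X with the smallest singular value. Because that direction lies in the column space of X, least squares absorbs the perturbation completely, so the training loss is unchanged, while the fitted parameters shift by (perturbation size / smallest singular value) along the matching right singular vector. The sizes, the noise level and `scale` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 20  # illustrative sizes: underparameterized regime

X = rng.normal(size=(N, D))
beta_star = rng.normal(size=D)
Y = X @ beta_star + 0.1 * rng.normal(size=N)

# Held-out data from the same (assumed) linear model, without noise.
X_test = rng.normal(size=(1000, D))
Y_test = X_test @ beta_star

def fit_and_score(Y_train):
    """Fit least squares on (X, Y_train); return (train MSE, test MSE)."""
    beta_hat, *_ = np.linalg.lstsq(X, Y_train, rcond=None)
    train_mse = np.mean((X @ beta_hat - Y_train) ** 2)
    test_mse = np.mean((X_test @ beta_hat - Y_test) ** 2)
    return train_mse, test_mse

# Left singular vector paired with the smallest singular value of X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
u_small = U[:, -1]

scale = 5.0  # hypothetical perturbation size
Y_poisoned = Y + scale * u_small

clean_train, clean_test = fit_and_score(Y)
pois_train, pois_test = fit_and_score(Y_poisoned)

print(f"train MSE: clean {clean_train:.4f} vs poisoned {pois_train:.4f}")
print(f"test  MSE: clean {clean_test:.4f} vs poisoned {pois_test:.4f}")
```

The perturbation targets factor 3 (it places error along a left singular vector) in exactly the mode where factor 1 is largest, which is why a modest change to the targets has an outsized effect on test loss.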

Clarifying Misconceptions

“Memorization” vs “Generalization” isn’t the right dichotomy. Memorizing solutions can generalize, and often do!
Noise/randomness is NOT necessary for double descent. What is necessary is that even the best possible model in the model class makes errors.

Overparameterized generalization
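
The second point can be illustrated with a small simulation. This is a sketch under assumptions made here, not the paper's experiment: the targets are a noiseless linear function of `D_full` Gaussian features, but the model only sees the first `D` of them, so even the best model in the class leaves residuals. The fit uses the pseudoinverse, which gives ordinary least squares when N > D and the minimum-norm interpolating solution when N < D. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D_full, D = 40, 20  # true feature count vs. features visible to the model
w_true = rng.normal(size=D_full) / np.sqrt(D_full)

def median_test_mse(N, n_trials=100, n_test=500):
    """Median test MSE over random training sets of size N (no label noise)."""
    mses = []
    for _ in range(n_trials):
        Z_train = rng.normal(size=(N, D_full))
        Z_test = rng.normal(size=(n_test, D_full))
        y_train = Z_train @ w_true  # deterministic targets: no noise term
        y_test = Z_test @ w_true
        # The model is misspecified: it only has access to the first D features.
        beta_hat = np.linalg.pinv(Z_train[:, :D]) @ y_train
        mses.append(np.mean((Z_test[:, :D] @ beta_hat - y_test) ** 2))
    return float(np.median(mses))

Ns = [5, 10, 15, 20, 25, 40, 80]
curve = {N: median_test_mse(N) for N in Ns}
for N, mse in curve.items():
    print(f"N = {N:3d}  median test MSE = {mse:.3f}")
```

The median is used instead of the mean because the test error is heavy-tailed near N = D. In this setup the curve should peak at N = D and fall off on both sides, even though nothing in the data-generating process is random once the features are drawn.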

Impact

This work requires only linear regression and SVD (no random matrix theory, no replica calculations, no kernel methods) but offers simple, general and intuitive insights. We hope this material will be included in undergrad ML curricula as it’s so foundational!
