Rylan Schaeffer



Empirical Risk Minimization

Suppose there is a joint distribution \(D\) over \(\mathcal{X} \times \mathcal{Y}\) of interest. We are given samples \(\{(x_n, y_n)\}_{n=1}^N\) drawn i.i.d. from \(D\), and we would like to find a function/model/hypothesis/predictor \(h: \mathcal{X} \rightarrow \mathcal{Y}\), chosen from some hypothesis class \(\mathcal{H}\), that is “close” to the true conditional relationship, where closeness is measured by a loss function \(\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+\).

The goal is to find \(h\) that minimizes the (population) risk

\[L(h) := \mathbb{E}_{(x,y) \sim D}[\ell(h(x), y)]\]

Define the excess risk for any hypothesis \(h\) as:

\[E(h) := L(h) - \inf_{g \in \mathcal{H}} L(g)\]

Empirical risk minimization (ERM) says to choose the hypothesis \(\hat{h}\) that minimizes the empirical risk:

\[\hat{h} := \argmin_{h \in \mathcal{H}} \hat{L}(h), \qquad \hat{L}(h) := \frac{1}{N} \sum_{n=1}^N \ell(h(x_n), y_n)\]
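
To make this concrete, here is a minimal numerical sketch of ERM for a linear hypothesis class under squared loss (the synthetic data, noise level, and variable names are my own illustration, not part of the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = <x, theta_true> + noise.
N, D = 200, 5
theta_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)

def empirical_risk(theta, X, y):
    """Empirical risk L_hat of the linear hypothesis h(x) = <x, theta> under squared loss."""
    return np.mean((X @ theta - y) ** 2)

# ERM over the linear hypothesis class: minimizing L_hat is an ordinary least-squares problem.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(empirical_risk(theta_hat, X, y))   # near the noise variance (0.01)
print(empirical_risk(theta_true, X, y))  # at least as large: theta_hat minimizes the empirical risk
```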

Asymptotic Excess Risk

A key question is: under what conditions does the excess risk of the empirical risk minimizer \(\hat{h}^{(N)}\) (the ERM solution computed from \(N\) samples) go to 0 as \(N \rightarrow \infty\)? That is, when do we have

\[E(\hat{h}^{(N)}) = f(N)\]

where \(f(N) \rightarrow 0\) as \(N \rightarrow \infty\). Equivalently, when does

\[\lim_{N \rightarrow \infty} L(\hat{h}^{(N)}) = \inf_{h \in \mathcal{H}} L(h)\]

Example: Noiseless Linear Regression with Isotropic Gaussian Covariates

Let \(x_n \sim \mathcal{N}(0, I_D)\) and \(y_n = \langle x_n, \theta^* \rangle\), let \(\ell(a, b) = (a-b)^2\), and take the hypothesis class to be linear predictors \(h_\theta(x) = \langle x, \theta \rangle\). The empirical risk minimizer (taking the minimum-norm solution when it is not unique) is the ordinary least squares (OLS) estimator, which in the noiseless case can be written

\[\hat{\theta} = X^+ y = (X^T X)^+ X^T X \, \theta^*\]

where \(X \in \mathbb{R}^{N \times D}\) stacks the covariates row-wise and \(^+\) denotes the Moore–Penrose pseudoinverse.
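
As a quick sanity check (a sketch using NumPy's pseudoinverse; the dimensions are arbitrary choices of mine), the minimum-norm solution \(X^+ y\) matches \((X^T X)^+ X^T X \theta^*\) in the noiseless case, in both the \(N > D\) and \(N < D\) regimes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check both the underparameterized (N > D) and overparameterized (N < D) regimes.
for N, D in [(50, 10), (10, 50)]:
    theta_star = rng.normal(size=D)
    X = rng.normal(size=(N, D))   # rows are isotropic Gaussian covariates
    y = X @ theta_star            # noiseless labels

    # Minimum-norm least-squares solution via the pseudoinverse of X.
    theta_hat = np.linalg.pinv(X) @ y

    # The form used in the text: (X^T X)^+ X^T X theta^*.
    theta_hat_alt = np.linalg.pinv(X.T @ X) @ X.T @ X @ theta_star

    print(N, D, np.allclose(theta_hat, theta_hat_alt))  # True in both regimes
```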

If \(N \geq D\), then \(X^T X\) is invertible almost surely, and the excess risk is 0. But if \(N < D\), then the expected loss is:

\[\begin{align*} \mathbb{E}_{X}[ L(\hat{h})] &= \mathbb{E}_{x, X}\big[\langle x, (I - (X^T X)^+ X^T X) \theta^* \rangle^2\big]\\ &= \mathbb{E}_{X} \big[\lVert (I - (X^T X)^+ X^T X) \theta^* \rVert_2^2\big]\\ &= (\theta^*)^T \, \mathbb{E}_{X} \big[I - (X^T X)^+ X^T X\big] \, \theta^*\\ &= \Big(1 - \frac{N}{D}\Big) \lVert \theta^* \rVert_2^2 \end{align*}\]

where the second equality uses \(\mathbb{E}_x[x x^T] = I_D\), and the third uses the fact that \(I - (X^T X)^+ X^T X\) is an orthogonal projection (symmetric and idempotent).

And thus the excess risk goes to \(0\) as \(N \rightarrow D\). For the final equality, note that the spectrum of the hat matrix \((X^T X)^+ X^T X\) does not depend on the particular draw of the training data \(X\): being an orthogonal projection of rank \(R := \operatorname{rank}(X) = \min(N, D)\) (almost surely), it always has \(R\) eigenvalues equal to one and \(D - R\) equal to zero. Only the orientation of the projection is random, and by the rotational symmetry of the Gaussian covariates, \(\mathbb{E}_X[(X^T X)^+ X^T X] = \frac{R}{D} I_D\).
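
The closed form \(\left(1 - \frac{N}{D}\right) \lVert \theta^* \rVert_2^2\) is easy to verify by simulation; the sketch below (with arbitrary dimensions and trial counts of my choosing) averages the loss over draws of \(X\) and compares it to the formula:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 100
theta_star = rng.normal(size=D)
num_trials = 200

for N in [25, 50, 75, 100]:
    losses = []
    for _ in range(num_trials):
        X = rng.normal(size=(N, D))
        theta_hat = np.linalg.pinv(X) @ (X @ theta_star)  # min-norm OLS, noiseless labels
        # For x ~ N(0, I_D), E_x[<x, theta_hat - theta_star>^2] = ||theta_hat - theta_star||_2^2.
        losses.append(np.sum((theta_hat - theta_star) ** 2))
    predicted = (1 - N / D) * np.sum(theta_star ** 2)
    print(N, round(np.mean(losses), 2), round(predicted, 2))  # the two values should roughly agree
```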

Properties