
# Gaussian Processes

Parent: Stochastic Processes

A Gaussian process (GP) is a stochastic process such that any finite collection of its random variables has a multivariate Gaussian distribution. Specifically, the set of random variables $$\{f(x) : x \in X \}$$ is said to be drawn from a GP with mean function $$\mu(\cdot)$$ and covariance function $$k(\cdot, \cdot)$$ if for any $$x_1, ..., x_m \in X$$

$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_m) \end{bmatrix} \sim \mathcal{N} \Big( \begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_m) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_m) \\ \vdots & & \vdots \\ k(x_m, x_1) & \dots & k(x_m, x_m) \end{bmatrix} \Big)$

The following notation is used as shorthand:

$f(\cdot) \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$

Gaussian processes are a generalization of a finite-dimensional multivariate Gaussian random vector. We can think of each $$f(x)$$ as the function $$f(\cdot)$$ evaluated at index element $$x$$. This is why GPs are frequently described as “distributions over functions.”

In order for a GP to be valid, the covariance function $$k(\cdot, \cdot)$$ must produce a valid covariance matrix when evaluated at any $$x_1, ..., x_m$$. A function $$k$$ is therefore a valid covariance function if and only if the kernel matrix $$K$$ it induces is positive semidefinite for any choice of points $$x_1, ..., x_m$$.
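As a concrete check, here is a minimal NumPy sketch (assuming the common squared-exponential kernel; the `lengthscale` and `variance` parameters are illustrative choices, not from the text) that builds a kernel matrix and verifies it is positive semidefinite:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: k(x, x') = v * exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Kernel matrix at arbitrary points: must be symmetric positive semidefinite.
X = np.random.default_rng(0).uniform(-3, 3, size=(50, 1))
K = rbf_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True: all eigenvalues are (numerically) nonnegative
```

Any kernel whose matrix passed this test for every possible set of inputs would define a valid GP covariance; the squared-exponential kernel satisfies this by construction.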

## GP Regression

One common application of GPs is regression. GP Regression can be viewed as Bayesian regression where the parameters are marginalized out. As a warm-up, let’s quickly review how Bayesian linear regression works. Suppose we have some data set $$X \in \mathbb{R}^{n \times d}, Y \in \mathbb{R}^{n}, \theta \in \mathbb{R}^d$$ and we’d like to model $$Y$$ as a linear function of $$X$$ with additive errors $$\epsilon \in \mathbb{R}^n$$:

$Y = X \theta + \epsilon$

Assuming our data are sampled i.i.d., the parameter posterior is given by

$p(\theta | X, Y) = \frac{p(\theta) \prod_n p(y_n|\theta, x_n) }{\int_{\theta} p(\theta) \prod_n p(y_n|\theta, x_n) d\theta}$

and the posterior predictive is given by

$p(y_{new}|x_{new}, X, Y) = \int_{\theta} p(y_{new}|x_{new}, \theta) p(\theta | X, Y) d\theta$

If we place Gaussian priors on the errors $$\epsilon$$ and the parameters $$\theta$$, then both the parameter posterior and the posterior predictive are Gaussian. For instance, suppose

\begin{align*} \theta &\sim \mathcal{N}(0, \tau^2 I)\\ \epsilon &\sim \mathcal{N}(0, \sigma^2 I) \end{align*}

Then the parameter posterior and the posterior predictive are

\begin{align*} \theta | X, Y &\sim \mathcal{N}(\frac{1}{\sigma^2} A^{-1} X^T Y, A^{-1})\\ y_{new} | x_{new}, X, Y &\sim \mathcal{N}(\frac{1}{\sigma^2} x_{new}^T A^{-1} X^T Y, x_{new}^T A^{-1} x_{new} + \sigma^2) \end{align*}

where $$A = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2}I$$. With Gaussian processes, we can work directly with the posterior predictive, because for any training data $$(X_{train}, Y_{train})$$ and any test data $$(X_{test}, Y_{test})$$, the stacked vector of training and test outputs is jointly Gaussian. Specifically, suppose we model $$y_n$$ as some function of $$x_n$$ with additive Gaussian noise $$\epsilon_n \sim \mathcal{N}(0, \sigma^2)$$:

$y_n = f(x_n) + \epsilon_n$

We place a zero-mean Gaussian process prior on the function $$f$$:

$f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$
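Sampling from this prior makes the "distribution over functions" idea concrete: evaluating the GP at a finite grid of inputs reduces it to an ordinary multivariate Gaussian draw. A sketch, again assuming a squared-exponential kernel (the grid size and jitter are illustrative choices):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-0.5 * sq / lengthscale**2)

# Evaluating the GP prior at m inputs gives an m-dimensional Gaussian;
# "sampling a function" means sampling that finite-dimensional vector.
x = np.linspace(-3, 3, 100)[:, None]
K = rbf_kernel(x, x) + 1e-10 * np.eye(100)  # small jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(100), K, size=3)  # 3 prior function draws
print(samples.shape)  # (3, 100)
```

Each row of `samples` is one (discretized) function drawn from the prior; plotting the rows against `x` gives the familiar smooth random curves.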

Because the sum of independent Gaussian random variables is itself Gaussian, the stacked training and test outputs are jointly Gaussian:

$\begin{bmatrix}Y_{train} \\ Y_{test} \end{bmatrix} \sim \mathcal{N} \Big( \begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} K(X_{train}, X_{train}) + \sigma^2 I & K(X_{train}, X_{test}) \\ K(X_{test}, X_{train}) & K(X_{test}, X_{test}) + \sigma^2 I \end{bmatrix} \Big)$
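Conditioning this joint Gaussian on the observed $$Y_{train}$$ (via the standard Gaussian conditioning formula) yields the posterior predictive in closed form. A hedged sketch of that computation, with a squared-exponential kernel and toy sine data as illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=20)  # noisy observations
X_test = np.linspace(-3, 3, 50)[:, None]
sigma2 = 0.1**2

K = rbf_kernel(X_train, X_train) + sigma2 * np.eye(20)  # K(X_train, X_train) + noise
K_s = rbf_kernel(X_train, X_test)                       # K(X_train, X_test)
K_ss = rbf_kernel(X_test, X_test)                       # K(X_test, X_test)

# Condition the joint Gaussian on the observed training outputs:
# mean = K_s^T (K + s^2 I)^{-1} y,  cov = K_ss + s^2 I - K_s^T (K + s^2 I)^{-1} K_s
mean = K_s.T @ np.linalg.solve(K, y_train)
cov = K_ss + sigma2 * np.eye(50) - K_s.T @ np.linalg.solve(K, K_s)
print(mean.shape, cov.shape)  # (50,) (50, 50)
```

`mean` interpolates the training data smoothly, and the diagonal of `cov` gives the predictive variance, which shrinks near observed inputs and grows far from them.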

For a more complete explanation, see Stanford CS229's notes.

TODO: finish