Mercer’s Theorem

Background

Recall that for a positive semi-definite (PSD) matrix \(K \in \mathbb{R}^{D \times D}\), the spectral theorem tells us that the matrix can be written as

\[K = \sum_{d=1}^D \lambda_d \psi_d \psi_d^T\]

where the eigenvalue-eigenvector pairs \((\lambda_d, \psi_d)\) are obtained by solving the characteristic polynomial:

\[det \Big( K \psi - \lambda \psi \Big) = 0\]

Mercer’s Theorem is a generalization to function spaces. Before jumping straight to Mercer’s theorem, we need to first define a notion of positive (semi-)definiteness in function spaces. This is called Mercer’s condition.

Mercer’s Condition

Mercer’s condition is the function-space equivalent of a PSD matrix. Specifically, let \(k: X \times X \rightarrow \mathbb{R}\) be a kernel function. The kernel function is said to meet Mercer’s condition if for all square-integrable functions i.e. \(\forall f \in L^2(X) := \{ f: \int f(i)^2 di < \infty \}\), the following quantity remains non-negative:

\[\int_i \int_j f(i) k(i, j) f(j) di dj \geq 0\]

The equivalent statement in the finite dimensional case is that \(\forall x \in \mathbb{R}^D\),

\[x^T K x \geq 0\]

In the integral, I use the arguments \(i, j\), to remind us that these are “indices” but for infinite dimensional functions.

Definition

Let \(X \subseteq \mathbb{R}^D\) be a compact set and let \(k: X \times X \rightarrow \mathbb{R}\) be a continuous PSD kernel function that meets Mercer’s condition. Then there exists a sequence \((\lambda_r)_{r \geq 1}\) and functions \(\psi_r(\cdot) \in L^2(X)\) such that

\[k(x_1, x_2) = \sum_{r=1}^{\infty} \lambda_r \psi_r(x_1) \psi_r(x_2)\]

Additionally, the functions are orthonormal i.e.

\[\langle \psi_a, \psi_b \rangle_{L^2(X)} = \int \psi_a(i) \psi_b(i) = \delta_{ab}\]

This is a generalization of the finite dimensional statement that

\[x^T K y = x^T \Psi^T \Lambda \Psi y = \sum_{r=1}^D \lambda_r (\phi_r^T x) (\phi_r^T y)\]

Obtaining Kernel Eigenfunctions

To obtain the eigenfunctions of the kernel \(k\), one must first define the Hilbert-Schmidt integral operator:

\[T_k(f)(x) := \int k(x, y) f(y) dy\]

where \(T_k: L^2(X) \rightarrow L^2(X)\) is a linear operator, and then solve the integral equation

\[T_k(f)(x) = \lambda f(x)\]

Feature Map

One can use Mercer’s Theorem to define a feature map using the kernel’s eigenfunctions. Specifically, let \(l^2(\mathbb{N}) := \{ (a_r)_r : \sum_r a_r^2 < \infty \}\) be the set of sequences that are square summable. Define the feature map \(\phi: X \rightarrow l^2(\mathbb{N})\) as:

\[\phi(x) = \Big(\sqrt{\lambda_r} \psi_r(x) \Big)_{r \geq 1}\]

In other words, each \(x \in X\) becomes a sequence \(\phi(x) \in l^2(\mathbb{N})\). If we take the inner product in this feature space, we recover the kernel:

\[\begin{align*} \langle \phi(x_1), \phi(x_2) \rangle_{l^2(\mathbb{N})} &= \sum_r \sqrt{\lambda_r} \psi_r(x_1) \sqrt{\lambda_r}\psi_r(x_2)\\ &= \sum_r \lambda_r \psi_r(x_1) \psi_r(x_2)\\ &= k(x_1, x_2) \end{align*}\]