Definition 1: Let \(X\) be a non-empty set. A function \(k: X \times X \rightarrow \mathbb{R}\) is a kernel function if there exists an \(\mathbb{R}\)-Hilbert space \(H\) and a so-called feature map \(\Phi: X \rightarrow H\) such that \(\forall x, x' \in X\),
\[k(x, x') = \langle \Phi(x), \Phi(x') \rangle\]

Definition 2: A p.d. kernel on a set \(X\) is a function \(K: X \times X \rightarrow \mathbb{R}\) that (1) is symmetric and (2) satisfies, for all \((x_1, ..., x_N) \in X^N\) and all \((a_1, ..., a_N) \in \mathbb{R}^N\):
\[\sum_{ij} a_i a_j K(x_i, x_j) \geq 0\]

Equivalently, for all \((x_1, ..., x_N)\), the matrix \([K]_{ij} := K(x_i, x_j)\) is positive semi-definite.
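To make the equivalence tangible, here is a minimal numpy sketch. It assumes, purely for concreteness, the linear kernel \(k(x, x') = \langle x, x' \rangle\) (which the lemma below certifies is a kernel) and checks both forms of the condition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five points x_1, ..., x_5 in R^3

# Linear kernel: [K]_ij = <x_i, x_j>.
K = X @ X.T

# The quadratic form sum_ij a_i a_j K(x_i, x_j) = a^T K a is non-negative...
a = rng.normal(size=5)
print(a @ K @ a >= 0)  # True

# ...equivalently, the Gram matrix is positive semi-definite.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True, up to round-off
```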
Lemma: Let \(X\) be any set and \(\phi: X \rightarrow \mathbb{R}^D\). Then the function \(K(x_1, x_2) := \langle \phi(x_1), \phi(x_2) \rangle\) is a p.d. kernel. In simple terms, a feature map and an inner product in a Hilbert space give rise to a kernel.
Proof: Symmetry is obvious. Consider
\[\begin{align*} \sum_{ij} a_i a_j K(x_i, x_j) &= \sum_{ij} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle\\ &= \left\langle \sum_i a_i \phi(x_i), \sum_j a_j \phi(x_j) \right\rangle \\ &= \left\lVert \sum_i a_i \phi(x_i) \right\rVert_2^2 \\ &\geq 0 \end{align*}\]

Theorem (Aronszajn, 1950): Informally, the opposite direction is also true: a kernel function implies the existence of a feature map and an inner product in some Hilbert space. Formally, \(K\) is a p.d. kernel on \(X\) if and only if there exists a Hilbert space \(H\) and a mapping \(\phi: X \rightarrow H\) such that \(K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle\).
Proof (finite-dimensional case): Let \(K\) be a p.d. kernel on \(X\). Our goal is to exhibit features in some Hilbert space whose inner product recovers \(K\). Because \(K\) is a p.d. kernel, for any \(x_1, ..., x_N\), the \(N \times N\) matrix \([K]_{ij} := K(x_i, x_j)\) is positive semi-definite. Take its eigendecomposition \(K = \sum_n \lambda_n v_n v_n^T\), where each \(\lambda_n \geq 0\), and consider its \((i, j)\)th element: \([K]_{ij} = \sum_n \lambda_n [v_n]_i [v_n]_j\). Define \(\phi(x_i) = \begin{bmatrix} \sqrt{\lambda_1} v_{1,i} \\ \vdots \\ \sqrt{\lambda_N} v_{N,i} \end{bmatrix}\) (the square roots are real because the eigenvalues are non-negative). Then \([K]_{ij} = \langle \phi(x_i), \phi(x_j) \rangle\) by construction.
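The construction in this proof is directly computable. The numpy sketch below (using the Gaussian kernel from the examples that follow, an assumption made only for concreteness) eigendecomposes a Gram matrix, builds the feature vectors \(\phi(x_i)\), and confirms that their inner products recover \(K\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

# Gaussian kernel Gram matrix: [K]_ij = exp(-||x_i - x_j||^2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# Eigendecomposition K = sum_n lambda_n v_n v_n^T.
lams, V = np.linalg.eigh(K)
lams = np.clip(lams, 0.0, None)  # guard against tiny negative round-off

# Row i is phi(x_i) = (sqrt(lambda_1) v_{1,i}, ..., sqrt(lambda_N) v_{N,i}).
Phi = V * np.sqrt(lams)

# By construction, <phi(x_i), phi(x_j)> recovers K_ij.
print(np.allclose(Phi @ Phi.T, K))  # True
```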
Exponential: \(k(x, x') = \exp(\langle x, x' \rangle)\)
Gaussian: \(k(x,x') = \exp(-\gamma^{-2} \langle x - x', x - x' \rangle)\)
Addition: The sum of two kernels is a kernel. The difference of two kernels is not necessarily a kernel, since \(\sum_{ij} a_i a_j (k_1 - k_2)(x_i, x_j)\) can be negative, violating positive definiteness. (A numerical spot check of these rules appears after this list.)
Positive Scalar Multiplication: A kernel multiplied by a positive scalar is a kernel.
Mapping Between Spaces: If \(X, \tilde{X}\) are sets, \(k\) is a kernel on \(\tilde{X}\) and \(A: X \rightarrow \tilde{X}\) is a map, then \(k(A(x), A(x'))\) is a kernel on \(X\).
Multiplication: If \(k_1\) on \(X_1\) and \(k_2\) on \(X_2\) are kernels, then the tensor product \(k\big((x_1, x_2), (x_1', x_2')\big) := k_1(x_1, x_1') \, k_2(x_2, x_2')\) is a kernel on \(X_1 \times X_2\). If \(X_1 = X_2 = X\), then the pointwise product \(k(x, x') := k_1(x, x') \, k_2(x, x')\) is a kernel on \(X\).
Polynomials: Let \(\Phi(x), \Phi(x') \in \mathbb{R}^d\) for \(d \geq 1\) and let \(m \geq 1\) be an integer and \(c \geq 0\). Then \(k(\Phi(x), \Phi(x')) = (\langle \Phi(x), \Phi(x') \rangle + c)^m\) is a valid kernel.
Taylor Series: Combining polynomials, addition, and positive scalar multiplication (with a limiting argument) shows that Taylor series expansions with positive coefficients are valid kernels.
Positive Definiteness: see below
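None of these closure rules is hard to spot-check numerically. The numpy sketch below verifies, on random points, that the smallest Gram-matrix eigenvalue stays non-negative under addition, positive scaling, elementwise multiplication, and the polynomial construction (a spot check on one random draw, not a proof):

```python
import numpy as np

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
G = X @ X.T  # linear-kernel Gram matrix <x_i, x_j>
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

K1 = np.exp(G)    # exponential kernel
K2 = np.exp(-sq)  # Gaussian kernel (gamma = 1)

tol = -1e-8
print(min_eig(K1 + K2) >= tol)         # addition: still a kernel
print(min_eig(3.0 * K1) >= tol)        # positive scaling: still a kernel
print(min_eig(K1 * K2) >= tol)         # elementwise product: still a kernel
print(min_eig((G + 1.0) ** 3) >= tol)  # polynomial (m = 3, c = 1): still a kernel
```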
Suppose we’re given a function with two arguments \(k(x, x')\). How can we determine whether the function is a valid kernel? We might consider two options.
1. Check the positive definiteness of the function directly.
2. Find a feature map \(\Phi\) satisfying \(k(x, x') = \langle \Phi(x), \Phi(x') \rangle\).
We’ll consider each in turn.
One option for checking whether a symmetric function \(k(x, x')\) is a kernel is to check whether the function is positive definite. A symmetric function \(k: X \times X \rightarrow \mathbb{R}\) is positive definite if \(\forall N \geq 1\), \(\forall (a_1, ..., a_N) \in \mathbb{R}^N\), \(\forall (x_1, ..., x_N) \in X^N\),
\[\sum_{i=1}^N \sum_{j=1}^N a_i a_j k(x_i, x_j) \geq 0\]

Note: The terminology here varies. Some say the function is strictly positive definite if, for mutually distinct \(x_i\), equality holds only when all the \(a_i\) are zero; others call the above definition positive semi-definite and reserve "positive definite" for the strict case. I use the first terminology because it agrees with Arthur Gretton's usage, whose videos I watched to learn this subject. The relevance of positive definiteness is that if a function is a kernel, then it must be positive definite:
Theorem: Let \(X\) be a non-empty set, \(H\) be a Hilbert space and \(\Phi: X \rightarrow H\). Then \(k(x,x') = \langle \Phi(x),\Phi(x')\rangle_H\) is positive definite.
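Positive definiteness is also how candidate functions get ruled out. For example, the distance \(k(x, x') = |x - x'|\) is symmetric but not positive definite: its Gram matrix has zeros on the diagonal, hence zero trace, so any nonzero instance must have a negative eigenvalue. A short numpy check:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])

# Gram matrix of k(x, x') = |x - x'|: symmetric, but its zero trace
# forces a negative eigenvalue, so k is not positive definite.
K = np.abs(x[:, None] - x[None, :])
print(np.linalg.eigvalsh(K))  # smallest eigenvalue is negative (here -2)
```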
TODO: Show proof that the converse is true, i.e., that a positive definite \(k(x, x')\) is an inner product in a unique \(\mathcal{H}\).
Positive definite (PD) kernels are also called reproducing kernels and we’re about to understand why. The key idea is that any PD kernel can be viewed as a dot product in some space.
Suppose we’re given a kernel \(k(\cdot, \cdot)\). How can we see that this kernel defines a dot product in some space? We need three ingredients: a vector/function space \(H\), a candidate function \(\langle \cdot, \cdot \rangle : H \times H \rightarrow \mathbb{R}\), and a proof that this function is indeed an inner product.
First, we’ll define the vector/function space, which is also commonly called the feature space. We define a function \(\Phi: X \rightarrow \mathbb{R}^{X}\), where \(\mathbb{R}^{X}\) denotes the set of functions mapping \(X \rightarrow \mathbb{R}\). This means that for any \(x \in X\), we can define a vector/function \(\Phi(x) = k(\cdot, x)\). I despise this notation, called the canonical feature map, because it obscures that \(\Phi(x)\) is a function with an unfilled argument. If we’re given a set of points \(\{x_i\}_{i}\), we can then define a function space as the span of the corresponding feature maps:
\[\text{span}\{ \Phi(x_i) \} = \text{span} \{ k(\cdot, x_i) \}\]

Second, we need to define an inner product. Let \(f(\cdot) = \sum_i \alpha_i \Phi(x_i)\) and \(g(\cdot) = \sum_j \beta_j \Phi(x_j)\) be two functions in our function space. We define the inner product as:
\[\langle f, g \rangle = \sum_i \sum_j \alpha_i \beta_j k(x_i, x_j)\]

Third, we show that this is indeed an inner product. This requires three properties: symmetry, bilinearity, and positive definiteness (\(\langle f, f \rangle \geq 0\), with equality only when \(f = 0\)). The above definition is clearly symmetric (because \(k\) is) and bilinear, and non-negativity of \(\langle f, f \rangle\) holds because the kernel is positive definite.
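In coordinates, this inner product is just a bilinear form in the coefficient vectors: \(\langle f, g \rangle = \alpha^T K \beta\). A short numpy sketch, assuming the Gaussian kernel for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))

# Gaussian kernel Gram matrix on the points x_1, ..., x_4.
sq = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)
K = np.exp(-sq)

# f = sum_i alpha_i k(., x_i) and g = sum_j beta_j k(., x_j).
alpha = rng.normal(size=4)
beta = rng.normal(size=4)

# <f, g> = sum_ij alpha_i beta_j k(x_i, x_j) = alpha^T K beta.
print(alpha @ K @ beta)

# Non-negativity: <f, f> = alpha^T K alpha >= 0 because K is PSD.
print(alpha @ K @ alpha >= 0)  # True
```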
The opposite direction also holds. Suppose \(\Phi: X \rightarrow H\) is given for some inner product space \(H\). If we define a kernel as \(k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle\), then the kernel is positive definite:
\[\sum_i \sum_j c_i c_j k(x_i, x_j) = \left\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \right\rangle = \left\lVert \sum_i c_i \Phi(x_i) \right\rVert^2 \geq 0\]

To summarize, \(\forall x_i, x_j \in X\):
\[k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \langle k(\cdot, x_i), k(\cdot, x_j) \rangle\]
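A final numpy sketch makes the summary concrete: representing \(k(\cdot, x_i)\) by the \(i\)th standard basis vector of coefficients, the bilinear form \(\alpha^T K \beta\) yields both \(\langle k(\cdot, x_i), k(\cdot, x_j) \rangle = k(x_i, x_j)\) and the reproducing property \(\langle f, k(\cdot, x_j) \rangle = f(x_j)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))

# Gaussian kernel Gram matrix (assumed for concreteness).
sq = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)
K = np.exp(-sq)

# k(., x_i) has coefficient vector e_i, so <k(., x_i), k(., x_j)> = e_i^T K e_j.
i, j = 1, 3
e_i, e_j = np.eye(5)[i], np.eye(5)[j]
print(np.isclose(e_i @ K @ e_j, K[i, j]))  # True

# Reproducing property: <f, k(., x_j)> = f(x_j) for f = sum_i alpha_i k(., x_i).
alpha = rng.normal(size=5)
f_at_xj = alpha @ K[:, j]  # f(x_j) = sum_i alpha_i k(x_i, x_j)
print(np.isclose(alpha @ K @ e_j, f_at_xj))  # True
```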