At a high level, kernel methods are a machine learning approach that works by first transforming data (images, songs) into (possibly infinite dimensional) feature spaces and then comparing the similarity of features through an inner product. Consequently, kernel methods are typically applicable when data is rotationally invariant.
Two important prerequisites to kernels are inner product spaces and Hilbert spaces.
Simply, a kernel is a dot product between features of objects. Formally, let \(X\) be a non-empty set. A function \(k: X \times X \rightarrow \mathbb{R}\) is a kernel if there exists an \(\mathbb{R}\)-Hibert space \(H\) and a so-called feature map \(\Phi: X \rightarrow H\) such that \(\forall x, x' \in X\),
\[k(x, x') = \langle \Phi(x), \Phi(x') \rangle\]Addition: The sum of two kernels is a kernel. The difference of two kernels is not necessarily a kernel (as the difference may violate that inner products remain non-negative).
Positive Scalar Multiplication: A kernel multiplied by a positive scalar is a kernel.
Mapping Between Spaces: If \(X, \tilde{X}\) are sets, \(k\) is a kernel on \(\tilde{X}\) and \(A: X \rightarrow \tilde{X}\) is a map, then \(k(A(x), A(x'))\) is a kernel on \(X\).
Multiplication: If \(k_1\) on \(X_1\) and \(k_2\) on \(X_2\) are kernels, then \(k_1 \times k_2\) is a kernel on \(X_1 \times X_2\). If \(X_1=X_2=X\), then \(k = k_1 \times k_2\) is a kernel on \(X\).
Polynomials: Let \(\Phi(x), \Phi(x') \in \mathbb{R}^d\) for \(d \geq 1\) and let \(m \geq 1\) be an integer and \(c \geq 0\). Then \(k(\Phi(x), \Phi(x')) = (\langle \Phi(x), \Phi(x') \rangle + c)^m\) is a valid kernel.
Taylor Series: Combining polynomials with positive scalar multiplication shows that Taylor series expansions with positive coefficients are valid kernels.
Positive Definiteness: see below
Exponential: \(k(x, x') = \exp(\langle x, x' \rangle)\)
Gaussian: \(k(x,x') = \exp(-\gamma^{-2} \langle x - x', x - x' \rangle)\)
Suppose we’re given a function with two arguments \(k(x, x')\). How can we determine whether the function is a valid kernel? We might consider two options.
Check the positive definiteness of the kernel.
Find a feature map \(\Phi\) satisfying the requirement that \(k(x,x') = \langle \Phi(x),\Phi(x') \rangle\).
We’ll consider each in turn.
One option to check whether a symmetric function \(k(x,x')\) is a kernel is to check whether the function is positive definite. A symmetric function \(k: X \times X \rightarrow \mathbb{R}\) is positive definite if \(\forall n \geq 1, \, \forall (a_1, ..., a_n) \in \mathbb{R}^N,\, \forall (x_1,...,x_N) \in X^n\)
\[\sum_{i=1}^N \sum_{j=1}^N a_i a_j k(x_i, x_j) \geq 0\]Note: The terminology here can vary. Some say that the function is strictly positive definite if equality holds only when all the \(a_i\) are zero for mutually distinct \(x_i\), whereas others call the above definition positive semi-definite and then label strictly positive definite as positive definite. I use the first terminology because it agrees with Arthur Gretton’s usage, whose videos I watched to learn this subject. The relevance of positive definiteness is that if a function is a kernel, then it must be positive definite:
Theorem: Let \(X\) be a non-empty set, \(H\) be a Hilbert space and \(\Phi: X \rightarrow H\). Then \(k(x,x') = \langle \Phi(x),\Phi(x')\rangle_H\) is positive definite.
TODO: Show proof that converse is true i.e. that in positive definite \(k(x,x')\) is inner product in a unique \(\mathcal{H}\).
Positive definite (PD) kernels are also called reproducing kernels and we’re about to understand why. The key idea is that any PD kernel can be viewed as a dot product in some space.
Suppose we’re given a kernel \(k(\cdot, cdot)\). How can we see that this kernel defines a dot product in some space? We need three ingredients: a vector/function space, a function \(\langle \cdot, \cdot \rangle : X \times X \rightarrow \mathbb{R}\) and a proof that this function is indeed an inner product.
First, we’ll define the vector/function space, which is also commonly called the feature space. We define a function \(\Phi: X \rightarrow \mathbb{R}^{X}\), where \(\mathbb{R}^{X}\) denotes a function mapping \(X \rightarrow R\). This means that for any \(x \in X\), we can define a vector/function \(\Phi(x) = k(\cdot, x)\). I despise this notation, called the canonical feature map, because the notation obscures that \(\Phi(x)\) is a function with an unfilled argument. If we’re given a set of \(\{x_i\}_{i}\), we can then define a function space as the span of the corresponding feature maps:
\[\text{span}\{ \Phi(x_i) \} = \text{span} \{ k(\cdot, x_i) \}\]Second, we need to define an inner product. Let \(f(\cdot) = \sum_i \alpha_i \Phi(x_i)\) and let \(g(\cdot) = \sum_i \beta_i \Phi(x_i)\) be two functions in our function space. We define the inner product as:
\[\langle f, g \rangle = \sum_i \sum_j \alpha_i \overbar{\beta_j} k(x_i, x_j)\]Third, we show that this inner product is indeed an inner product. This requires meeting three properties: Hermitian symmetry, conjugate bilinearity and non-negativity, with equality to zero implying the arguments are the same. The above definition is clearly Hermitian symmetric and conjugate bilinear, and we know that non-negativity is met because the kernel is positive definite.
The opposite direction also holds. Suppose \(\Phi: X \rightarrow \mathbb{R}\) is given. If we define a kernel as \(k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle\), then the kernel is positive definite:
\[\sum_i \sum_j c_i c_j k(x_i, x_j) = \langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \rangle = || \sum_i c_i \Phi(x_i)||^2 \geq 0\]To summarize, $\forall x_i, x_j \in X$$:
\[k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \langle k(\cdot, x_i), k(\cdot, x_j) \rangle\]We finally reach the definition of a Reproducing Kernel Hilbert Space (RKHS).
Formally, let \(X\) be a non-empty set and \(H\) a Hilbert space of functions \(f: X \rightarrow \mathbb{R}\). H is called a RKHS if there exists a function \(k: X \times X \rightarrow \mathbb{R}\) (i.e. a kernel) with the following properties:
\(k\) is reproducing i.e. \(\forall f(\cdot) \in H, \langle f(\cdot), k(\cdot, x) \rangle = f(x)\). Importantly, \(\langle k(x_i, \cdot), k(x_j, \cdot) \rangle = k(x_i, x_j)\).
\(k\) spans \(H\) i.e. \(H = \overbar{\text{span} \{k(x, \cdot) | x \in X \}}\), where the overline denotes completion of the space i.e. adding all limits of all Cauchy sequences.