# Metric Embeddings

Oftentimes, we have some set with a notion of distance between elements, i.e. a metric space, and we want to embed that metric space inside another, either for easier analysis or for visualization. For example, we may want to convert a graph metric space into an $$\ell_p$$ metric space.

## Isometric Embeddings

Given 2 metric spaces $$(X, d_X), (Y, d_Y)$$, a function $$f: X \rightarrow Y$$ is an isometric embedding iff $$\forall x, x' \in X$$:

$d_Y(f(x), f(x')) = d_X(x, x')$

Any finite metric space can always be isometrically embedded into $$\ell_{\infty}$$. The embedding is as follows: Given $$(X, d)$$ with $$|X| = n$$, define $$f: X \rightarrow \mathbb{R}^n$$ by

$f(x_i) := [d(x_1, x_i), d(x_2, x_i), ..., d(x_n, x_i)]$

This is called a Fréchet embedding, and has the property that $$d_{\infty}(f(x_i), f(x_j)) = d(x_i, x_j)$$. However, an isometric embedding into a given target metric space doesn't always exist. For instance, the cycle graph on 4 vertices with unit edge weights (under the shortest-path metric) cannot be isometrically embedded into $$\ell_2$$. This raises the prospect of "low distortion" embeddings.
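
As a quick illustration, here's a minimal numpy sketch of the Fréchet embedding (the function and variable names are mine, not from any particular library):

```python
import numpy as np

def frechet_embedding(D):
    # For a symmetric distance matrix D, row i is already
    # f(x_i) = (d(x_1, x_i), ..., d(x_n, x_i)).
    return np.asarray(D, dtype=float)

# Example: shortest-path metric of a path graph on 3 vertices.
D = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
F = frechet_embedding(D)

# Verify the isometry: l_infinity distances between rows match d exactly.
n = len(D)
for i in range(n):
    for j in range(n):
        assert np.max(np.abs(F[i] - F[j])) == D[i, j]
```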

## Low Distortion Embeddings

Given 2 metric spaces $$(X, d_X), (Y, d_Y)$$, a function $$f: X \rightarrow Y$$ is an embedding of $$X$$ into $$Y$$ with distortion $$\alpha$$ and scaling $$\beta$$ if $$\forall x, x' \in X$$:

$\beta d_X(x, x') \leq d_Y(f(x), f(x')) \leq \alpha \beta d_X(x, x')$
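
Since $$\beta$$ is free, the distortion of a concrete map can be read off from the spread of its per-pair expansion ratios. A hypothetical helper, assuming the original and embedded pairwise distances are given as matrices `DX` and `DY`:

```python
import numpy as np

def distortion(DX, DY):
    """Smallest alpha such that, for some scaling beta,
    beta * d_X <= d_Y <= alpha * beta * d_X for all pairs."""
    off_diag = ~np.eye(len(DX), dtype=bool)   # ignore the zero diagonal
    ratios = DY[off_diag] / DX[off_diag]      # per-pair expansion factors
    # The optimal beta is the smallest ratio, so the distortion is the
    # spread between the largest and smallest expansion factors.
    return ratios.max() / ratios.min()
```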

## Bourgain’s Embedding

Given any finite metric space $$(X, d)$$ with $$|X| = n$$, there exists an embedding of $$(X, d)$$ into $$(\mathbb{R}^k, d_1)$$ where $$k = O(\log^2 n)$$ (also works for $$\ell_p$$) with distortion $$O(\log n)$$.

### Algorithm for Bourgain’s Embedding

1. For $$i = 1, ..., \log n$$:
   1. For $$j = 1, ..., c \log n$$:
      1. Choose $$S_{i, j} \subseteq X$$ at random such that every $$x \in X$$ is contained in $$S_{i,j}$$ independently with probability $$2^{-i}$$

Define $$f(x) = [d(x, S_{1,1}), d(x, S_{1,2}), ..., d(x, S_{\log n, c \log n})]$$ where $$d(x, S) = \min_{s \in S} d(x, s)$$.
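
Here is a minimal numpy sketch of the algorithm (the zero-coordinate convention for an empty $$S_{ij}$$ and the default constant $$c = 3$$ are my own choices, not forced by the analysis):

```python
import numpy as np

def bourgain_embedding(D, c=3, seed=0):
    # Embed an n-point metric (distance matrix D) into (R^k, l_1),
    # following the algorithm above; k = O(log^2 n) coordinates.
    rng = np.random.default_rng(seed)
    n = len(D)
    levels = max(1, int(np.ceil(np.log2(n))))
    reps = max(1, int(np.ceil(c * np.log2(n))))
    coords = []
    for i in range(1, levels + 1):
        for _ in range(reps):
            # Each point joins S_ij independently with probability 2^{-i}.
            S = rng.random(n) < 2.0 ** (-i)
            if S.any():
                coords.append(D[:, S].min(axis=1))  # d(x, S_ij) for every x
            else:
                coords.append(np.zeros(n))  # convention for an empty S_ij
    return np.stack(coords, axis=1)  # row x of the result is f(x)
```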

### Intuition for Bourgain’s Embedding

In order to show that an embedding is low-distortion, we need to show that distances neither expand nor contract by too much. Let's first consider expansion. Consider $$f_S(x) := d(x, S)$$, i.e. the distance from $$x$$ to the point in the set $$S$$ nearest $$x$$. This distance has the property of being non-expanding:

$\forall x, y \in X, \quad |f_S(x) - f_S(y)| \leq d(x, y)$

Thus, each coordinate of Bourgain's Embedding is non-expanding, so the whole embedding expands $$\ell_1$$ distances by at most a factor of $$k$$. Next, we argue that Bourgain's Embedding isn't too "shrinking," i.e. that $$|f_S(x) - f_S(y)| \geq \Delta$$ for a suitable $$\Delta$$. The argument consists of two parts: First, we assume a particular picture and argue that under it, the embedding isn't too shrinking. Second, we argue that this particular picture is highly likely.

First, picture a ball around $$x$$ and another ball around $$y$$ with slightly different radii. Call the balls $$B_x := \{z \in X: d(z, x) \leq \delta \}$$ and $$B_y := \{z \in X: d(z, y) \leq \delta + \Delta\}$$. Imagine that the set $$S$$ intersects $$B_x$$ but not $$B_y$$. In this picture, we know that $$d(x, S) \leq \delta$$, since $$S$$ intersects $$B_x$$, and we know that $$d(y, S) > \delta + \Delta$$, since $$S$$ does not intersect $$B_y$$. Consequently:

$d(y, S) - d(x, S) \geq \delta + \Delta - \delta = \Delta$

This means that, in this picture, the embedded coordinate preserves at least $$\Delta$$ of the original distance. By constructing Bourgain's Embedding in the above manner, this picture turns out to occur with high probability. The inner loop over $$j$$ tries to find this good picture, while the outer loop over $$i$$ tries to find the right density for $$S$$, which depends on $$d(x,y)$$ in the data.

### Formal Proof for Bourgain’s Embedding

Bourgain’s Theorem: Let $$k = c \log^2 n$$ be the number of sets $$S_{ij}$$. Let $$f$$ be Bourgain’s embedding. There exists a constant $$b$$ such that $$\forall x, y \in X$$:

$\frac{k}{b \log n} d(x, y) \leq ||f(x) - f(y)||_1 \leq k d(x, y)$

Proof: Starting with the upper bound, it’s enough to show that for every $$S \subset X$$:

$|d(x, S) - d(y, S)| \leq d(x, y)$

If that were true, then:

$||f(x) - f(y) ||_1 = \sum_{ij} |d(x, S_{ij}) - d(y, S_{ij}) | \leq k d(x, y)$

The proof is via the triangle inequality. Because the metric $$d$$ must obey the triangle inequality, the distance from $$x$$ to the nearest element of $$S$$ is at most the distance from $$x$$ to $$y$$ plus the distance from $$y$$ to the point in $$S$$ nearest $$y$$.
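
Concretely, if $$s^*$$ is the point of $$S$$ closest to $$y$$, then

$d(x, S) \leq d(x, s^*) \leq d(x, y) + d(y, s^*) = d(x, y) + d(y, S)$

and swapping the roles of $$x$$ and $$y$$ gives $$|d(x, S) - d(y, S)| \leq d(x, y)$$. Turning next to the lower bound, we want to show: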

$\frac{k}{b \log n} d(x, y) \leq ||f(x) - f(y)||_1$

Recall, in our good situation, $$|d(x, S) - d(y, S)| \geq \Delta$$. We now choose a bunch of different $$\delta$$ and show that the good situation occurs with decent probability. Fix $$x, y \in X$$. Set $$\delta_0 = 0$$ and choose $$\delta_1 < \delta_2 < ...$$ such that $$\delta_i$$ is the smallest value for which $$B(x; \delta_i) := \{ z \in X : d(x, z) \leq \delta_i \}$$ and $$B(y; \delta_i)$$ both contain at least $$2^i$$ points. We keep choosing $$\delta_i$$ as long as $$\delta_i < d(x, y) / 3$$; let $$t$$ be the last such index, and set $$\delta_{t+1} = d(x, y) / 3$$. There are two subclaims: that in the good situation $$|d(x, S_{ij}) - d(y, S_{ij})| \geq \delta_{i+1} - \delta_i$$, and that the nice picture above is decently likely.

Why is the nice situation decently likely? By the minimality of $$\delta_{i+1}$$, either fewer than $$2^{i+1}$$ points are in $$B(x, \delta_{i+1})$$ or the same holds for $$B(y, \delta_{i+1})$$; WLOG, assume this occurs for $$y$$. By definition of $$\delta_i$$, at least $$2^{i}$$ points live in $$B(x, \delta_i)$$. Since $$\mathbb{P}[x \in S_{ij}] = 2^{-i}$$ for every $$x \in X$$, and the two balls are disjoint, the probability of the good situation factors into the probability that $$B(x, \delta_i)$$ intersects $$S_{ij}$$ times the probability that $$B(y, \delta_{i+1})$$ does not intersect $$S_{ij}$$:

$\mathbb{P}[\text{good picture}] = \mathbb{P}[B(x, \delta_i) \cap S_{ij} \neq \emptyset] \cdot \mathbb{P}[B(y, \delta_{i+1}) \cap S_{ij} = \emptyset] \geq \left(1 - (1 - 2^{-i})^{2^i}\right) (1 - 2^{-i})^{2^{i+1}} \geq 2^{-5}$
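
To see where the constant comes from (the balls are disjoint because their radii sum to less than $$d(x, y)$$, so the two events are independent): for $$i \geq 1$$, $$(1 - 2^{-i})^{2^i}$$ lies between $$1/4$$ and $$e^{-1}$$, so

$1 - (1 - 2^{-i})^{2^i} \geq 1 - e^{-1} \geq \frac{1}{2}, \qquad (1 - 2^{-i})^{2^{i+1}} = \left( (1 - 2^{-i})^{2^i} \right)^2 \geq \frac{1}{16}$

and the product is at least $$2^{-1} \cdot 2^{-4} = 2^{-5}$$.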

Fix $$i$$. By a Chernoff bound, the probability that at least a $$2^{-6}$$ fraction of the $$S_{ij}$$ (over $$j$$) produce the "nice" situation is

$1 - \mathbb{P}\left[\sum_{j=1}^{c \log n} \mathbb{I}[\text{nice}] \leq \frac{1}{2} \cdot \frac{c \log n}{2^5}\right] \geq 1 - \exp\left(-\frac{c \log n}{8 \cdot 2^5}\right) \geq 1 - n^{-3}$

if we choose $$c \geq 3 \cdot 2^8$$. Putting it together, we showed that when our nice situation occurs, $$|d(x, S) - d(y, S)| \geq \delta_{i+1} - \delta_i$$, and that for each $$i$$, the probability that at least $$c \log n / 2^6$$ of the $$S_{ij}$$ have this nice situation is at least $$1 - n^{-3}$$. Then, by a union bound over all $$n^2$$ pairs $$(x, y)$$ and $$\leq \log n$$ choices of $$i$$, with high probability, $$\forall i, x, y$$, at least $$c \log n / 2^6$$ of the $$S_{ij}$$ have $$|d(x, S_{ij}) - d(y, S_{ij})| \geq \delta_{i+1} - \delta_i$$, meaning:

\begin{align*} ||f(x) - f(y)||_1 &\geq \sum_{ij} |d(x, S_{ij}) - d(y, S_{ij})| \geq \sum_{i} \frac{c \log n}{2^6} (\delta_{i+1} - \delta_i)\\ &= \frac{c \log n}{2^6} \delta_{t+1}\\ &= \frac{c \log n}{3 \cdot 2^6} d(x, y) \end{align*}

Recalling that $$k = c \log^2 n$$, we have

$||f(x) - f(y)||_1 \geq \frac{k}{3 \cdot 2^6 \log n} d(x, y)$

## Johnson-Lindenstrauss Lemma

Bourgain's Embedding says we can embed any finite metric space into $$\ell_1$$ with distortion $$O(\log n)$$. One naturally asks whether a better embedding is possible. In general, the answer is no, but for more structured metric spaces, the answer is yes. The Johnson-Lindenstrauss (JL) Lemma tells us that if we have $$n$$ points that live in an $$\ell_2$$ space of dimension $$d$$, then we can embed those points into dimension $$O(\frac{\log n}{\epsilon^2})$$ with distortion $$1 + \epsilon$$. Formally:

Theorem: Given $$\epsilon \in (0, 1)$$ and a set $$X \subset \mathbb{R}^d$$ with $$|X| = n$$, there exists a linear map $$f: \mathbb{R}^d \rightarrow \mathbb{R}^m$$ where $$m = O(\frac{\log n}{\epsilon^2})$$ that embeds $$(X, d_2)$$ into $$(\mathbb{R}^m, d_2)$$ with distortion $$\leq 1 + \epsilon$$.

Recall, distortion means that if the distance between any two points in the original space is $$\delta$$, then in the new space, the distance is $$(1 \pm \epsilon)\delta$$.

Proof: Let $$A \in \mathbb{R}^{m \times d}$$ be a matrix with entries chosen i.i.d. from $$\mathcal{N}(0, 1/m)$$. Define $$f(x) = Ax$$. WLOG, because an isotropic Gaussian is rotationally symmetric, for any two $$x, y$$ we can assume $$x - y = ||x - y||_2 e_1$$, where $$e_1$$ is the first canonical basis vector. Our goal is to show $$||A (x - y)||_2 = (1 \pm \epsilon) ||x - y||_2 \Leftrightarrow ||A e_1||_2 = 1 \pm \epsilon$$.

Now, we just need to consider the first column of $$A$$ in our rotated system. We can use a Chernoff-style claim: if $$Z_1, ..., Z_m \sim \mathcal{N}(0, 1)$$, then $$\mathbb{P}[\sum_i Z_i^2 > (1+\epsilon)m] \leq \exp(-m\epsilon^2/8)$$. Since $$||Ae_1||_2^2 = \frac{1}{m}\sum_i Z_i^2$$ for standard normals $$Z_i$$, for any pair $$x, y$$, the probability that $$||A(x - y)||_2$$ is distorted by more than $$\epsilon$$ decays exponentially in $$m\epsilon^2$$. By a union bound, we have:

$\mathbb{P}[\exists x, y \in X \text{ s.t. } ||A(x-y)||_2 \geq (1+\epsilon) ||x - y||_2] \leq n^2 \exp(-m\epsilon^2/8)$

Choose $$m = O(\log(n)/\epsilon^2)$$ and we’re good. A similar argument holds for the lower bound $$1-\epsilon$$.
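A minimal numpy sketch of the construction (the constant 8 mirrors the tail bound above; the dimensions and names are illustrative):

```python
import numpy as np

def jl_embed(X, eps, seed=0):
    # X: (n, d) array of points. Returns the points mapped to
    # m = O(log(n) / eps^2) dimensions by a random Gaussian matrix.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))  # entries ~ N(0, 1/m)
    return X @ A.T  # applies f(x) = Ax to every row

# Sanity check: a pairwise distance should be preserved to within 1 +/- eps.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1000))
Y = jl_embed(X, eps=0.5)
orig = np.linalg.norm(X[0] - X[1])
new = np.linalg.norm(Y[0] - Y[1])
print(new / orig)  # typically lands in [0.5, 1.5]
```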