Oftentimes, we have a set with some notion of distance between elements, i.e. a metric space, and we want to embed that metric space inside another, either for easier analysis or for visualization. For example, we may want to convert a graph metric space into an \(\ell_p\) metric space.
Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\), a function \(f: X \rightarrow Y\) is an isometric embedding iff \(\forall x, x' \in X\):
\[d_Y(f(x), f(x')) = d_X(x, x')\]
Any finite metric space can always be isometrically embedded into \(\ell_{\infty}\). The embedding is as follows: given \((X, d)\) with \(|X| = n\), define \(f: X \rightarrow \mathbb{R}^n\) by
\[f(x_i) := [d(x_1, x_i), d(x_2, x_i), \ldots, d(x_n, x_i)]\]
This is called a Fréchet embedding, and it has the property that \(d(x_i, x_j) = d_{\infty}(f(x_i), f(x_j))\). However, an isometric embedding into a given target metric space doesn't always exist. For instance, a square graph (a 4-cycle with edge weights 1) cannot be isometrically embedded into \(\ell_2\). This raises the prospect of "low distortion" embeddings.
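As a quick sanity check, here is a minimal numpy sketch of the Fréchet embedding on the square graph just mentioned (using its shortest-path metric), verifying that \(\ell_\infty\) distances between embedded points reproduce the original distances exactly:

```python
import numpy as np

# Shortest-path distances of the square graph (4-cycle, unit edge weights).
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)

# Frechet embedding: x_i maps to its vector of distances to every point,
# which is just row i of D (rows equal columns here since D is symmetric).
F = D.copy()  # F[i] is f(x_i)

# Verify the isometry into l_inf: max_k |F[i,k] - F[j,k]| == D[i,j].
n = len(D)
for i in range(n):
    for j in range(n):
        assert np.max(np.abs(F[i] - F[j])) == D[i, j]
print("Frechet embedding is an isometry into l_inf")
```

The check succeeds for any finite metric: the \(i\)-th coordinate of \(f(x_i) - f(x_j)\) already equals \(|d(x_i, x_i) - d(x_i, x_j)| = d(x_i, x_j)\), and no coordinate can exceed \(d(x_i, x_j)\) by the triangle inequality.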
Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\), a function \(f: X \rightarrow Y\) is an embedding of \(X\) into \(Y\) with distortion \(\alpha\) and scaling \(\beta\) if \(\forall x, x' \in X\):
\[\beta d_X(x, x') \leq d_Y(f(x), f(x')) \leq \alpha \beta d_X(x, x')\]
Given any finite metric space \((X, d)\) with \(|X| = n\), there exists an embedding of \((X, d)\) into \((\mathbb{R}^k, d_1)\) where \(k = O(\log^2 n)\) (this also works for \(\ell_p\)) with distortion \(O(\log n)\). This is Bourgain's embedding, constructed as follows.
For each \(i = 1, \ldots, \log n\) and \(j = 1, \ldots, c \log n\), draw a random set \(S_{ij} \subseteq X\) by including each point of \(X\) independently with probability \(2^{-i}\). Define \(f(x) = [d(x, S_{11}), d(x, S_{12}), \ldots, d(x, S_{\log n, c\log n})]\), where \(d(x, S) = \min_{s \in S} d(x, s)\).
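Here is a short numpy sketch of this construction, reusing the distance matrix `D` from the snippet above; the constant `c`, the fixed seed, and the convention \(d(x, \emptyset) = 0\) for the occasional empty sample are illustrative choices, not part of the theorem:

```python
import numpy as np

def bourgain_embedding(D, c=3, seed=0):
    """Sketch of Bourgain's embedding for a finite metric space given by
    its n x n distance matrix D. Coordinates are d(x, S_ij) for random
    sets S_ij, with i = 1..log n and j = 1..c*log n."""
    rng = np.random.default_rng(seed)
    n = len(D)
    log_n = max(1, int(np.ceil(np.log2(n))))
    coords = []
    for i in range(1, log_n + 1):
        for _ in range(c * log_n):
            # Each point joins S_ij independently with probability 2^{-i}.
            S = rng.random(n) < 2.0 ** (-i)
            if S.any():
                coords.append(D[:, S].min(axis=1))  # d(x, S_ij) for all x
            else:
                coords.append(np.zeros(n))  # convention: d(x, empty set) = 0
    return np.stack(coords, axis=1)  # row k is f(x_k)

# Empirical distortion over all pairs of the square graph:
F = bourgain_embedding(D)
ratios = [np.abs(F[x] - F[y]).sum() / D[x, y]
          for x in range(len(D)) for y in range(x + 1, len(D))]
print("empirical distortion:", max(ratios) / min(ratios))
```

The printed ratio is an empirical distortion estimate; the theorem below bounds it by \(O(\log n)\) with high probability.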
In order to show that an embedding is low-distortion, we need to show that distances neither expand nor contract by too much. Let's first consider expansion. Consider \(f_S(x) := d(x, S)\), i.e. the distance from \(x\) to the point in the set \(S\) nearest to \(x\). This map is non-expanding:
\[\forall x, y \in X, \quad |f_S(x) - f_S(y)| \leq d(x, y)\]
Thus each coordinate of Bourgain's embedding is non-expanding; summing over all \(k\) coordinates gives the upper bound \(\|f(x) - f(y)\|_1 \leq k\, d(x, y)\) below. Next, we argue that Bourgain's embedding isn't too "shrinking," i.e. that \(|f_S(x) - f_S(y)| \geq \Delta\) for a suitable \(\Delta > 0\) in enough coordinates. The argument consists of two parts: first, we assume a particular picture and argue that under this picture the embedding isn't too shrinking; second, we argue that this picture is highly likely.
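This 1-Lipschitz property can be spot-checked exhaustively on a small example, say over every nonempty subset \(S\) of the square graph's metric `D` from above:

```python
import numpy as np
from itertools import combinations

n = len(D)
for mask in range(1, 2 ** n):                # every nonempty subset S of X
    S = [k for k in range(n) if mask >> k & 1]
    dS = D[:, S].min(axis=1)                 # d(x, S) for every point x
    for x, y in combinations(range(n), 2):
        assert abs(dS[x] - dS[y]) <= D[x, y] + 1e-12
print("x -> d(x, S) never expands distances")
```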
First, picture a ball around \(x\) and another ball around \(y\) with slightly different radii. Call the balls \(B_x := \{z \in X: d(z, x) \leq \delta \}\) and \(B_y := \{z \in X: d(z, y) \leq \delta + \Delta\}\). Imagine that the set \(S\) intersects \(B_x\) but not \(B_y\). In this picture, we know that \(d(x, S) \leq \delta\), since \(S\) intersects \(B_x\), and we know that \(d(y, S) > \delta + \Delta\), since \(S\) misses \(B_y\). Consequently:
\[d(y, S) - d(x, S) \geq \delta + \Delta - \delta = \Delta\]
This means that this coordinate of the embedding retains at least \(\Delta\) of the original distance between \(x\) and \(y\). By constructing Bourgain's embedding in the above manner, this picture turns out to occur with high probability. The inner loop over \(j\) gives many independent tries at this good picture, while the outer loop over \(i\) tries to find the right density for \(S\), which depends on \(d(x, y)\) and how the data clusters around \(x\) and \(y\).
Bourgain’s Theorem: Let \(k = c \log^2 n\) be the number of sets \(S_{ij}\). Let \(f\) be Bourgain’s embedding. There exists a constant \(b\) such that \(\forall x, y \in X\):
\[\frac{k}{b \log n} d(x, y) \leq \|f(x) - f(y)\|_1 \leq k\, d(x, y)\]
Proof: Starting with the upper bound, it's enough to show that for every \(S \subseteq X\):
\[|d(x, S) - d(y, S)| \leq d(x, y)\]
If that were true, then:
\[\|f(x) - f(y)\|_1 = \sum_{ij} |d(x, S_{ij}) - d(y, S_{ij})| \leq k\, d(x, y)\]
The proof is via the triangle inequality: letting \(s_y\) be the point of \(S\) nearest to \(y\), we have \(d(x, S) \leq d(x, s_y) \leq d(x, y) + d(y, s_y) = d(x, y) + d(y, S)\), and by symmetry \(d(y, S) \leq d(x, y) + d(x, S)\); together these give \(|d(x, S) - d(y, S)| \leq d(x, y)\). Turning next to the lower bound, we want to show:
\[\frac{k}{b \log n} d(x, y) \leq \|f(x) - f(y)\|_1\]
Recall that in our good picture, \(|d(x, S) - d(y, S)| \geq \Delta\). We now choose a sequence of radii and show that the good picture occurs at each scale with decent probability. Fix \(x, y \in X\). With the convention \(\delta_0 = 0\), choose \(0 < \delta_1 < \delta_2 < \ldots\) such that \(\delta_i\) is the smallest value for which \(B(x; \delta_i) := \{ z \in X : d(x, z) \leq \delta_i \}\) and \(B(y; \delta_i)\) both contain at least \(2^i\) points. We keep going as long as \(\delta_i < d(x, y) / 3\); let \(t\) be the last such index and set \(\delta_{t+1} := d(x, y) / 3\). There are two subclaims: in the good picture at scale \(i\), \(|d(x, S_{ij}) - d(y, S_{ij})| \geq \delta_{i+1} - \delta_i\); and this picture is decently likely.
Why is the good picture decently likely? By minimality of \(\delta_{i+1}\), fewer than \(2^{i+1}\) points lie in the open ball \(B^\circ(x, \delta_{i+1})\), or the same holds for \(B^\circ(y, \delta_{i+1})\); WLOG assume it holds for \(y\). By definition of \(\delta_i\), at least \(2^{i}\) points lie in \(B(x, \delta_i)\). Since each point lands in \(S_{ij}\) independently with probability \(2^{-i}\), and the two balls are disjoint (both radii are at most \(d(x, y)/3\)), the probability of the good picture factors into the probability that \(S_{ij}\) intersects \(B(x, \delta_i)\) times the probability that \(S_{ij}\) misses \(B^\circ(y, \delta_{i+1})\):
\[\mathbb{P}[\text{good picture}] = \mathbb{P}[S_{ij} \cap B(x, \delta_i) \neq \emptyset] \cdot \mathbb{P}[S_{ij} \cap B^\circ(y, \delta_{i+1}) = \emptyset] \geq \left(1 - (1 - 2^{-i})^{2^i}\right) (1 - 2^{-i})^{2^{i+1}} \geq 2^{-5}\]
Fix \(i\). The probability that at least a \(2^{-6}\) fraction of the sets \(S_{i1}, \ldots, S_{i, c\log n}\) produce the good picture is
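The last inequality is a one-liner to spot-check numerically across scales (a quick sketch; \(i = 0\) is excluded since the construction uses \(i \geq 1\)):

```python
# Check (1 - (1 - 2^{-i})^{2^i}) * (1 - 2^{-i})^{2^{i+1}} >= 2^{-5} for i >= 1.
for i in range(1, 21):
    p = 2.0 ** (-i)
    hit_Bx  = 1 - (1 - p) ** (2 ** i)        # P[S_ij meets B(x, delta_i)]
    miss_By = (1 - p) ** (2 ** (i + 1))      # P[S_ij misses B(y, delta_{i+1})]
    assert hit_Bx * miss_By >= 2.0 ** (-5), i
```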
\[1 - \mathbb{P}\left[\sum_{j=1}^{c \log n} \mathbb{I}[\text{good picture}] \leq \frac{1}{2} \cdot \frac{c \log n}{2^5}\right] \geq 1 - \exp\left(-\frac{c \log n}{8 \cdot 2^5}\right) \geq 1 - n^{-3}\]
if we choose \(c \geq 3 \cdot 2^8\) (the middle inequality is a Chernoff bound). Putting it together: when the good picture occurs for \(S_{ij}\), \(|d(x, S_{ij}) - d(y, S_{ij})| \geq \delta_{i+1} - \delta_i\), and for each \(i\), the probability that at least \(c \log n / 2^6\) of the \(S_{ij}\) produce it is at least \(1 - n^{-3}\). Then, by a union bound over all \(n^2\) pairs \((x, y)\) and the \(\leq \log n\) choices of \(i\), with high probability, \(\forall i, x, y\), at least \(c \log n / 2^6\) of the \(S_{ij}\) have \(|d(x, S_{ij}) - d(y, S_{ij})| \geq \delta_{i+1} - \delta_i\), meaning:
\[\begin{align*} \|f(x) - f(y)\|_1 &= \sum_{i,j} |d(x, S_{ij}) - d(y, S_{ij})| \geq \sum_{i=0}^{t} \frac{c \log n}{2^6} (\delta_{i+1} - \delta_i)\\ &= \frac{c \log n}{2^6} \delta_{t+1}\\ &= \frac{c \log n}{3 \cdot 2^6} d(x, y) \end{align*}\]
Recalling that \(k = c \log^2 n\), we have
\[\|f(x) - f(y)\|_1 \geq \frac{k}{3 \cdot 2^6 \log n} d(x, y)\]
so \(b = 3 \cdot 2^6\) works. Bourgain's embedding says we can embed any metric space into \(\ell_1\) with distortion \(O(\log n)\). One naturally asks next whether a better embedding is possible. In general, the answer is no, but for more structured metric spaces, the answer is yes. The Johnson-Lindenstrauss (JL) Lemma tells us that if we have \(n\) points that live in an \(\ell_2\) space of dimension \(d\), then we can embed those points into dimension \(O(\frac{\log n}{\epsilon^2})\) with distortion \(1 + \epsilon\). Formally:
Theorem: Given \(\epsilon \in (0, 1)\) and a set \(X \subset \mathbb{R}^d\) with \(|X| = n\), there exists a linear map \(f: \mathbb{R}^d \rightarrow \mathbb{R}^m\) where \(m = O(\frac{\log n}{\epsilon^2})\) that embeds \((X, d_2)\) into \((\mathbb{R}^m, d_2)\) with distortion \(\leq 1 + \epsilon\).
Recall, distortion means that if the distance between any two points in the original space is \(\delta\), then in the new space, the distance is \((1 \pm \epsilon)\delta\).
Proof: Let \(A \in \mathbb{R}^{m \times d}\) be a matrix with entries chosen i.i.d. from \(\mathcal{N}(0, 1/m)\). Define \(f(x) = Ax\). WLOG, because an isotropic Gaussian is rotationally symmetric, for any two points \(x, y\), we can think of \(x - y = \|x - y\|_2 e_1\), where \(e_1\) is the first canonical basis vector. Our goal is to show \(\|A (x - y)\|_2 = (1 \pm \epsilon) \|x - y\|_2 \Leftrightarrow \|A e_1\|_2 = 1 \pm \epsilon\).
Now, we just need to consider the first column of \(A\) in our rotated system. We can use a Chernoff-style claim: if \(Z_1, \ldots, Z_m \sim \mathcal{N}(0, 1)\), then \(\mathbb{P}[\sum_i Z_i^2 > (1+\epsilon)m] \leq \exp(-m \epsilon^2 / 8)\). Since \(\|A e_1\|_2^2 = \frac{1}{m}\sum_i Z_i^2\), for any pair \(x, y\), the probability that \(\|A(x - y)\|_2\) is distorted by more than \(\epsilon\) decays exponentially in \(m\epsilon^2\). By a union bound, we have:
\[\mathbb{P}[\exists x, y \in X \text{ s.t. } \|A (x-y)\|_2 \geq (1+\epsilon) \|x - y\|_2] \leq n^2 \exp(-m\epsilon^2/8)\]
Choose \(m = O(\log(n)/\epsilon^2)\) with a large enough constant and we're good. A similar argument handles the lower bound \(1-\epsilon\).
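A minimal numpy sketch of the whole argument (the constant in \(m\), the dimensions, and the seed are illustrative choices): draw a Gaussian \(A\), embed some points, and measure the worst pairwise distortion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 1000, 0.5
m = int(8 * np.log(n) / eps ** 2)            # m = O(log n / eps^2)

X = rng.standard_normal((n, d))              # n arbitrary points in R^d
A = rng.standard_normal((m, d)) / np.sqrt(m) # entries i.i.d. N(0, 1/m)
Y = X @ A.T                                  # f(x) = Ax for every point

# Worst multiplicative distortion over all pairs.
worst = max(abs(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]) - 1)
            for i in range(n) for j in range(i + 1, n))
print(f"{d} dims -> {m} dims; worst distortion {worst:.3f} (target eps = {eps})")
```

Note that \(A\) is drawn without looking at \(X\): the JL map is linear and data-oblivious, which is what makes it convenient as a preprocessing step.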