K Means++ is an algorithm for the K Means problem that is identical to Lloyd’s Algorithm except that it uses a different initialization of the cluster centroids. By choosing the initial centroids intelligently, K Means++ gains performance guarantees and typically converges significantly faster.
The K Means++ algorithm initializes the centroids in the following manner:

1. Sample the first centroid uniformly at random from the data points.
2. For each point \(x\), compute \(D(x)\), the distance from \(x\) to the nearest centroid chosen so far.
3. Sample the next centroid from the data points with probability proportional to \(D(x)^2\).
4. Repeat steps 2 and 3 until \(k\) centroids have been chosen.

Then, K Means++ switches to Lloyd’s algorithm, starting with the sampled centroids.
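As a concrete illustration, here is a minimal NumPy sketch of the \(D^2\)-weighted seeding followed by a plain Lloyd’s loop. The function names, the `rng` argument, and the fixed `n_iters` stopping rule are illustrative choices, not part of any particular library.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """D^2-weighted seeding: sample k initial centroids from the rows of X."""
    rng = rng if rng is not None else np.random.default_rng()
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # step 1: first centroid, uniform over the data
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest centroid so far
        d2 = np.min(((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(-1), axis=1)
        # step 3: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centroids)

def lloyd(X, centroids, n_iters=100):
    """Plain Lloyd iterations from the given centroids (empty clusters not handled)."""
    for _ in range(n_iters):
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
    return centroids, labels

# Usage: centroids, labels = lloyd(X, kmeanspp_init(X, k=3))
```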
Overview: The approach will be to show (1) that if we could initialize one centroid in each cluster of the optimal clustering, the error would be bounded, and (2) that the above sampling procedure comes close to initializing one centroid in each optimal cluster.
Lemma 1: Let \(X\) be a set of points, and define \(\mu = \frac{1}{\lvert X \lvert} \sum_{x \in X} x\) as the mean. Let \(z\) be an arbitrary point. Then
\[\sum_x \lvert \lvert x - z \lvert \lvert^2 - \sum_x \lvert \lvert x - \mu \lvert \lvert^2 = \lvert X \lvert \, \lvert \lvert \mu - z \lvert \lvert^2\]Proof: We start from the LHS, letting \(\lvert X \lvert\) denote the cardinality of \(X\).
\[\begin{align*} &\sum_x \lvert \lvert x - z \lvert \lvert^2 - \sum_x \lvert \lvert x - \mu \lvert \lvert^2\\ &= \sum_x x^T x - 2x^T z + z^T z - (\sum_x x^T x - 2 x^T \mu + \mu^T \mu)\\ &= \lvert X \lvert z^T z - \lvert X \lvert \mu^T \mu + \sum_x -2 x^T z + 2 x^T \mu\\ &= \lvert X \lvert z^T z - \lvert X \lvert \mu^T \mu + \frac{\lvert X \lvert}{\lvert X \lvert}\sum_x -2 x^T z + 2 x^T \mu\\ &= \lvert X \lvert z^T z - \lvert X \lvert \mu^T \mu + \lvert X \lvert (-2 \mu^T z + 2 \mu^T \mu)\\ &= \lvert X \lvert (z^T z + \mu^T \mu - 2 \mu^T z)\\ &= \lvert X \lvert \, \lvert \lvert \mu - z \lvert \lvert^2 \end{align*}\]The key step substitutes \(\mu\) for \(\frac{1}{\lvert X \lvert} \sum_x x\). Lemma 2: Let \(A\) be an arbitrary cluster in \(C_{opt}\), and let \(C\) be the clustering with a single centroid, chosen uniformly at random from \(A\). Then
\[\mathbb{E}[\phi(A)] = 2 \phi_{opt}(A)\]Proof: Define the optimal centroid \(\mu_A := \frac{1}{\lvert A \lvert} \sum_{a \in A} a\).
\[\begin{align*} &\mathbb{E}[\phi(A)]\\ &= \sum_{a_0 \in A} p(a_0 \text{ is the initial centroid}) \, \phi(A)\\ &= \sum_{a_0 \in A} \frac{1}{\lvert A \lvert} \sum_{a \in A} \lvert \lvert a - a_0 \lvert \lvert^2\\ &= \frac{1}{\lvert A \lvert} \sum_{a_0 \in A} \Big(\lvert A \lvert \, \lvert \lvert a_0 - \mu_A \lvert \lvert^2 + \sum_{a \in A} \lvert \lvert a - \mu_A \lvert \lvert^2 \Big) \qquad \text{(by Lemma 1 with } z = a_0\text{)}\\ &= \sum_{a_0 \in A} \lvert \lvert a_0 - \mu_A \lvert \lvert^2 + \frac{1}{\lvert A \lvert} \sum_{a_0 \in A} \sum_{a \in A} \lvert \lvert a - \mu_A \lvert \lvert^2\\ &= 2 \sum_{a \in A} \lvert \lvert a - \mu_A \lvert \lvert^2\\ &= 2 \phi_{opt}(A) \end{align*}\]Intuitively, this says that if you use a random point of a set as its centroid in lieu of the optimal centroid, then in expectation the error is only twice the optimal error. Qualitatively, this makes sense: while some points will be closer to the randomly chosen point than to the ideal centroid, most of the other points will be farther away.
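As a quick numerical sanity check on both lemmas, the sketch below draws a random point set, verifies Lemma 1’s identity for an arbitrary \(z\), and verifies Lemma 2’s factor of two by averaging the cost over every possible choice of the single random centroid. The variable names are illustrative, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))     # an arbitrary cluster of points
mu = A.mean(axis=0)              # its optimal centroid
z = rng.normal(size=3)           # an arbitrary point

# Lemma 1: sum ||a - z||^2 - sum ||a - mu||^2 = |A| * ||mu - z||^2
lhs = ((A - z) ** 2).sum() - ((A - mu) ** 2).sum()
rhs = len(A) * ((mu - z) ** 2).sum()
assert np.isclose(lhs, rhs)

# Lemma 2: averaging phi(A) over every choice of a_0 in A as the single
# centroid gives exactly twice the optimal cost phi_opt(A).
phi_opt = ((A - mu) ** 2).sum()
expected_phi = np.mean([((A - a0) ** 2).sum() for a0 in A])
assert np.isclose(expected_phi, 2 * phi_opt)
```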