“So we seem to find, in our selves, unity over time, unity of function, unity in space. Yet each of these unities is contingent. In pathology or in thought experiment, the integrity of our memories can be compromised, the connectivity of our brains can be split or fragmented, and our bodies can be broken apart and rebuilt. What then am I?”
by Kulis and Jordan (ICML 2012)
One limitation of k-means clustering is that the number of clusters must be specified in advance. Here, Kulis and Jordan show that a low-variance asymptotic analysis of a Dirichlet process Gaussian Mixture Model (DP-GMM) yields a k-means-like clustering algorithm that adds new clusters when justified by the data.
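To make that punchline concrete before the derivation, here is a minimal sketch of the kind of hard-clustering algorithm the paper arrives at (often called DP-means): each point is assigned to its nearest centroid unless its squared distance exceeds a penalty λ, in which case a new cluster is started at that point. The function and parameter names (`dp_means`, `lam`, `max_iters`) are my own illustration, not from the post.

```python
import numpy as np

def dp_means(X, lam, max_iters=100):
    """Hard clustering that opens a new cluster whenever a point is farther
    than sqrt(lam) from every existing centroid (a DP-means-style sketch).
    X is an (N, d) array; lam is the cluster penalty."""
    centroids = [X.mean(axis=0)]              # start with one global cluster
    assignments = np.zeros(len(X), dtype=int)

    for _ in range(max_iters):
        changed = False
        for n, x in enumerate(X):
            d2 = np.array([np.sum((x - mu) ** 2) for mu in centroids])
            if d2.min() > lam:                 # too far from every centroid:
                centroids.append(x.copy())     # open a new cluster at this point
                k = len(centroids) - 1
            else:
                k = int(d2.argmin())           # otherwise assign to the nearest one
            if assignments[n] != k:
                assignments[n] = k
                changed = True
        # M-step-like update: each centroid becomes the mean of its points.
        for k in range(len(centroids)):
            members = X[assignments == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
        if not changed:
            break
    return np.array(centroids), assignments
```

Here λ is a user-chosen cluster penalty; in the paper's low-variance limit it is tied to the DP concentration parameter α and the variance σ.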
We quickly review k-means clustering and how it can be derived from a low-variance asymptotic approximation of expectation maximization applied to a Gaussian mixture model. For a finite GMM with $N$ observations $\{x_n\}$, we want to infer cluster assignments $\{z_n\}$ and each cluster's Gaussian parameters $\{\mu_c, \Sigma_c\}$. Let $\pi_c$ denote the mixing proportion of the $c$th class. The marginal distribution of an observation is:
$$p(x_n) = \sum_{c=1}^{C} p(x_n \mid z_n = c)\, p(z_n = c) = \sum_{c=1}^{C} \mathcal{N}(x_n; \mu_c, \Sigma_c)\, \pi_c$$

The cluster assignments and cluster parameters can be learned using expectation maximization. Assuming the clusters each have the same spherical covariance $\sigma I$, the E step takes the following form:
$$p(z_n = c \mid x_n) = \frac{p(x_n \mid z_n = c)\, p(z_n = c)}{p(x_n)} = \frac{\exp\!\left(-\tfrac{1}{2\sigma}\|x_n - \mu_c\|_2^2\right)\pi_c}{\sum_{c'} \exp\!\left(-\tfrac{1}{2\sigma}\|x_n - \mu_{c'}\|_2^2\right)\pi_{c'}}$$

As $\sigma \to 0$, all of the mass is placed on the nearest centroid, just as in k-means clustering. The M step then updates each centroid (cluster mean) to the average of the points assigned to that cluster:
$$\mu_c = \frac{1}{|X_c|} \sum_{x_n \in X_c} x_n$$

where $X_c = \{x_n : p(z_n = c \mid x_n) = 1\}$ is the set of observations assigned to the $c$th cluster.
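To make the $\sigma \to 0$ limit concrete, here is a small sketch (my own illustration, not code from the post or paper) of these two steps as plain k-means: a hard E step that assigns each point to its nearest centroid, and an M step that recomputes each centroid as the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means: the sigma -> 0 limit of EM on an equal-covariance GMM."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data points

    for _ in range(max_iters):
        # Hard E step: p(z_n = c | x_n) collapses onto the nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # M step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([
            X[z == c].mean(axis=0) if np.any(z == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, z
```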
We can phrase this as an optimization problem with the objective function:
$$\min_{\{\mu_c\},\, \{z_n\}} \; \sum_{n=1}^{N} \sum_{c=1}^{C} \mathbb{I}(z_n = c)\, \|x_n - \mu_c\|_2^2$$

where, for fixed assignments, the optimal centroids are the cluster means given above.
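For comparison, the paper's low-variance asymptotics of the DP-GMM lead to a closely related objective that adds a penalty of $\lambda$ for every cluster used ($k$ denotes the number of clusters, and $\lambda$ is the same cluster penalty that appears in the DP-means-style sketch above); the sketch can be read as a greedy descent on it:

$$\min_{k,\, \{\mu_c\},\, \{z_n\}} \; \sum_{n=1}^{N} \sum_{c=1}^{k} \mathbb{I}(z_n = c)\, \|x_n - \mu_c\|_2^2 \;+\; \lambda k$$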
In a DP-GMM we instead allow an infinite number of clusters, enabled through the Dirichlet process. Specifically, we draw an infinite sequence of cluster centroids $\{\mu_c\}_{c=1}^{\infty}$ i.i.d. from a base distribution $G_0$ and an infinite sequence of mixing proportions $\{\pi_c\}_{c=1}^{\infty}$ from a stick-breaking process with concentration parameter $\alpha$, so that the random measure $G = \sum_{c=1}^{\infty} \pi_c \delta_{\mu_c}$ is distributed as $\mathrm{DP}(\alpha, G_0)$. Then, for each observation, we draw a cluster assignment from the mixing proportions and the observation itself from the corresponding Gaussian:
$$\mu_1, \mu_2, \ldots \sim G_0, \qquad \pi \sim \mathrm{GEM}(\alpha), \qquad z_n \sim \mathrm{Discrete}(\pi), \qquad x_n \sim \mathcal{N}(\mu_{z_n}, \sigma I)$$
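For intuition about this generative process, here is a small forward-sampling sketch. The truncation of the infinite stick-breaking weights, the choice of $G_0 = \mathcal{N}(0, I)$, the spherical covariance $\sigma I$, and all function and parameter names are illustrative assumptions of mine, not details from the post.

```python
import numpy as np

def sample_dp_gmm(n, alpha, sigma, dim=2, truncation=100, seed=0):
    """Forward-sample from a (truncated) stick-breaking DP-GMM.

    Illustrative assumptions: base measure G0 = N(0, I) over centroids,
    spherical observation covariance sigma * I, and a finite truncation
    of the infinite stick-breaking weights."""
    rng = np.random.default_rng(seed)

    # Stick-breaking: pi_c = beta_c * prod_{j < c} (1 - beta_j), beta_c ~ Beta(1, alpha).
    betas = rng.beta(1.0, alpha, size=truncation)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    pi = betas * sticks
    pi /= pi.sum()                      # renormalise the truncated weights

    # Atoms (cluster centroids) drawn i.i.d. from the base measure G0.
    mu = rng.normal(size=(truncation, dim))

    # Per-observation draws: z_n ~ Discrete(pi), x_n ~ N(mu_{z_n}, sigma * I).
    z = rng.choice(truncation, size=n, p=pi)
    X = mu[z] + np.sqrt(sigma) * rng.normal(size=(n, dim))
    return X, z, mu, pi
```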