A sequential algorithm for fast fitting of Dirichlet process mixture models
Background
Previously, Wang and Dunson (2011) proposed a fast inference algorithm for Dirichlet process (DP) mixture models that could be applied in an online setting, fancifully called Sequential Updating and Greedy Search (SUGS). The authors didn't actually intend for their algorithm to run online: to remove the dependence on observation order, they recommended averaging over random orderings of the observations. SUGS makes inference easy by taking the MAP of each latent cluster assignment's posterior, deterministically placing each observation in a single cluster.
Overview
- Can we design a similarly fast inference algorithm for DP mixture models that doesn’t require collapsing the latent variable posterior at each time step?
- The authors propose a variational version of SUGS, called VSUGS, that likewise isn't truly online, but at least no longer requires the MAP approximation.
Conceptual Notes
- Goal: We’d like a posterior over cluster assignments \(z_{1:t}\) and cluster parameters \(\theta_{1:t}\) given observations \(x_{1:t}\)
- Idea:
  - Suppose at time \(t\) we have an approximate posterior from the previous time step:
    \[p(z_{1:t-1}, \theta_{1:t-1} \mid x_{1:t-1}) \approx \prod_{t' < t} q_{t-1}(z_{t'}) \prod_{t' < t} q_{t-1}(\theta_{t'})\]
  - In English: we assume we have a fully factorized variational approximation of the previous time step's joint posterior
  - We then extend the previous time step's variational posterior by assuming the current step takes the same form:
    \[\prod_{t' \leq t} q_{t}(z_{t'}) \prod_{t' \leq t} q_{t}(\theta_{t'})\]
- Challenge: How will we initialize the current time step's variational distributions \(\{q_t(z_{t'})\}\) and \(\{q_t(\theta_{t'})\}\)?
- Answer: For the old terms (\(t' < t\)), initialize from the previous time step's variational distributions \(q_{t-1}(z_{t'})\) and \(q_{t-1}(\theta_{t'})\)
- For the new terms, \(q_t(z_t)\) and \(q_t(\theta_t)\), first initialize \(q_t(z_t)\) as
  \[q_t(z_t = j) \propto \underbrace{q_{tj}}_{\text{Var. Latent Prior}} \int \underbrace{q_{t-1}(\theta_j)}_{\text{Var. Param Prior}} \underbrace{p(x_t \mid \theta_j)}_{\text{Obs. Likelihood}} d\theta_j\]
  where the prior weight \(q_{tj}\) is defined using the truncated CRP (with truncation level \(T\)):
  $$q_{tj} = \frac{1}{\alpha + t - 1} \begin{cases} \sum_{t'< t} q_{t-1}(z_{t'} = j) + \alpha/T & j \le (t-1) \wedge T \\ \alpha \left(1 - \frac{(t-1) \wedge T}{T}\right) & j = ((t-1) \wedge T) + 1 \end{cases}$$
  Sanity check: the soft counts sum to \(t-1\), and the \(\alpha/T\) terms plus the new-cluster mass sum to \(\alpha\), so the weights \(q_{tj}\) sum to one.
- Then update the parameter distributions using Equation 11 of the paper
- Challenge: How will we fit the \(q_t(z_{t'})\)?
- Answer: Unanswered in a purely online sense. Maximize the ELBO (equivalently, minimize the KL divergence to the true posterior). This requires access to the entire sequence.
- Challenge: How will we store a full distribution \(q_t(z_{t'})\) for all \(t' \leq t\)?
- Answer: Unanswered. Each \(q_t(z_{t'})\) is a distribution over up to \(\min(t', T)\) clusters, so I think the space complexity is \(O(tT)\) with truncation, i.e. it grows as \(O(t^2)\) if the truncation level grows with the number of samples \(t\)
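As a concrete sketch of the assignment step above, here is a minimal Python implementation for 1-D Gaussian observations with known variance and a conjugate Normal prior on cluster means, so the predictive integral \(\int q_{t-1}(\theta_j)\, p(x_t \mid \theta_j)\, d\theta_j\) has a closed form. This is my own toy version, not the paper's: the names (`vsugs_fit`), hyperparameter defaults, and the responsibility-weighted conjugate parameter update (standing in for the paper's Equation 11) are all assumptions.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x (vectorized over mean/var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def vsugs_fit(xs, alpha=1.0, T=10, obs_var=1.0, prior_mean=0.0, prior_var=10.0):
    """Sequential soft assignments q_t(z_t = j) for 1-D Gaussian data.

    Per cluster j we maintain a soft count sum_{t'<t} q(z_{t'} = j) and a
    Normal posterior N(m[j], v[j]) over the cluster mean (conjugate to the
    known-variance Gaussian likelihood).
    """
    counts = np.zeros(T + 1)          # soft counts per cluster
    m = np.full(T + 1, prior_mean)    # posterior means of cluster means
    v = np.full(T + 1, prior_var)     # posterior variances of cluster means
    assignments = []
    for t, x in enumerate(xs, start=1):
        k = min(t - 1, T)             # clusters that may already be occupied
        # Truncated-CRP prior weights q_{tj}; they sum to exactly 1.
        prior = np.empty(k + 1)
        prior[:k] = counts[:k] + alpha / T     # existing clusters
        prior[k] = alpha * (1 - k / T)         # one candidate new cluster
        prior /= alpha + t - 1
        # Predictive likelihood: integrating N(x | theta_j, obs_var) against
        # the Normal posterior on theta_j gives N(x | m[j], v[j] + obs_var).
        pred = normal_pdf(x, m[:k + 1], v[:k + 1] + obs_var)
        q = prior * pred
        q /= q.sum()
        # Responsibility-weighted conjugate update of each cluster posterior.
        for j in range(k + 1):
            prec = 1.0 / v[j] + q[j] / obs_var
            m[j] = (m[j] / v[j] + q[j] * x / obs_var) / prec
            v[j] = 1.0 / prec
        counts[:k + 1] += q
        assignments.append(q)
    return assignments
```

Note that each \(q_t(z_t)\) ranges over \(k+1\) components: the \(k = (t-1) \wedge T\) possibly occupied clusters plus one new-cluster slot. With well-separated data the mass of `q` concentrates on the matching cluster (or the new-cluster slot), while nearby points share mass softly, in contrast to the hard MAP placement SUGS would make.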
