For a given probability distribution \(p(x; \theta)\), the score function is defined as the gradient of the log density:
\[s(x) := \nabla_x \log p(x; \theta)\]
More probable samples can be generated by starting at an arbitrary point \(x(0)\) and performing gradient ascent on the log density via the score function:
\[x(t + dt) = x(t) + \alpha \, s(x(t))\]
How does one train a score-based diffusion model?
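As a minimal sketch of this ascent step, consider a standard normal \(p(x) = N(0, I)\), whose score is known analytically: \(s(x) = -x\). The step size and iteration count below are illustrative choices, not prescribed by the text.

```python
import numpy as np

# Score of a standard normal N(0, I): s(x) = ∇_x log p(x) = -x.
def score(x):
    return -x

alpha = 0.1                 # illustrative step size
x = np.array([5.0, -3.0])   # arbitrary starting point x(0)
for _ in range(100):
    x = x + alpha * score(x)  # x(t + dt) = x(t) + α s(x(t))
# x has moved toward the mode at the origin, the most probable point
```

Each step moves \(x\) uphill on \(\log p\), so the iterates converge to the mode of the distribution.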
The Fisher divergence between two distributions \(p(x)\) and \(q(x)\) is defined as:
\[D_F(p, q) := \frac{1}{2} \mathbb{E}_{x \sim p} \Big[ \big\lVert \nabla_x \log p(x) - \nabla_x \log q(x) \big\rVert_2^2 \Big]\]
Score matching trains the model by minimizing the Fisher divergence between the true score function and the score function learned by the model.
One approach to training a score-based diffusion model is denoising score matching. The idea is to add small perturbative noise to the data and train a network to remove the noise.
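The denoising objective can be sketched as follows: perturb each datum with Gaussian noise of scale \(\sigma\) and regress the model's score at the noisy point onto the score of the perturbation kernel, \(\nabla_{\tilde{x}} \log q(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2\). The function names and the toy analytic "model" below are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Denoising score-matching loss: regress the model's score at the
# noisy point x̃ = x + σ·ε onto ∇_x̃ log q(x̃ | x) = -(x̃ - x) / σ².
def dsm_loss(score_model, x, sigma, noise):
    x_noisy = x + sigma * noise
    target = -(x_noisy - x) / sigma**2
    return 0.5 * np.mean((score_model(x_noisy) - target) ** 2)

# Toy check: data from N(0, 1) perturbed with sigma = 1 has noisy
# marginal N(0, 2), so the optimal score model is s(x̃) = -x̃ / 2.
x = rng.normal(size=10_000)
noise = rng.normal(size=x.shape)
loss_opt = dsm_loss(lambda z: -z / 2, x, 1.0, noise)
loss_bad = dsm_loss(lambda z: -z, x, 1.0, noise)
# the optimal model attains a strictly lower loss than the mismatched one
```

In practice `score_model` is a neural network and the loss is minimized over its parameters by stochastic gradient descent, typically averaged over a schedule of noise scales.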
TODO: Vincent 2010
TODO: Song et al. 2019
TODO
https://deepgenerativemodels.github.io/assets/slides/cs236_lecture12.pdf
Diffusion models can be “guided” after training by combining the diffusion model \(p_{\theta}(x)\) with a predictive model \(p_{\theta}(y|x)\) to generate samples from \(p_{\theta}(x|y)\). TODO: cite (Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021)
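By Bayes' rule, \(\nabla_x \log p(x|y) = \nabla_x \log p(x) + \nabla_x \log p(y|x)\), so the guided score is the diffusion model's score plus the gradient of the predictive model's log-likelihood. A minimal sketch, with illustrative function names and a common guidance-weight knob \(w\) (where \(w = 1\) recovers exact conditioning):

```python
import numpy as np

# Guided score: ∇_x log p(x | y) = ∇_x log p(x) + w · ∇_x log p(y | x).
def guided_score(score_fn, classifier_grad_fn, x, y, w=1.0):
    return score_fn(x) + w * classifier_grad_fn(x, y)

# Toy check: with p(x) = N(0, 1) and p(x | y=1) = N(1, 1), Bayes' rule
# gives ∇_x log p(y=1 | x) = 1 (a constant), so the guided score at
# x = 0 is -0 + 1 = 1, matching ∇_x log N(1, 1) evaluated at 0.
s = guided_score(lambda x: -x, lambda x, y: np.ones_like(x), np.array([0.0]), 1)
```

At sampling time this guided score simply replaces the unconditional score inside the usual reverse-diffusion update.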
TODO: Ho & Salimans, 2022
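Classifier-free guidance (the Ho & Salimans, 2022 reference above) drops the separate predictive model: one network is trained both with and without conditioning, and at sampling time the two scores are extrapolated with a guidance weight \(w\) (where \(w = 0\) recovers the plain conditional score). A sketch, with illustrative names:

```python
import numpy as np

# Classifier-free guidance: extrapolate between the conditional and
# unconditional scores, s̃ = (1 + w)·s_cond − w·s_uncond.
def cfg_score(cond_score, uncond_score, w):
    return (1 + w) * cond_score - w * uncond_score

# Toy numeric check: (1.5 · 2.0) − (0.5 · 1.0) = 2.5.
s = cfg_score(np.array([2.0]), np.array([1.0]), w=0.5)
```

Larger \(w\) pushes samples toward regions the conditioning favors, typically trading diversity for fidelity.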
Du et al. (ICML 2023) studied how to reuse and compose diffusion models with one another, correcting errors in prior composition methods and proposing a new energy-based parameterization for diffusion models.
Blattmann, Rombach et al. (NeurIPS 2022) proposed Retrieval-Augmented Diffusion Models. The idea is to combine relatively small diffusion or autoregressive models with a separate image database and a retrieval strategy. During training, nearby neighbors are retrieved from the database, and the model is conditioned on these informative examples. The database can then be swapped out at test time in a manner that transfers well to new tasks, e.g., class-conditional synthesis, zero-shot stylization, or text-to-image synthesis. The authors use CLIP to embed images and text for retrieval, and feed the retrieved CLIP-embedded data into the diffusion or autoregressive network via a cross-attention mechanism.