
Diffusion Models

Score-Based Diffusion Models

For a given probability distribution \(p(x; \theta)\), the score function is defined as the gradient of the log density with respect to the data \(x\) (not the parameters \(\theta\)):

\[s(x) := \nabla_x \log p(x; \theta)\]
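As a quick sanity check (a minimal PyTorch sketch, assuming a standard Gaussian density, whose analytic score is \(-x\)), the score can be computed by automatic differentiation:

```python
# Computing the score of a known density by automatic differentiation.
# For a standard Gaussian N(0, I), the analytic score is grad_x log p(x) = -x.
import math
import torch

x = torch.tensor([0.5, -1.0], requires_grad=True)
log_p = -0.5 * (x ** 2).sum() - 0.5 * x.numel() * math.log(2 * math.pi)
log_p.backward()
print(x.grad)   # tensor([-0.5000, 1.0000]), i.e. -x
```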

More probable samples can be generated by starting at some arbitrary point \(x(0)\) and performing gradient ascent on the log density via the score function:

\[x(t + dt) = x(t) + \alpha \, s(x(t))\]
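For concreteness, here is a minimal sketch of this update rule, using the analytic Gaussian score above as a stand-in for a learned score network:

```python
import torch

def score(x):
    # Analytic score of a standard Gaussian N(0, I): grad_x log p(x) = -x.
    return -x

torch.manual_seed(0)
x = 5.0 * torch.randn(2)   # arbitrary starting point x(0)
alpha = 0.1                # step size

for _ in range(200):
    x = x + alpha * score(x)   # x(t + dt) = x(t) + alpha * s(x(t))

print(x)   # approaches the mode at the origin
```

Note that pure gradient ascent collapses onto a mode; in practice, Langevin dynamics adds Gaussian noise \(\sqrt{2\alpha}\, z_t\) at each step so that the iterates are distributed according to \(p(x)\) rather than concentrating at maxima.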

How does one train a score-based diffusion model?

Score Matching

The Fisher divergence between two distributions \(p(x)\) and \(q(x)\) is defined as:

\[D_F(p, q) := \frac{1}{2} \mathbb{E}_{x \sim p} \Big[ \big\lVert \nabla_x \log p(x) - \nabla_x \log q(x) \big\rVert_2^2 \Big]\]

Score matching trains the model by minimizing the Fisher divergence between the true score function and the model's learned score function \(s_{\theta}(x)\):

\[\mathcal{L}(\theta) := \frac{1}{2} \mathbb{E}_{x \sim p} \Big[ \big\lVert \nabla_x \log p(x) - s_{\theta}(x) \big\rVert_2^2 \Big]\]

Since the true score \(\nabla_x \log p(x)\) is unknown, this objective cannot be evaluated directly, which motivates the practical variants below.

One approach to training a score-based diffusion model is denoising score matching. The idea is to add small perturbative noise to the data and train a network to remove the noise; the optimal such network estimates the score of the perturbed data distribution.

Denoising Score Matching

TODO: Vincent 2010
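Below is a minimal PyTorch sketch of the denoising score-matching objective for a single fixed noise scale; `score_net` and `sigma` are hypothetical names, not from a specific codebase.

```python
# A minimal sketch of the denoising score-matching loss, assuming a single
# fixed Gaussian noise scale `sigma` and a hypothetical score network
# `score_net` mapping perturbed inputs to estimated scores.
import torch

def dsm_loss(score_net, x, sigma=0.1):
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise    # perturbed data
    target = -noise / sigma        # = grad log q(x_tilde | x)
    pred = score_net(x_tilde)      # model's estimate of the score
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()
```

The regression target \(-(\tilde{x} - x)/\sigma^2\) is the score of the Gaussian perturbation kernel \(q_{\sigma}(\tilde{x} \mid x)\); Vincent's result is that minimizing this loss is equivalent (up to a constant) to matching the score of the perturbed data distribution.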

Sliced Score Matching

TODO: Song et al. 2019

Noise Conditional Score Networks

Noise Contrastive Estimation

TODO

https://deepgenerativemodels.github.io/assets/slides/cs236_lecture12.pdf

Guidance a.k.a. Controllable Generation

Classifier Guidance

Diffusion models can be “guided” after training by combining the diffusion model \(p_{\theta}(x)\) with a predictive model \(p_{\theta}(y|x)\) to generate samples from \(p_{\theta}(x|y)\). TODO: cite (Sohl-Dickstein et al., 2015; Dhariwal & Nichol, 2021).
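The identity underlying this is a standard fact, stated here for completeness: by Bayes' rule, \(\log p(x|y) = \log p(x) + \log p(y|x) - \log p(y)\), and the evidence term is constant in \(x\), so the conditional score decomposes as

\[\nabla_x \log p_{\theta}(x|y) = \nabla_x \log p_{\theta}(x) + \nabla_x \log p_{\theta}(y|x)\]

Guided sampling therefore follows the unconditional score plus the classifier's gradient, typically scaled by a guidance weight.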

Classifier-Free Guidance

TODO: Ho & Salimans, 2022

Composition of Diffusion Models

Du et al. (ICML 2023) studied how to reuse and compose diffusion models with one another, correcting shortcomings in earlier composition methods and proposing a new energy-based parameterization for diffusion models.
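For intuition (a standard product-of-experts identity, not the paper's exact formulation): composing two models as a product multiplies densities, so the ideal composed score is simply the sum of the component scores,

\[\nabla_x \log \big( p_1(x)\, p_2(x) \big) = \nabla_x \log p_1(x) + \nabla_x \log p_2(x)\]

Du et al.'s observation is that naively plugging this sum into a standard reverse-diffusion sampler does not exactly sample from the product distribution, which motivates their MCMC-based samplers and energy-based parameterization.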

Retrieval-Augmented Diffusion Models

Blattmann, Rombach et al. (NeurIPS 2022) proposed Retrieval-Augmented Diffusion Models. The idea is to combine a relatively small diffusion or autoregressive model with a separate image database and a retrieval strategy. During training, nearest neighbors of each training example are retrieved from the database, and the model is conditioned on these informative examples. The database can then be swapped out at test time in a manner that transfers well to new tasks, e.g., class-conditional synthesis, zero-shot stylization, or text-to-image synthesis. The authors use CLIP to embed images and text for retrieval, and feed the retrieved CLIP-embedded data into the diffusion or autoregressive network via a cross-attention mechanism.
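A minimal sketch of the retrieval step (assuming precomputed CLIP embeddings; `db_embeddings`, `clip_embed`, and `cross_attention` are hypothetical stand-ins, not the paper's code):

```python
# Retrieve the k nearest neighbors of a query from a database of
# precomputed CLIP embeddings, by cosine similarity.
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_emb, db_embeddings, k=4):
    q = F.normalize(query_emb, dim=-1)        # (emb_dim,)
    db = F.normalize(db_embeddings, dim=-1)   # (num_db, emb_dim)
    sims = db @ q                             # cosine similarities, (num_db,)
    topk = sims.topk(k).indices
    return db_embeddings[topk]                # (k, emb_dim) conditioning set

# The retrieved embeddings then serve as the context for cross-attention
# inside the generative network, e.g. (schematically):
#   context = retrieve_neighbors(clip_embed(x), db_embeddings)
#   h = cross_attention(query=h, key=context, value=context)
```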

Connections to Other Topics