Recently, diffusion models have demonstrated remarkable performance on a range of generative tasks, from unconditional and text-guided image generation, audio synthesis, natural language generation, and human motion synthesis to 3D point cloud generation, most of which were previously dominated by GAN-based models. Research has also shown that diffusion models can considerably outperform GANs in generation quality while allowing control over the generated results [1]. In this blog, we will go through the mathematics and intuition behind this compelling generative model.
The underlying principle of diffusion models is based on a well-known idea from non-equilibrium statistical thermodynamics: using a Markov chain to gradually convert one distribution into another [2]. Intuitively, the name "diffusion" comes from the fact that data points in the original data space are gradually diffused away from their positions until they become pure noise. Unlike common VAE or GAN models, where the data is typically compressed into a low-dimensional latent space, diffusion models are learned with a fixed procedure and use high-dimensional latent variables with the same dimensionality as the original data (see Figure 1). Now let's define the forward diffusion process.
Given a sample from the real data distribution $x_0 \sim q(x_0)$, we define the forward diffusion process as a Markov chain that gradually adds Gaussian noise (with diagonal covariance) to the data according to a pre-defined schedule over $t = 1, 2, \dots, T$ steps. Specifically, the distribution of the noised sample at step $t$ depends only on the previous step $t-1$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right) \tag{1}$$
where $\beta_t$ is the noise schedule controlling the step size, i.e., how quickly noise is added at each step. Typically, the values lie in $(0, 1)$ and satisfy:
$$0 < \beta_1 < \beta_2 < \dots < \beta_T < 1$$
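To make the forward kernel concrete, here is a minimal NumPy sketch of a single step under a simple linear schedule; the endpoints $10^{-4}$ and $0.02$ and the choice $T = 1000$ are illustrative assumptions, not values mandated by the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Linear noise schedule from beta_1 to beta_T (illustrative endpoints).
betas = np.linspace(1e-4, 0.02, T)           # betas[t - 1] holds beta_t

def forward_step(x_prev, t):
    """One forward step: sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)  (Eq. 1)."""
    beta_t = betas[t - 1]                    # t is 1-indexed, as in the text
    eps = rng.standard_normal(x_prev.shape)  # epsilon ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps
```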
An interesting property of the forward process, which is what makes diffusion such a powerful generative framework, is that when the noise step sizes are small enough and $T \to \infty$, the distribution at the final step $x_T$ approaches a standard Gaussian distribution $\mathcal{N}(0, \mathbf{I})$. In other words, we can easily sample from this pure-noise distribution and then follow a reverse process to reconstruct a realistic data sample, which we will discuss later.
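As a quick sanity check of this property (continuing the sketch above, with a made-up, far-from-Gaussian starting distribution), pushing a batch of samples through all $T$ steps should leave them approximately standard normal:

```python
# Start from an arbitrary, clearly non-Gaussian "dataset" (made up for illustration).
x = rng.uniform(5.0, 10.0, size=(10_000, 2))
for t in range(1, T + 1):
    x = forward_step(x, t)
print(x.mean(), x.std())   # both should be close to 0 and 1, i.e. x_T is roughly N(0, I)
```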
Another notable property of this process is that we can directly sample $x_t$ at any time step $t$ without going through the whole chain. Specifically, let $\alpha_t = 1 - \beta_t$. From Eq. 1, we can derive the sample using the reparameterization trick ($x \sim \mathcal{N}(\mu, \sigma^2) \Leftrightarrow x = \mu + \sigma\epsilon$) as follows:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon}_{t-2} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{aligned}
\tag{2}
$$
where $\epsilon, \epsilon_{t-1}, \epsilon_{t-2}, \dots \sim \mathcal{N}(0, \mathbf{I})$ and $\bar{\epsilon}_{t-2}$ merges the two noise terms. Note that when we add two Gaussians with variances $\sigma_1^2$ and $\sigma_2^2$, the result is also a Gaussian and has variance $\sigma_1^2 + \sigma_2^2$. Therefore, the merged variance (the sum under the square root on the right-hand side of the second line above) is:

$$\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$$
where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. We can also rewrite the forward diffusion as a distribution conditioned directly on $x_0$:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right) \tag{3}$$
This nice property makes training much more efficient, since we only need to compute the loss at randomly sampled timesteps $t$ instead of running through the whole chain, as we will see later.
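In code, this closed form means a single draw per training example. The sketch below continues the NumPy example from above (reusing `betas`, `T`, and `rng`); the commented usage line is only illustrative, since `x0` would be a real training example.

```python
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # alpha_bar_t = prod_{i=1}^{t} alpha_i

def sample_xt(x0, t):
    """Directly sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)  (Eq. 3)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)      # the reparameterized noise from Eq. 2
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Pick a random timestep per sample instead of simulating the whole chain:
t = int(rng.integers(1, T + 1))
# x_t = sample_xt(x0, t)
```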
We have seen that the forward diffusion essentially pushes a sample off the data manifold and slowly transforms it into noise. The main goal of the reverse process is to learn a trajectory back to the manifold, i.e., to reconstruct samples from the original data distribution starting from noise.
Feller [3] showed that for Gaussian diffusion processes with sufficiently small step sizes $\beta_t$, the reverse of the diffusion process has the same functional form as the forward process. In other words, since $q(x_t \mid x_{t-1})$ is a Gaussian distribution, $q(x_{t-1} \mid x_t)$ will also be a Gaussian distribution when $\beta_t$ is small.
Unfortunately, estimating the reverse distribution $q(x_{t-1} \mid x_t)$ directly is very difficult: in the early steps of the reverse process (when $t$ is near $T$), our sample is essentially high-variance noise, and there are far too many possible trajectories from that noise back to the original data $x_0$; evaluating them would require integrating over the entire data distribution. Our final goal is therefore to learn a model that approximates these reverse distributions in order to build the generative reverse denoising process, specifically:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
where $\mu_\theta(x_t, t)$ is the mean predicted by the model (a neural network with parameters $\theta$), and $\Sigma_\theta(x_t, t)$ is the variance, which is typically kept fixed rather than learned for more stable and efficient training.
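A single reverse (denoising) step might look like the sketch below, continuing the earlier NumPy snippets. Here `mean_model` is a hypothetical callable standing in for the neural network that outputs $\mu_\theta(x_t, t)$, and fixing the variance to $\sigma_t^2 = \beta_t$ is just one common choice, not the only option.

```python
def reverse_step(mean_model, x_t, t):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I).

    `mean_model(x_t, t)` is a hypothetical network predicting mu_theta(x_t, t);
    the variance is kept fixed to sigma_t^2 = beta_t instead of being learned.
    """
    mu = mean_model(x_t, t)
    if t == 1:                               # by convention, no noise is added at the final step
        return mu
    sigma = np.sqrt(betas[t - 1])
    return mu + sigma * rng.standard_normal(x_t.shape)
```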
Fortunately, by conditioning on $x_0$, the posterior of the reverse process becomes tractable and is a Gaussian distribution [4]. Our target now is to find the posterior mean $\tilde{\mu}_t(x_t, x_0)$ and the variance $\tilde{\beta}_t$. Following Bayes' rule, we can write:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$
Here $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$ because the process is a Markov chain, in which the future state ($x_t$) is conditionally independent of all earlier states ($x_0, \dots, x_{t-2}$) given the present state ($x_{t-1}$). We also leverage the fact that every distribution on the right-hand side is Gaussian, which means we can plug in the density function ($\mathcal{N}(x; \mu, \sigma^2) \propto \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$) as follows:

$$
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &\propto \exp\!\left(-\frac{1}{2}\left(\frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}\, x_0)^2}{1-\bar{\alpha}_t}\right)\right) \\
&= \exp\!\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,x_0\right)x_{t-1} + C(x_t, x_0)\right)\right)
\end{aligned}
$$

where $C(x_t, x_0)$ collects the terms that do not involve $x_{t-1}$.
Comparing this with the expanded Gaussian density $\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \exp\!\left(-\frac{1}{2}\left(\frac{1}{\sigma^2}x^2 - \frac{2\mu}{\sigma^2}x + \frac{\mu^2}{\sigma^2}\right)\right)$, we can see the similarities. Recall that $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i = \alpha_t \bar{\alpha}_{t-1}$ and $\alpha_t + \beta_t = 1$. Matching the coefficient of $x_{t-1}^2$ (the variance term), we obtain the posterior variance:

$$\tilde{\beta}_t = \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)^{-1} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Similarly, matching the coefficient of $x_{t-1}$ (the mean term) gives the posterior mean:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0$$
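For reference, these two closed-form quantities can be computed directly from the noise schedule. The helper below is a small sketch that simply evaluates the formulas above, reusing `alphas`, `betas`, and `alpha_bars` from the earlier snippets and treating $\bar{\alpha}_0 = 1$ as a boundary convention.

```python
def posterior_mean_variance(x0, x_t, t):
    """Return (mu_tilde_t, beta_tilde_t) of q(x_{t-1} | x_t, x_0) for a 1-indexed step t."""
    a_t, b_t = alphas[t - 1], betas[t - 1]
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0   # alpha_bar_0 := 1
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * b_t
    mu_tilde = (np.sqrt(a_t) * (1.0 - a_bar_prev) * x_t
                + np.sqrt(a_bar_prev) * b_t * x0) / (1.0 - a_bar_t)
    return mu_tilde, beta_tilde
```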
So far, we have covered the core concepts behind the two essential constituents of diffusion models: the forward process and the reverse process. In the next post, we will go into detail about the training loss as well as the sampling algorithm of diffusion models.
[1] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. In NeurIPS 2021.
[2] Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML 2015.
[3] Feller W. On the theory of stochastic processes, with particular reference to applications. In Selected Papers I 2015 (pp. 769-798). Springer, Cham.
[4] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In NeurIPS 2020.