Generative Modeling by Estimating Gradients of the Data Distribution

This blog post focuses on a promising new direction for generative modeling. We can learn score functions (gradients of log probability density functions) on a large number of noise-perturbed data distributions, and then generate samples by Langevin-type sampling. The resulting generative models, often called score-based generative models (or diffusion probabilistic models), have several important advantages over existing model families: GAN-level sample quality without adversarial training, flexible model architectures, exact log-likelihood computation, uniquely identifiable representation learning, and inverse problem solving without re-training models. In this blog post, we will show you in more detail the intuition, basic concepts, and potential applications of score-based generative models.

Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions. (1) The first is likelihood-based models, which directly learn the distribution’s probability density (or mass) function via (approximate) maximum likelihood. Typical likelihood-based models include autoregressive models, normalizing flow models, energy-based models (EBMs), and variational auto-encoders (VAEs). (2) The second is implicit generative models, where the probability distribution is implicitly represented by a model of its sampling process. The most prominent example is generative adversarial networks (GANs), where new samples from the data distribution are synthesized by transforming a random Gaussian vector with a neural network.

Bayesian networks, Markov random fields (MRF), autoregressive models, and normalizing flow models are all examples of likelihood-based models. All these models represent the probability density or mass function of a distribution.

GAN is an example of implicit models. It implicitly represents a distribution over all objects that can be produced by the generator network.

Likelihood-based models and implicit generative models, however, both have significant limitations. Likelihood-based models either require strong restrictions on the model architecture to ensure a tractable normalizing constant for likelihood computation, or must rely on surrogate objectives to approximate maximum likelihood training. Implicit generative models, on the other hand, often require adversarial training, which is notoriously unstable and can lead to mode collapse.

In this blog post, I will introduce another way to represent probability distributions that may circumvent several of these limitations. The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. Such score-based models are not required to have a tractable normalizing constant, and can be directly learned by score matching.

Score function (the vector field) and density function (contours) of a mixture of two Gaussians.

Score-based models work best when trained on multiple noise-perturbed data distributions. When coupled with a perturbation process that converts data to noise, score-based models can reverse this noise perturbation process for sample generation, achieving state-of-the-art sample quality on many downstream tasks and applications. These tasks include, among others, image generation, audio synthesis (yes, sometimes better than GANs!), shape generation, and music generation. Moreover, when the noise perturbation process is given by a stochastic differential equation (SDE), it has connections to normalizing flow models, therefore allowing exact likelihood computation and representation learning. Additionally, modeling and estimating scores facilitates inverse problem solving, with applications such as image inpainting, image colorization, compressive sensing, and medical image reconstruction (e.g., CT, MRI).

1024 x 1024 samples generated from score-based models

This post aims to show you the motivation and intuition of score-based generative modeling, as well as its basic concepts, properties and applications.

The score function, score-based models, and score matching

Suppose we are given a dataset {x1,x2,⋯,xN}, where each point is drawn independently from an underlying data distribution p(x). Given this dataset, the goal of generative modeling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution.

In order to build such a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f.). Hereafter we only consider probability density functions; probability mass functions are similar. Let $f_\theta(\mathbf{x}) \in \mathbb{R}$ be a real-valued function parameterized by a learnable parameter $\theta$. We can define a p.d.f. via

$$p_\theta(\mathbf{x}) = \frac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}, \tag{1}$$

where $Z_\theta > 0$ is a normalizing constant dependent on $\theta$, such that $\int p_\theta(\mathbf{x})\, d\mathbf{x} = 1$. The function $f_\theta(\mathbf{x})$ is an unnormalized probabilistic model, or energy-based model.

We can train $p_\theta(\mathbf{x})$ by maximizing the log-likelihood of the data

$$\max_\theta \sum_{i=1}^N \log p_\theta(\mathbf{x}_i). \tag{2}$$

However, equation (2) requires $p_\theta(\mathbf{x})$ to be a normalized probability density function. This is cumbersome because in order to compute $p_\theta(\mathbf{x})$, we must evaluate the normalizing constant $Z_\theta$, a typically intractable quantity for any general $f_\theta(\mathbf{x})$. Thus to make maximum likelihood training feasible, likelihood-based models must either restrict their model architectures (e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models) such that $Z_\theta = 1$, or approximate the normalizing constant (e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence), which may be computationally expensive.

By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants. The score function of a distribution $p(\mathbf{x})$ is defined as $\nabla_{\mathbf{x}} \log p(\mathbf{x})$, and a model for the score function is called a score-based model, which we denote as $\mathbf{s}_\theta(\mathbf{x})$. The score-based model is learned such that $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$, and can be parameterized without worrying about the normalizing constant. For example, we can easily parameterize a score-based model with the energy-based model defined in equation (1), via

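$$\mathbf{s}_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} f_\theta(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}} \log Z_\theta}_{=0} = -\nabla_{\mathbf{x}} f_\theta(\mathbf{x}). \tag{3}$$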

Note that the score-based model $\mathbf{s}_\theta(\mathbf{x})$ is independent of the normalizing constant $Z_\theta$! This significantly expands the model family that we can choose from, since we don’t need any special architectures to make the normalizing constant tractable.
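As a small numerical check of equation (3), the sketch below uses a hypothetical one-dimensional energy function (the quartic `energy` is an illustrative choice, not something from this post) and differentiates the log-density by finite differences. Changing `log_z`, which stands in for the log of the normalizing constant, should leave the score unchanged.

```python
import math

def energy(x):
    # Hypothetical energy f_theta(x); any differentiable function would do.
    return 0.25 * x**4 - 0.5 * x**2

def log_density(x, log_z):
    # log p(x) = -f(x) - log Z, where log_z stands in for log Z_theta.
    return -energy(x) - log_z

def score_fd(x, log_z, h=1e-5):
    # Central finite difference of log p(x) with respect to x.
    return (log_density(x + h, log_z) - log_density(x - h, log_z)) / (2 * h)

def score_exact(x):
    # s(x) = -df/dx = -(x^3 - x); the normalizing constant does not appear.
    return -(x**3 - x)

for x in (-1.5, 0.3, 2.0):
    print(x, score_fd(x, log_z=0.0), score_fd(x, log_z=7.0), score_exact(x))
```

The three printed scores on each row should agree up to finite-difference error, regardless of the value passed as `log_z`.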

Parameterizing probability density functions. No matter how you change the model family and parameters, it has to be normalized (area under the curve must integrate to one).

Parameterizing score functions. No need to worry about normalization.

Similar to likelihood-based models, we can train score-based models by minimizing the Fisher divergence between the model and the data distributions, defined as

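$$\mathbb{E}_{p(\mathbf{x})}\left[\|\nabla_{\mathbf{x}} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2\right]. \tag{4}$$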

Intuitively, the Fisher divergence compares the squared $\ell_2$ distance between the ground-truth data score and the score-based model. Directly computing this divergence, however, is infeasible because it requires access to the unknown data score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. Fortunately, there exists a family of methods called score matching (commonly used variants include denoising score matching and sliced score matching) that minimize the Fisher divergence without knowledge of the ground-truth data score. Score matching objectives can be estimated directly on a dataset and optimized with stochastic gradient descent, analogous to the log-likelihood objective for training likelihood-based models (with known normalizing constants). We can train the score-based model by minimizing a score matching objective, without requiring adversarial optimization.

Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require sθ(x) to be an actual score function of any normalized distribution—it simply compares the ℓ2 distance between the ground-truth data score and the score-based model, with no additional assumptions on the form of sθ(x). In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.

As a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.
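To make this concrete, here is a minimal sketch of the original integration-by-parts form of score matching, which is one member of the score matching family but is not spelled out in this post. Up to a constant that does not depend on θ, the Fisher divergence equals E_p[‖sθ(x)‖² + 2 tr(∇x sθ(x))], which can be estimated from samples alone. The example assumes a hypothetical one-dimensional linear model sθ(x) = −a(x − b); for this model the objective is E[a²(x − b)² − 2a], whose minimizer is b = mean and a = 1/variance. The toy data and the model family are illustrative choices.

```python
import random

rng = random.Random(0)
# Toy data from N(2, 0.5^2); the true score is -(x - 2) / 0.25.
data = [rng.gauss(2.0, 0.5) for _ in range(20000)]

def sm_objective(a, b, xs):
    # Empirical E[ s(x)^2 + 2 * ds/dx ] for the linear model s(x) = -a * (x - b).
    return sum((a * (x - b)) ** 2 - 2.0 * a for x in xs) / len(xs)

# Closed-form minimizer of the objective for this linear model.
b_hat = sum(data) / len(data)
var_hat = sum((x - b_hat) ** 2 for x in data) / len(data)
a_hat = 1.0 / var_hat

print("fitted score: s(x) = -%.3f * (x - %.3f)" % (a_hat, b_hat))
print("objective at the fit:", sm_objective(a_hat, b_hat, data))
print("objective at a perturbed fit:", sm_objective(1.5 * a_hat, b_hat + 0.3, data))
```

The fitted parameters should land near a = 4 and b = 2, matching the true score of the toy data, even though the objective never evaluates the true score or any normalizing constant.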

Langevin dynamics

Now, once we have trained a score-based model sθ(x)≈∇xlog⁡p(x), we can use an iterative procedure called Langevin dynamics to draw samples from it.

Langevin dynamics provides an MCMC procedure to sample from a distribution p(x) using only its score function ∇xlog⁡p(x). Specifically, it initializes the chain from an arbitrary prior distribution x0∼π(x), and then iterates the following

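$$\mathbf{x}_{i+1} \gets \mathbf{x}_i + \epsilon \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \sqrt{2\epsilon}\, \mathbf{z}_i, \quad i = 0, 1, \cdots, K, \tag{5}$$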

where zi∼N(0,I). When ϵ→0 and K→∞, xK obtained from the procedure in (5) converges to a sample from p(x) under some regularity conditions. In practice, the error is negligible when ϵ is sufficiently small and K is sufficiently large.

Using Langevin dynamics to sample from a mixture of two Gaussians.

Note that Langevin dynamics accesses p(x) only through ∇xlog⁡p(x). Since sθ(x)≈∇xlog⁡p(x), we can produce samples from our score-based model sθ(x) by plugging it into equation (5).
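The update in equation (5) is short enough to write out directly. The sketch below runs it on a one-dimensional mixture of two Gaussians whose score is known in closed form, so the exact score stands in for a trained sθ(x). The mixture parameters, the step size, the number of steps and the initial distribution are all illustrative assumptions.

```python
import math
import random

def mixture_score(x, w=0.5, mu1=-2.0, mu2=2.0, sigma=1.0):
    # Exact score of w * N(mu1, sigma^2) + (1 - w) * N(mu2, sigma^2).
    la = math.log(w) - 0.5 * ((x - mu1) / sigma) ** 2
    lb = math.log(1.0 - w) - 0.5 * ((x - mu2) / sigma) ** 2
    m = max(la, lb)  # subtract the max before exponentiating, for stability
    a, b = math.exp(la - m), math.exp(lb - m)
    return (a * (mu1 - x) + b * (mu2 - x)) / ((a + b) * sigma**2)

def langevin_sample(score, x0, eps, num_steps, rng):
    # Equation (5): x <- x + eps * score(x) + sqrt(2 * eps) * z.
    x = x0
    for _ in range(num_steps):
        x = x + eps * score(x) + math.sqrt(2.0 * eps) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [
    langevin_sample(mixture_score, rng.gauss(0.0, 3.0), eps=0.01, num_steps=1000, rng=rng)
    for _ in range(500)
]
frac_right = sum(x > 0 for x in samples) / len(samples)
mean_abs = sum(abs(x) for x in samples) / len(samples)
print("fraction in right mode:", frac_right, " mean |x|:", mean_abs)
```

With equal mixture weights and a symmetric initial distribution, roughly half of the chains should end near each mode, with the mean of |x| close to 2.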

Naive score-based generative modeling and its pitfalls

So far, we’ve discussed how to train a score-based model with score matching, and then produce samples via Langevin dynamics.

Score-based generative modeling with score matching + Langevin dynamics.

This naive approach, however, has had limited success in practice because of some subtle pitfalls of score matching that received little attention in prior work.

The key challenge is the fact that the estimated score functions are inaccurate in low density regions, where few data points are available for computing the score matching objective. This is expected as score matching minimizes the Fisher divergence

$$\mathbb{E}_{p(\mathbf{x})}\left[\|\nabla_{\mathbf{x}} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2\right] = \int p(\mathbf{x})\, \|\nabla_{\mathbf{x}} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2 \, d\mathbf{x}.$$

Since the $\ell_2$ differences between the true data score function and the score-based model are weighted by $p(\mathbf{x})$, they are largely ignored in low density regions where $p(\mathbf{x})$ is small. This behavior can lead to subpar results such as the figure below:

Estimated scores are only accurate in high density regions.

When sampling with Langevin dynamics, our initial sample is highly likely to lie in a low density region when data reside in a high dimensional space. Therefore, having an inaccurate score-based model will derail Langevin dynamics from the very beginning of the procedure, preventing it from generating high quality samples that are representative of the data.

Score-based generative modeling with multiple noise perturbations

How can we bypass the difficulty of accurate score estimation in regions of low data density? Our solution is to perturb data points with noise and train score-based models on the noisy data points instead. When the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores. Concretely, here is what happens when we perturb a mixture of two Gaussians with additional Gaussian noise.

Estimated scores are accurate everywhere for the noise-perturbed data distribution due to reduced low data density regions.

Yet another question remains: how do we choose an appropriate noise scale for the perturbation process? Larger noise can obviously cover more low density regions for better score estimation, but it over-corrupts the data and alters it significantly from the original distribution. Smaller noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low density regions as well as we would like.

To achieve the best of both worlds, we propose to use multiple scales of noise perturbations simultaneously. Suppose we always perturb the data with isotropic Gaussian noise of mean zero, and let there be a total of $L$ increasing standard deviations $\sigma_1 < \sigma_2 < \cdots < \sigma_L$. We first perturb the data distribution $p(\mathbf{x})$ with each of the Gaussian noise distributions $\mathcal{N}(0, \sigma_i^2 I)$, $i = 1, 2, \cdots, L$, to obtain a noise-perturbed distribution

$$p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\, \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma_i^2 I)\, d\mathbf{y}.$$

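Note that we can easily draw samples from $p_{\sigma_i}(\mathbf{x})$ by sampling $\mathbf{x} \sim p(\mathbf{x})$ and computing $\mathbf{x} + \sigma_i \mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(0, I)$.

Next, we estimate the score function of each noise-perturbed distribution, $\nabla_{\mathbf{x}} \log p_{\sigma_i}(\mathbf{x})$, by training a Noise Conditional Score-Based Model $\mathbf{s}_\theta(\mathbf{x}, i)$ with score matching, such that $\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla_{\mathbf{x}} \log p_{\sigma_i}(\mathbf{x})$ for all $i = 1, 2, \cdots, L$.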

We apply multiple Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row).

The training objective for sθ(x,i) is a weighted sum of Fisher divergences for all noise scales. In particular, we use the objective below:

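$$\sum_{i=1}^{L} \lambda(i)\, \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\left[\|\nabla_{\mathbf{x}} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, i)\|_2^2\right], \tag{6}$$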

where $\lambda(i) \in \mathbb{R}_{>0}$ is a positive weighting function, often chosen to be $\lambda(i) = \sigma_i^2$. The objective (6) can be optimized with score matching, exactly as in optimizing the naive (unconditional) score-based model $\mathbf{s}_\theta(\mathbf{x})$.
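One common way to optimize objective (6) is denoising score matching. For Gaussian perturbations, the score of the perturbation kernel N(x̃; x, σ²I) is −(x̃ − x)/σ² = −z/σ, so with λ(i) = σi² each term of (6) becomes, up to a constant, E[‖σi sθ(x + σi z, i) + z‖²]. The sketch below evaluates this loss on hypothetical one-dimensional data x ∼ N(0, 1), where the perturbed score −x̃/(1 + σ²) is known exactly, and compares it with a deliberately wrong model that ignores the noise level. Both score functions are illustrative stand-ins for a neural network.

```python
import random

def dsm_loss(score_fn, sigma, xs, rng):
    # Monte Carlo estimate of E[(sigma * s(x + sigma * z, sigma) + z)^2],
    # i.e. denoising score matching with weight lambda = sigma^2.
    total = 0.0
    for x in xs:
        z = rng.gauss(0.0, 1.0)
        total += (sigma * score_fn(x + sigma * z, sigma) + z) ** 2
    return total / len(xs)

def exact_score(x, sigma):
    # Data are N(0, 1), so the perturbed distribution is N(0, 1 + sigma^2).
    return -x / (1.0 + sigma**2)

def naive_score(x, sigma):
    # Ignores the noise level: the score of the clean data N(0, 1).
    return -x

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
sigmas = [0.5, 1.0, 2.0]
losses_exact = [dsm_loss(exact_score, s, data, rng) for s in sigmas]
losses_naive = [dsm_loss(naive_score, s, data, rng) for s in sigmas]
for s, le, ln in zip(sigmas, losses_exact, losses_naive):
    print("sigma=%.1f  exact=%.3f  naive=%.3f" % (s, le, ln))
```

For this toy problem the expected loss of the exact score works out to 1/(1 + σ²), which is its minimum but not zero; the gap to the naive model grows with the noise scale, which is why the model needs to be conditioned on the noise level.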

After training our noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$, we can produce samples from it by running Langevin dynamics for $i = L, L-1, \cdots, 1$ in sequence. This method is called annealed Langevin dynamics (defined by Algorithm 1 in ref., and improved in ref.), since the noise scale $\sigma_i$ decreases (anneals) gradually over time.
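The following is a minimal sketch of annealed Langevin dynamics, not a transcription of the referenced Algorithm 1. It assumes a hypothetical one-dimensional data distribution (a mixture of two narrow Gaussians with weights 0.2 and 0.8) for which the perturbed score is known exactly, and it assumes a step size proportional to σi², which is one common choice. The constants `eps` and `steps_per_scale` are arbitrary.

```python
import math
import random

W_LEFT, MU_LEFT, MU_RIGHT, DATA_STD = 0.2, -4.0, 4.0, 0.5  # hypothetical toy data

def perturbed_score(x, sigma):
    # Exact score of p_sigma: each component has variance DATA_STD^2 + sigma^2.
    var = DATA_STD**2 + sigma**2
    la = math.log(W_LEFT) - 0.5 * (x - MU_LEFT) ** 2 / var
    lb = math.log(1.0 - W_LEFT) - 0.5 * (x - MU_RIGHT) ** 2 / var
    m = max(la, lb)
    a, b = math.exp(la - m), math.exp(lb - m)
    return (a * (MU_LEFT - x) + b * (MU_RIGHT - x)) / ((a + b) * var)

def annealed_langevin(score, sigmas, eps, steps_per_scale, rng):
    # sigmas is increasing (sigma_1 < ... < sigma_L); run from sigma_L down to sigma_1.
    x = rng.gauss(0.0, sigmas[-1])
    for sigma in reversed(sigmas):
        step = eps * sigma**2 / sigmas[0] ** 2  # assumed schedule: step proportional to sigma^2
        for _ in range(steps_per_scale):
            x = x + step * score(x, sigma) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

num_scales = 10
sigmas = [0.1 * (10.0 / 0.1) ** (i / (num_scales - 1)) for i in range(num_scales)]
rng = random.Random(0)
samples = [
    annealed_langevin(perturbed_score, sigmas, eps=1e-3, steps_per_scale=100, rng=rng)
    for _ in range(400)
]
frac_right = sum(x > 0 for x in samples) / len(samples)
print("fraction of samples in the right-hand mode:", frac_right)
```

Because the chain mixes freely at large noise scales and the mode weights are preserved by the perturbation, the fraction of samples in the right-hand mode should come out close to the true weight of 0.8.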

Annealed Langevin dynamics combine a sequence of Langevin chains with gradually decreasing noise scales.

Annealed Langevin dynamics for the Noise Conditional Score Network (NCSN) model (from ref.) trained on CelebA (left) and CIFAR-10 (right). We can start from complete noise, modify images according to the scores, and generate nice samples. The method achieved state-of-the-art Inception score on CIFAR-10 at its time.

Here are some practical recommendations for tuning score-based generative models with multiple noise scales:

Choose $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ as a geometric progression, with $\sigma_1$ being sufficiently small and $\sigma_L$ comparable to the maximum pairwise distance between all training data points. $L$ is typically on the order of hundreds or thousands.
Parameterize the score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$ with U-Net skip connections.
Apply exponential moving average on the weights of the score-based model when used at test time.
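The first recommendation can be sketched in a few lines. The helper names, the toy two-dimensional points and the value of `sigma_min` below are hypothetical; a real dataset would need a cheaper estimate of the maximum pairwise distance than the quadratic scan used here.

```python
import math

def geometric_sigmas(sigma_min, sigma_max, num_scales):
    # sigma_1 < ... < sigma_L with a constant ratio between neighbours.
    ratio = (sigma_max / sigma_min) ** (1.0 / (num_scales - 1))
    return [sigma_min * ratio**i for i in range(num_scales)]

def max_pairwise_distance(points):
    # O(N^2) Euclidean scan; fine for a toy set, too slow for a real dataset.
    return max(math.dist(p, q) for p in points for q in points)

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]  # hypothetical 2-D "dataset"
sigma_max = max_pairwise_distance(points)      # 5.0 for these points
sigmas = geometric_sigmas(sigma_min=0.01, sigma_max=sigma_max, num_scales=5)
print(sigmas)
```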

With such best practices, we are able to generate high quality image samples with comparable quality to GANs on various datasets, such as below:

Samples from the NCSNv2 model in ref. From left to right: FFHQ 256x256, LSUN bedroom 128x128, LSUN tower 128x128, LSUN church_outdoor 96x96, and CelebA 64x64.

Score-based generative modeling with stochastic differential equations (SDEs)

From the above discussion, we know that adding multiple noise scales is critical to the success of score-based generative models. By generalizing the number of noise scales to infinity, we can build the most powerful framework to date for score-based generative modeling. This allows for not only higher quality samples, but also exact log-likelihood computation, faster sampling, learning uniquely identifiable representations, and controllable generation for inverse problem solving.

In addition to this introduction, we have tutorials written in Google Colab to provide a step-by-step guide for training a toy model on MNIST. We also have more advanced code repositories that provide full-fledged implementations for real applications.

Tutorial of score-based generative modeling with SDEs in JAX + FLAX
Score SDE codebase in JAX + FLAX
Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (JAX + FLAX)
Tutorial of score-based generative modeling with SDEs in PyTorch
Score SDE codebase in PyTorch
Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (PyTorch)

Perturbing data with an SDE

When the number of noise scales approaches infinity, we essentially perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below

Perturbing data to noise with a continuous-time stochastic process.

How can we represent a stochastic process in a concise way? Many stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs). In general, an SDE possesses the following form:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w}, \tag{7}$$

where $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued function called the drift coefficient, $g(t) \in \mathbb{R}$ is a real-valued function called the diffusion coefficient, $\mathbf{w}$ denotes a standard Brownian motion, and $d\mathbf{w}$ can be viewed as infinitesimal white noise. The solution of a stochastic differential equation is a continuous collection of random variables $\{\mathbf{x}(t)\}_{t \in [0, T]}$. These random variables trace stochastic trajectories as the time index $t$ grows from the start time $0$ to the end time $T$. Let $p_t(\mathbf{x})$ denote the (marginal) probability density function of $\mathbf{x}(t)$. Here $t \in [0, T]$ is analogous to $i = 1, 2, \cdots, L$ when we had a finite number of noise scales, and $p_t(\mathbf{x})$ is analogous to $p_{\sigma_i}(\mathbf{x})$. Clearly, $p_0(\mathbf{x}) = p(\mathbf{x})$ is the data distribution since no perturbation is applied to data at $t = 0$. After perturbing $p(\mathbf{x})$ with the stochastic process for a sufficiently long time $T$, $p_T(\mathbf{x})$ typically becomes a simple noise distribution, which we call the prior distribution. This is analogous to $p_{\sigma_L}(\mathbf{x})$ in the case of finite noise scales, which corresponds to applying the largest noise perturbation $\sigma_L$ to the data.

The SDE in (7) is hand designed, similarly to how we hand-designed σ1<σ2<⋯<σL in the case of finite noise scales. There are numerous ways to add noise perturbations, and the choice of SDEs is not unique. For example, the following SDE

$$d\mathbf{x} = e^{t}\, d\mathbf{w} \tag{8}$$

perturbs data with a Gaussian noise of mean zero and exponentially growing variance, which is analogous to perturbing data with $\mathcal{N}(0, \sigma_1^2 I), \mathcal{N}(0, \sigma_2^2 I), \cdots, \mathcal{N}(0, \sigma_L^2 I)$ when $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric progression. Therefore, the SDE should be viewed as part of the hyperparameters for the model, much like $\{\sigma_1, \sigma_2, \cdots, \sigma_L\}$. In ref., we provide three SDEs that generally work well for images.
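For SDE (8) the perturbation kernel is available in closed form: since there is no drift, x(t) given x(0) is Gaussian with mean x(0) and variance ∫₀ᵗ e^{2s} ds = (e^{2t} − 1)/2. The sketch below checks this in one dimension by simulating the forward SDE with a simple Euler-Maruyama discretization; the step count and the number of paths are arbitrary choices.

```python
import math
import random

def simulate_forward(x0, T, num_steps, rng):
    # Euler-Maruyama for dx = e^t dw: each step adds N(0, e^{2t} * dt) noise.
    dt = T / num_steps
    x = x0
    for k in range(num_steps):
        t = k * dt
        x += math.exp(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
T = 1.0
finals = [simulate_forward(0.0, T, num_steps=500, rng=rng) for _ in range(1000)]
mean = sum(finals) / len(finals)
empirical_var = sum((x - mean) ** 2 for x in finals) / len(finals)
analytic_var = (math.exp(2.0 * T) - 1.0) / 2.0
print("empirical variance:", empirical_var, " analytic:", analytic_var)
```

The empirical variance should match the analytic value up to Monte Carlo and discretization error, which confirms that the noise level grows exponentially in t.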

Reversing the SDE for sample generation

With a finite number of noise scales, we can generate samples by reversing the perturbation process with annealed Langevin dynamics, i.e., sequentially sampling from each noise-perturbed distribution using Langevin dynamics. For infinite noise scales, we can analogously reverse the perturbation process for sample generation by using the reverse SDE.

Generate data from noise by reversing the perturbation procedure.

Importantly, any SDE has a corresponding reverse SDE, whose closed form is given by

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t) \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] dt + g(t)\, d\mathbf{w}. \tag{9}$$

Here dt represents a negative infinitesimal time step, since the SDE (9) needs to be solved backwards in time (from t=T to t=0). In order to compute the reverse SDE, we need to estimate ∇xlog⁡pt(x), which is exactly the score function of pt(x).

Solving a reverse SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with an SDE, which can be reversed if we know the score of the distribution at each intermediate time step.

Estimating the reverse SDE with score-based models and score matching

In order to estimate ∇xlog⁡pt(x), we train a Time-Dependent Score-Based Model sθ(x,t), such that sθ(x,t)≈∇xlog⁡pt(x). This is analogous to the noise-conditional score-based model sθ(x,i) used for finite noise scales, trained such that sθ(x,i)≈∇xlog⁡pσi(x).

Our training objective for sθ(x,t) is a continuous mixture of Fisher divergences, given by

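$$\mathbb{E}_{t \in \mathcal{U}(0, T)} \mathbb{E}_{p_t(\mathbf{x})}\left[\lambda(t)\, \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, t)\|_2^2\right], \tag{10}$$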

where $\mathcal{U}(0, T)$ denotes a uniform distribution over the time interval $[0, T]$, and $\lambda: \mathbb{R} \to \mathbb{R}_{>0}$ is a positive weighting function. Typically we use $\lambda(t) \propto 1/\mathbb{E}\left[\|\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0))\|_2^2\right]$. When $\lambda(t) = g^2(t)$, we have an important connection between our mixture of Fisher divergences objective and the KL divergence under some regularity conditions:

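$$\operatorname{KL}(p_0(\mathbf{x}) \,\|\, q_0(\mathbf{x})) = \frac{T}{2}\, \mathbb{E}_{t \in \mathcal{U}(0, T)} \mathbb{E}_{p_t(\mathbf{x})}\left[\lambda(t)\, \|\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x})\|_2^2\right] + \operatorname{KL}(p_T(\mathbf{x}) \,\|\, q_T(\mathbf{x})), \tag{11}$$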

where $p_t$ and $q_t$ denote the distributions of $\mathbf{x}(t)$ when $\mathbf{x}(0) \sim p_0$ and $\mathbf{x}(0) \sim q_0$ respectively. Due to this special connection to the KL divergence and the equivalence between KL divergence and maximum likelihood, we call $\lambda(t) = g^2(t)$ the likelihood weighting function.

As before, our mixture of Fisher divergences objective can be efficiently optimized with score matching methods, such as denoising score matching and sliced score matching. Once our score-based model $\mathbf{s}_\theta(\mathbf{x}, t)$ is trained to optimality, we can plug it into the form of the reverse SDE in (9) to obtain an estimated reverse SDE.

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t)\, \mathbf{s}_\theta(\mathbf{x}, t)\right] dt + g(t)\, d\mathbf{w}. \tag{12}$$

How to solve the reverse SDE

By solving the estimated reverse SDE with numerical SDE solvers, we can simulate the reverse stochastic process for sample generation. Perhaps the simplest numerical SDE solver is the Euler-Maruyama method. When applied to our estimated reverse SDE, it discretizes the SDE using finite time steps and small Gaussian noise. Specifically, it chooses a small negative time step Δt≈0, initializes t←T, and iterates the following procedure until t≈0:

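$$\begin{aligned} \Delta \mathbf{x} &\gets \left[\mathbf{f}(\mathbf{x}, t) - g^2(t)\, \mathbf{s}_\theta(\mathbf{x}, t)\right] \Delta t + g(t) \sqrt{|\Delta t|}\, \mathbf{z}_t \\ \mathbf{x} &\gets \mathbf{x} + \Delta \mathbf{x} \\ t &\gets t + \Delta t, \end{aligned}$$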

Here zt∼N(0,I). The Euler-Maruyama method is qualitatively similar to Langevin dynamics—both update x by following score functions perturbed with Gaussian noise.
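The Euler-Maruyama iteration can be sketched on a toy problem where everything is known in closed form. The example assumes the forward SDE dx = e^t dw from equation (8), for which f = 0 and g(t) = e^t, and hypothetical one-dimensional data N(3, 0.5²), so that p_t is Gaussian and its exact score can stand in for a trained sθ(x, t). The horizon T, the number of steps and the zero-mean prior are illustrative choices.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5  # hypothetical 1-D data distribution N(3, 0.5^2)

def sigma2(t):
    # Variance added by the forward SDE dx = e^t dw up to time t.
    return (math.exp(2.0 * t) - 1.0) / 2.0

def score(x, t):
    # Exact score of p_t = N(DATA_MEAN, DATA_STD^2 + sigma2(t)); stands in for s_theta.
    return -(x - DATA_MEAN) / (DATA_STD**2 + sigma2(t))

def reverse_sde_sample(T, num_steps, rng):
    dt = -T / num_steps                        # small negative time step
    x = rng.gauss(0.0, math.sqrt(sigma2(T)))   # prior sample: N(0, sigma2(T))
    t = T
    for _ in range(num_steps):
        g = math.exp(t)                        # diffusion coefficient; the drift f is zero
        dx = -(g**2) * score(x, t) * dt + g * math.sqrt(abs(dt)) * rng.gauss(0.0, 1.0)
        x, t = x + dx, t + dt
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(T=2.0, num_steps=500, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
print("sample mean:", mean, " sample std:", std)
```

Although the chains start from a wide zero-mean prior, the samples at t = 0 should have mean close to 3 and standard deviation close to 0.5, i.e. they recover the data distribution.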

Aside from the Euler-Maruyama method, other numerical SDE solvers can be directly employed to solve the reverse SDE for sample generation, including, for example, the Milstein method and stochastic Runge-Kutta methods. In ref., we provide a reverse diffusion solver similar to Euler-Maruyama, but more tailored for solving reverse-time SDEs.

In addition, there are two special properties of our reverse SDE that allow for even more flexible sampling methods:

We have an estimate of ∇xlog⁡pt(x) via our time-dependent score-based model sθ(x,t).
We only care about sampling from each marginal distribution pt(x). Samples obtained at different time steps can have arbitrary correlations and do not have to form a particular trajectory sampled from the reverse SDE.

As a consequence of these two properties, we can apply MCMC approaches to fine-tune the trajectories obtained from numerical SDE solvers. Specifically, we propose Predictor-Corrector samplers. The predictor can be any numerical SDE solver that predicts $\mathbf{x}(t + \Delta t) \sim p_{t + \Delta t}(\mathbf{x})$ from an existing sample $\mathbf{x}(t) \sim p_t(\mathbf{x})$. The corrector can be any MCMC procedure that solely relies on the score function, such as Langevin dynamics and Hamiltonian Monte Carlo.

At each step of the Predictor-Corrector sampler, we first use the predictor to choose a proper step size $\Delta t$, and then predict $\mathbf{x}(t + \Delta t)$ based on the current sample $\mathbf{x}(t)$. Next, we run several corrector steps to improve the sample $\mathbf{x}(t + \Delta t)$ according to our score-based model $\mathbf{s}_\theta(\mathbf{x}, t + \Delta t)$, so that $\mathbf{x}(t + \Delta t)$ becomes a higher-quality sample from $p_{t + \Delta t}(\mathbf{x})$.
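A minimal Predictor-Corrector loop might look as follows, with an Euler-Maruyama predictor and a Langevin corrector. It assumes the forward SDE dx = e^t dw and hypothetical one-dimensional data N(3, 0.5²), so the exact score of p_t replaces sθ(x, t). The corrector step size here is a heuristic that uses the known variance of this toy problem; in practice it has to be tuned, and this sketch does not reproduce the step-size rule of any particular paper.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5  # hypothetical 1-D data distribution N(3, 0.5^2)

def sigma2(t):
    # Variance added by the forward SDE dx = e^t dw up to time t.
    return (math.exp(2.0 * t) - 1.0) / 2.0

def score(x, t):
    # Exact score of p_t; stands in for a trained s_theta(x, t).
    return -(x - DATA_MEAN) / (DATA_STD**2 + sigma2(t))

def pc_sample(T, num_steps, corrector_steps, rng):
    dt = -T / num_steps
    x = rng.gauss(0.0, math.sqrt(sigma2(T)))
    t = T
    for _ in range(num_steps):
        # Predictor: one Euler-Maruyama step of the reverse SDE.
        g = math.exp(t)
        x += -(g**2) * score(x, t) * dt + g * math.sqrt(abs(dt)) * rng.gauss(0.0, 1.0)
        t += dt
        # Corrector: Langevin steps that target p_t at the new time.
        eps = 0.05 * (DATA_STD**2 + sigma2(t))  # heuristic step size for this toy case
        for _ in range(corrector_steps):
            x += eps * score(x, t) + math.sqrt(2.0 * eps) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [pc_sample(T=2.0, num_steps=500, corrector_steps=1, rng=rng) for _ in range(300)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
print("sample mean:", mean, " sample std:", std)
```

Because each corrector step leaves p_t approximately invariant, it can only pull the predictor output closer to the correct marginal; the final samples should again have mean near 3 and standard deviation near 0.5.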

With Predictor-Corrector methods and better architectures of score-based models, we can achieve state-of-the-art sample quality on CIFAR-10 (measured in FID and Inception scores), outperforming the best GAN model to date (StyleGAN2 + ADA).

Method	FID ↓	Inception score ↑
StyleGAN2 + ADA	2.92	9.83
Ours	2.20	9.89

The sampling methods also scale to extremely high dimensional data. For example, they can successfully generate high fidelity images of resolution 1024×1024.

1024 x 1024 samples from a score-based model trained on the FFHQ dataset.

Some additional (uncurated) samples for other datasets (taken from this GitHub repo):

256 x 256 samples on LSUN bedroom.

256 x 256 samples on CelebA-HQ.

Probability flow ODE

Additionally, it is possible to convert any SDE into an ordinary differential equation (ODE) without changing its marginal distributions $\{p_t(\mathbf{x})\}_{t \in [0, T]}$. Thus by solving this ODE, we can sample from the same distributions as the reverse SDE. The corresponding ODE of an SDE is named the probability flow ODE, given by

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g^2(t) \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] dt. \tag{13}$$

The following figure depicts trajectories of both SDEs and probability flow ODEs. Although ODE trajectories are noticeably smoother than SDE trajectories, they convert the same data distribution to the same prior distribution and vice versa, sharing the same set of marginal distributions $\{p_t(\mathbf{x})\}_{t \in [0, T]}$. In other words, trajectories obtained by solving the probability flow ODE have the same marginal distributions as the SDE trajectories.

We can map data to a noise distribution (the prior) with an SDE, and reverse this SDE for generative modeling. We can also reverse the associated probability flow ODE, which yields a deterministic process that samples from the same distribution as the SDE. Both the reverse-time SDE and probability flow ODE can be obtained by estimating score functions.
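As an illustration of the deterministic mapping defined by equation (13), the sketch below integrates the probability flow ODE forwards (data to latent) and backwards (latent to data) with a hand-written fourth-order Runge-Kutta solver. It assumes the forward SDE dx = e^t dw and hypothetical one-dimensional data N(3, 0.5²), with the exact score in place of sθ(x, t). For this Gaussian case the ODE has the closed-form solution x(t) = m + (x(0) − m)·sqrt(var_t / var_0), which the code uses as a check.

```python
import math

DATA_MEAN, DATA_STD = 3.0, 0.5  # hypothetical 1-D data distribution N(3, 0.5^2)

def sigma2(t):
    # Variance added by the forward SDE dx = e^t dw up to time t.
    return (math.exp(2.0 * t) - 1.0) / 2.0

def score(x, t):
    # Exact score of p_t; stands in for a trained s_theta(x, t).
    return -(x - DATA_MEAN) / (DATA_STD**2 + sigma2(t))

def ode_drift(x, t):
    # Right-hand side of the probability flow ODE (13) with f = 0 and g(t) = e^t.
    return -0.5 * math.exp(2.0 * t) * score(x, t)

def integrate(x, t_start, t_end, num_steps=2000):
    # Classical fourth-order Runge-Kutta; works forwards or backwards in time.
    h = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        k1 = ode_drift(x, t)
        k2 = ode_drift(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = ode_drift(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = ode_drift(x + h * k3, t + h)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += h
    return x

T, x0 = 2.0, 3.7
latent = integrate(x0, 0.0, T)     # encode: data -> latent
recon = integrate(latent, T, 0.0)  # decode: latent -> data
var_ratio = (DATA_STD**2 + sigma2(T)) / DATA_STD**2
exact_latent = DATA_MEAN + (x0 - DATA_MEAN) * math.sqrt(var_ratio)
print("latent:", latent, " exact:", exact_latent, " reconstruction:", recon)
```

No random numbers are involved: the same input always maps to the same latent value, and decoding recovers the input up to solver error.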

This probability flow ODE formulation has several unique advantages.

Exact log-likelihood computation

When ∇xlog⁡pt(x) is replaced by its approximation sθ(x,t), the probability flow ODE becomes a special case of a neural ODE. In addition, it is an example of continuous normalizing flows, since the probability flow ODE converts a data distribution p0(x) to a prior noise distribution pT(x) (recall that it shares the same marginal distributions as the SDE) and is fully invertible.

The probability flow ODE therefore inherits all properties of neural ODEs or continuous normalizing flows, including exact log-likelihood computation. Specifically, we can leverage the instantaneous change-of-variable formula (Theorem 1 in ref., Equation (4) in ref.) to compute the unknown data density $p_0$ from the known prior density $p_T$ with numerical ODE solvers.

As a result, we can compute the exact log-likelihoods of our score-based models through the probability flow ODE formulation. In fact, our model achieves state-of-the-art log-likelihoods on uniformly dequantized CIFAR-10 images, even without maximum likelihood training. (It is typical for normalizing flow models to convert discrete images to continuous ones by adding small uniform noise to them.)
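The instantaneous change-of-variable formula says that along a probability flow ODE trajectory, d log p_t(x(t))/dt equals minus the divergence of the ODE's right-hand side, so that log p_0(x(0)) = log p_T(x(T)) + ∫₀ᵀ (divergence) dt. The sketch below applies this to the forward SDE dx = e^t dw with hypothetical one-dimensional data N(3, 0.5²), using the exact score and the true p_T as the prior. The divergence is written analytically here, which is only possible because the problem is a toy; high-dimensional models need a numerical estimate of it.

```python
import math

DATA_MEAN, DATA_STD = 3.0, 0.5  # hypothetical 1-D data distribution N(3, 0.5^2)

def sigma2(t):
    # Variance added by the forward SDE dx = e^t dw up to time t.
    return (math.exp(2.0 * t) - 1.0) / 2.0

def augmented_rhs(x, t):
    var_t = DATA_STD**2 + sigma2(t)
    g2 = math.exp(2.0 * t)
    drift = 0.5 * g2 * (x - DATA_MEAN) / var_t  # -0.5 * g^2 * score(x, t)
    divergence = 0.5 * g2 / var_t               # d(drift)/dx, analytic in this toy case
    return drift, divergence

def log_likelihood(x0, T=2.0, num_steps=2000):
    # Integrate x and the divergence jointly from t = 0 to t = T with RK4.
    h = T / num_steps
    x, t, div_integral = x0, 0.0, 0.0
    for _ in range(num_steps):
        k1 = augmented_rhs(x, t)
        k2 = augmented_rhs(x + 0.5 * h * k1[0], t + 0.5 * h)
        k3 = augmented_rhs(x + 0.5 * h * k2[0], t + 0.5 * h)
        k4 = augmented_rhs(x + h * k3[0], t + h)
        x += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        div_integral += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        t += h
    var_T = DATA_STD**2 + sigma2(T)
    log_prior = -0.5 * math.log(2.0 * math.pi * var_T) - 0.5 * (x - DATA_MEAN) ** 2 / var_T
    return log_prior + div_integral

def exact_log_density(x0):
    var0 = DATA_STD**2
    return -0.5 * math.log(2.0 * math.pi * var0) - 0.5 * (x0 - DATA_MEAN) ** 2 / var0

for x0 in (2.5, 3.0, 3.7):
    print(x0, log_likelihood(x0), exact_log_density(x0))
```

The two printed columns should agree to solver precision, which shows how a known prior density plus an ODE solve yields the data log-density.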

Method	Negative log-likelihood (bits/dim) ↓
RealNVP	3.49
iResNet	3.45
Glow	3.35
FFJORD	3.40
Flow++	3.29
Ours	2.99

Manipulating latent representations

As in normalizing flow models, we can obtain a latent representation for any input by integrating the probability flow ODE from t=0 to t=T. We can then manipulate the latent space, and reconstruct the edited image by integrating the probability flow ODE backwards from t=T to t=0. This can provide continuous interpolations between two images, as shown below.

Interpolating between the top left figure and the bottom right figure by doing spherical linear interpolation in the latent space.
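Spherical linear interpolation itself is a short formula: with θ the angle between the two latent vectors, the interpolant is sin((1 − α)θ)/sin θ times the first vector plus sin(αθ)/sin θ times the second. The sketch below is a plain-Python version on lists of floats; the function name and the fallback for nearly parallel vectors are choices made for this example, and antiparallel inputs are not handled.

```python
import math

def slerp(z0, z1, alpha):
    # Spherical linear interpolation between vectors z0 and z1, alpha in [0, 1].
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1.0 - alpha) * a + alpha * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - alpha) * theta) / math.sin(theta)
    w1 = math.sin(alpha * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

z_a = [1.0, 0.0, 0.0]
z_b = [0.0, 1.0, 0.0]
mid = slerp(z_a, z_b, 0.5)
mid_norm = math.sqrt(sum(v * v for v in mid))
print(mid, mid_norm)
```

Unlike linear interpolation, the midpoint of two unit vectors keeps unit norm, which matters when the latent prior is a high-dimensional Gaussian whose samples concentrate near a sphere.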

Uniquely identifiable encoding

But what sets our method apart from typical normalizing flow models is that our latent representations are uniquely identifiable. This means that any input will be uniquely mapped to the same latent code, given sufficient data, model capacity and optimization accuracy. By contrast, normalizing flow models may map the same input into different latent codes when using different model architectures or training with different random seeds.

Our latent codes are uniquely identifiable because the probability flow ODE in equation (13) does not have trainable parameters. Instead, the probability flow ODE, as well as latent codes obtained from it, is fully determined by the data distribution itself and the forward SDE. As long as sθ(x,t)≈∇xlog⁡pt(x), we will always have roughly the same latent code for the same input, no matter how sθ(x,t) was parameterized or trained.

Sanity check on identifiable encoding. We compare the first 100 dimensions of the latent code obtained for a random CIFAR-10 image for two different score-based models, where “Model A” and “Model B” are separately trained with different architectures. The two latent codes are very close to each other despite using different score-based models.

More efficient sample generation

We can obtain samples from the probability flow ODE by sampling from the prior noise distribution pT(x) and then integrating the ODE from t=T to t=0. This procedure is exactly the same as sampling from continuous normalizing flow models.

Just like in continuous normalizing flows, we can exploit highly-optimized adaptive ODE solvers to integrate the probability flow ODE for sample generation. These ODE solvers not only require fewer steps than numerical SDE solvers in many cases, but also allow trading off sample quality for sampling speed by tuning the numerical precision.

Probability flow ODE enables fast sampling with adaptive stepsizes as the numerical precision is varied (from left to right: 1e-1, 1e-3, 1e-5), and reduces the number of score function evaluations (NFE) without harming quality. In comparison, numerical SDE solvers may require 1000 NFE for generating samples of similar quality.

Although ODE solvers may be more efficient, numerical SDE solvers still have their own advantages. They typically provide samples of higher quality as measured by quantitative metrics like FID and Inception scores. They are also easier to incorporate in Predictor-Corrector methods, which can lead to even better sample quality. In addition, it is often easier to modulate their sampling procedure for controllable sample generation, a point on which we elaborate below.

Controllable generation for inverse problem solving

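Score-based generative models are particularly suitable for solving inverse problems. At their core, inverse problems are the same as Bayesian inference problems. Let $\mathbf{x}$ and $\mathbf{y}$ be two random variables, and suppose we know the forward process of generating $\mathbf{y}$ from $\mathbf{x}$, represented by the transition probability distribution $p(\mathbf{y} \mid \mathbf{x})$. The inverse problem is to compute $p(\mathbf{x} \mid \mathbf{y})$. From Bayes’ rule, we have $p(\mathbf{x} \mid \mathbf{y}) = p(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) / \int p(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{x}$. This expression can be greatly simplified by taking the logarithm and then the gradient with respect to $\mathbf{x}$ on both sides: the denominator does not depend on $\mathbf{x}$, so it drops out, leading to the following Bayes’ rule for score functions: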

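$$\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x}). \tag{14}$$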

Through score matching, we can train a model to estimate the score function of the unconditional data distribution, i.e., sθ(x)≈∇xlog⁡p(x). This will allow us to easily compute the posterior score function ∇xlog⁡p(x∣y) from the known forward process p(y∣x) via equation (14), and sample from it with Langevin dynamics.
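Here is a minimal sketch of this recipe on a toy problem where the posterior is known. It assumes a hypothetical prior x ∼ N(0, 1) and forward process y = x + τ·noise, so the posterior is Gaussian with mean y/(1 + τ²) and variance τ²/(1 + τ²). The prior score plays the role of sθ(x), the likelihood score comes from the known forward process, and their sum is plugged into Langevin dynamics. The step size and the number of steps are arbitrary.

```python
import math
import random

TAU = 0.5  # hypothetical observation noise: y = x + TAU * noise

def prior_score(x):
    # Score of the prior p(x) = N(0, 1); stands in for a trained s_theta(x).
    return -x

def likelihood_score(x, y):
    # Gradient with respect to x of log N(y; x, TAU^2).
    return (y - x) / TAU**2

def posterior_score(x, y):
    # Bayes' rule for score functions, equation (14).
    return prior_score(x) + likelihood_score(x, y)

def langevin(y, eps, num_steps, rng):
    x = rng.gauss(0.0, 1.0)
    for _ in range(num_steps):
        x += eps * posterior_score(x, y) + math.sqrt(2.0 * eps) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
y = 2.0
samples = [langevin(y, eps=0.005, num_steps=400, rng=rng) for _ in range(1000)]
post_mean = sum(samples) / len(samples)
post_var = sum((x - post_mean) ** 2 for x in samples) / len(samples)
exact_mean = y / (1.0 + TAU**2)        # 1.6
exact_var = TAU**2 / (1.0 + TAU**2)    # 0.2
print("mean:", post_mean, "vs", exact_mean, " var:", post_var, "vs", exact_var)
```

The unconditional score is never retrained for the observation y; only the likelihood term changes, which is what makes this approach attractive for inverse problems.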

However, we know that score-based generative models work best when trained on a sequence of noise-perturbed data distributions. How can we solve inverse problems when perturbing data with an SDE and training time-dependent score-based models? With a perturbation SDE, we can first convert the posterior distribution p(x∣y) to a known noise distribution pT(x) by setting p0(x)=p(x∣y), and then reverse the stochastic process to convert a sample from the noise distribution x(T)∼pT(x) to a sample from the posterior distribution x(0)∼p0(x)=p(x∣y). This procedure is illustrated below:

We perturb the posterior distribution to a noise prior with an SDE, and then reverse it to sample from the posterior distribution, thereby solving the inverse problem.

The reverse stochastic process is described by a conditional reverse-time SDE:

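$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t) \nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid \mathbf{y})\right] dt + g(t)\, d\mathbf{w}. \tag{15}$$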

With the Bayes’ rule for score functions, we can rewrite ∇xlog⁡pt(x∣y) as below

$$\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x). \tag{16}$$

Here the first term is an unconditional time-dependent score function, which can be approximated by our time-dependent score-based model since sθ(x,t)≈∇xlog⁡pt(x). The second term ∇xlog⁡pt(y∣x) can often be obtained in two ways:

Training a separate model. Since we know the forward process p(y∣x), and we have training data {x1,x2,⋯,xN} drawn i.i.d. from the data distribution p0(x), we can generate samples {(x1,y1),⋯,(xN,yN)} drawn i.i.d. from p0(x)p(y∣x). From our forward SDE, we also know how to perturb any clean data sample xi to a noisy one xi(t)∼pt(x∣xi). This means we can easily get samples {(x1(t),y1),⋯,(xN(t),yN)} drawn i.i.d. from pt(x)pt(y∣x) for any t, and use them to train a model for pt(y∣x). One application of this approach is class-conditional image generation, where y represents a target class label. We can train a noise-conditional classifier qϕ(y,x,t)≈pt(y∣x) based on pairs of noise-perturbed images and the corresponding class labels, and combine it with an unconditionally trained time-dependent score-based model sθ(x,t) for conditional sample generation.
Specified with domain knowledge. When the forward process is a linear model, i.e., p(y∣x)=δ(y=Mx), where M is a linear operator and δ(⋅) represents a point mass, it is often possible to directly specify an approximation to ∇xlog pt(y∣x). There are many examples of linear forward problems, including image inpainting and colorization. A general approach to specifying ∇xlog pt(y∣x) for linear inverse problems can be found in Appendix I.4 of "Score-Based Generative Modeling through Stochastic Differential Equations" (Song et al., 2021), or in Kadkhodaie and Simoncelli (2020).
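As a rough illustration of equations (15) and (16), the sketch below solves the conditional reverse-time SDE with a plain Euler-Maruyama discretization on a one-dimensional toy problem. All specifics are assumptions chosen so that both score terms are available in closed form: the data distribution is p0(x)=N(0,1), the forward SDE has f=0 and constant g(t)²=S with T=1 (so pt(x)=N(0,1+St)), and the observation is y = x(0) plus Gaussian noise of standard deviation s. The analytic `prior_score` and `likelihood_score` stand in for a trained sθ(x,t) and a noise-conditional model of pt(y∣x); these names are hypothetical.

```python
import numpy as np


def prior_score(x, t, S):
    """Exact score of p_t(x) = N(0, 1 + S t); stands in for s_theta(x, t)."""
    return -x / (1.0 + S * t)


def likelihood_score(x, y, t, S, s):
    """Exact gradient of log p_t(y | x(t)) for the assumed Gaussian toy model.

    Here y | x(t) is Gaussian with mean m * x(t) and variance v.
    """
    a = 1.0 + S * t
    m = 1.0 / a
    v = 1.0 + s**2 - 1.0 / a
    return m * (y - m * x) / v


def conditional_reverse_sde(y, S=25.0, s=1.0, n_samples=20000,
                            n_steps=2000, seed=0):
    """Euler-Maruyama solver for equation (15), run from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    # Start from the noise prior p_T(x) = N(0, 1 + S).
    x = rng.normal(0.0, np.sqrt(1.0 + S), size=n_samples)
    for i in range(n_steps):
        t = 1.0 - i * dt
        # Equation (16): conditional score = unconditional score + likelihood score.
        score = prior_score(x, t, S) + likelihood_score(x, y, t, S, s)
        # Backward-in-time step with f = 0 and g(t)^2 = S.
        x = x + S * score * dt + np.sqrt(S * dt) * rng.normal(size=n_samples)
    return x


if __name__ == "__main__":
    samples = conditional_reverse_sde(y=2.0)
    # Closed-form posterior for comparison: N(y / (1 + s^2), s^2 / (1 + s^2)),
    # i.e. mean 1.0 and variance 0.5 for y = 2, s = 1.
    print("empirical mean:", samples.mean())
    print("empirical variance:", samples.var())
```

Starting from the unconditional noise prior rather than the exact pT(x∣y) introduces a small bias in this toy setting; it shrinks as S grows, at the cost of needing more solver steps.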

Class-conditional generation with an unconditional time-dependent score-based model, and a pre-trained noise-conditional image classifier on CIFAR-10.

Image inpainting with a time-dependent score-based model trained on LSUN bedroom. The leftmost column is ground-truth. The second column shows masked images (y in our framework). The remaining columns show different inpainted images, generated by solving the conditional reverse-time SDE.

Image colorization with a time-dependent score-based model trained on LSUN church_outdoor and bedroom. The leftmost column is ground-truth. The second column shows gray-scale images (y in our framework). The remaining columns show different colorized images, generated by solving the conditional reverse-time SDE.

We can even colorize gray-scale portraits of famous people in history (Abraham Lincoln) with a time-dependent score-based model trained on FFHQ. The image resolution is 1024 x 1024.

Concluding remarks

This blog post gives a detailed introduction to score-based generative modeling with noise perturbations. We demonstrate that this new paradigm of generative modeling is able to produce high-quality samples, compute exact log-likelihoods, and perform controllable generation for inverse problem solving. It is a compilation of several papers we published in the last two years; please see those papers, listed in the references, if you are interested in more details.

Score-based generative models are strongly connected to diffusion probabilistic models, a type of VAE with multiple stochastic layers first proposed by Jascha Sohl-Dickstein and his colleagues. Last year, Jonathan Ho and colleagues showed that the evidence lower bound (ELBO) used for training diffusion probabilistic models is essentially equivalent to the mixture of score matching objectives used in score-based generative modeling. Moreover, by parameterizing the decoder as a sequence of score-based models, they demonstrated for the first time that diffusion models can generate high-quality image samples comparable to GANs. Later, we showed in an ICLR paper that the Langevin-type sampling method of diffusion probabilistic models can be unified with annealed Langevin dynamics of score-based models to create a more powerful sampler (the Predictor-Corrector sampler).

These latest developments suggest that both score-based generative modeling with multiple noise perturbations and denoising diffusion probabilistic models are just different perspectives of the same model family, much like how wave mechanics and matrix mechanics are equivalent formulations of quantum mechanics in the history of physics. The perspective of score matching and score-based models allows one to calculate log-likelihoods exactly, solve inverse problems naturally, and is directly connected to energy-based models. The perspective of diffusion models is naturally connected to VAEs and can be directly incorporated with variational probabilistic inference. This blog post focuses on the first perspective, but I highly recommend that interested readers also learn about the alternative perspective of denoising diffusion probabilistic models (I might also write a blog post about it in the future).

There are two major challenges of score-based generative models. First, the sampling speed is slow since it involves a large number of Langevin-type iterations. Second, it is inconvenient to work with discrete data distributions since scores are only defined on continuous distributions.

The first challenge can be partially solved by using numerical ODE solvers for the probability flow ODE with lower precision (a similar method, denoising diffusion implicit modeling, has been proposed by Song, Meng and Ermon, 2021). It is also possible to learn a direct mapping from the latent space of probability flow ODEs to the image space, as shown by Luhman and Luhman (2021). However, all such methods result in worse sample quality.
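The trade-off between solver precision and cost can be seen in a small sketch. It integrates the probability flow ODE, dx = [f(x,t) − ½g²(t)∇xlog pt(x)]dt, backward in time with fixed-step Euler updates, once with few steps and once with many. The setup is an assumption for illustration only: f=0, constant g(t)²=S, T=1, and Gaussian data p0(x)=N(0,1), so that the exact score of pt(x)=N(0,1+St) stands in for a trained model and the exact ODE solution x(0)=x(T)/√(1+S) is known. The function names are hypothetical.

```python
import numpy as np

S = 25.0  # assumed constant value of g(t)^2


def score(x, t):
    """Exact score of p_t(x) = N(0, 1 + S t); stands in for s_theta(x, t)."""
    return -x / (1.0 + S * t)


def probability_flow_euler(x_T, n_steps, T=1.0):
    """Integrate the probability flow ODE from t = T down to t = 0 with Euler steps."""
    x = np.asarray(x_T, dtype=float).copy()
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * S * score(x, t)  # f = 0, so the drift is -(1/2) g^2 * score
        x = x - drift * dt              # step backward in time
    return x


if __name__ == "__main__":
    x_T = np.array([1.0, -2.0, 0.5])
    exact = x_T / np.sqrt(1.0 + S)
    for n_steps in (10, 1000):
        err = np.abs(probability_flow_euler(x_T, n_steps) - exact).max()
        print(f"{n_steps:5d} steps, max abs error = {err:.4f}")
```

Fewer steps mean fewer score evaluations and faster sampling, but a larger deviation from the exact ODE trajectory, which is one way to read the loss in sample quality mentioned above.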

The second challenge can be addressed by learning an autoencoder on discrete data and performing score-based generative modeling on its continuous latent space. Jascha’s original work on diffusion models also provides a discrete diffusion process for discrete data distributions, but its potential for large-scale applications has yet to be proven.

It is my prediction that these challenges will soon be solved with the joint efforts of the research community, and that score-based generative models and diffusion-based models will become one of the most useful tools for data generation, density estimation, inverse problem solving, and many other downstream tasks in machine learning.

Hereafter we only consider probability density functions. Probability mass functions are similar.
Commonly used score matching methods include denoising score matching and sliced score matching. Here is an introduction to score matching and sliced score matching.
It is typical for normalizing flow models to convert discrete images to continuous ones by adding small uniform noise to them.
It is needless to say that the significance of score-based generative models/diffusion probabilistic models is in no way comparable to quantum mechanics.
The neural autoregressive distribution estimator
Larochelle, H. and Murray, I., 2011. International Conference on Artificial Intelligence and Statistics, pp. 29--37.
Made: Masked autoencoder for distribution estimation
Germain, M., Gregor, K., Murray, I. and Larochelle, H., 2015. International Conference on Machine Learning, pp. 881--889.
Pixel recurrent neural networks
Van Oord, A., Kalchbrenner, N. and Kavukcuoglu, K., 2016. International Conference on Machine Learning, pp. 1747--1756.
NICE: Non-linear independent components estimation
Dinh, L., Krueger, D. and Bengio, Y., 2014. arXiv preprint arXiv:1410.8516.
Density estimation using Real NVP
Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2017. International Conference on Learning Representations.
A tutorial on energy-based learning
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. and Huang, F., 2006. Predicting structured data, Vol 1(0).
How to Train Your Energy-Based Models
Song, Y. and Kingma, D.P., 2021. arXiv preprint arXiv:2101.03288.
Auto-encoding variational bayes
Kingma, D.P. and Welling, M., 2014. International Conference on Learning Representations.
Stochastic backpropagation and approximate inference in deep generative models
Rezende, D.J., Mohamed, S. and Wierstra, D., 2014. International conference on machine learning, pp. 1278--1286.
Learning in implicit generative models
Mohamed, S. and Lakshminarayanan, B., 2016. arXiv preprint arXiv:1610.03483.
Generative adversarial nets
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., 2014. Advances in neural information processing systems, pp. 2672--2680.
Improved techniques for training gans
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X., 2016. Advances in Neural Information Processing Systems, pp. 2226--2234.
Unrolled Generative Adversarial Networks
Metz, L., Poole, B., Pfau, D. and Sohl-Dickstein, J., 2017. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
A kernelized Stein discrepancy for goodness-of-fit tests
Liu, Q., Lee, J. and Jordan, M., 2016. International conference on machine learning, pp. 276--284.
Estimation of non-normalized statistical models by score matching
Hyvarinen, A., 2005. Journal of Machine Learning Research, Vol 6(Apr), pp. 695--709.
A connection between score matching and denoising autoencoders
Vincent, P., 2011. Neural computation, Vol 23(7), pp. 1661--1674. MIT Press.
Generative Modeling by Estimating Gradients of the Data Distribution
Song, Y. and Ermon, S., 2019. Advances in Neural Information Processing Systems, pp. 11895--11907.
Improved Techniques for Training Score-Based Generative Models
Song, Y. and Ermon, S., 2020. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Denoising diffusion probabilistic models
Ho, J., Jain, A. and Abbeel, P., 2020. arXiv preprint arXiv:2006.11239.
Score-Based Generative Modeling through Stochastic Differential Equations
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. and Poole, B., 2021. International Conference on Learning Representations.
WaveGrad: Estimating Gradients for Waveform Generation
Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M. and Chan, W., 2021. International Conference on Learning Representations.
DiffWave: A Versatile Diffusion Model for Audio Synthesis
Kong, Z., Ping, W., Huang, J., Zhao, K. and Catanzaro, B., 2021. International Conference on Learning Representations.
Learning Gradient Fields for Shape Generation
Cai, R., Yang, G., Averbuch-Elor, H., Hao, Z., Belongie, S., Snavely, N. and Hariharan, B., 2020. Proceedings of the European Conference on Computer Vision (ECCV).
Symbolic Music Generation with Diffusion Models
Mittal, G., Engel, J., Hawthorne, C. and Simon, I., 2021. arXiv preprint arXiv:2103.16091.
A review on deep learning in medical image reconstruction
Zhang, H. and Dong, B., 2020. Journal of the Operations Research Society of China, pp. 1--30. Springer.
Training products of experts by minimizing contrastive divergence
Hinton, G.E., 2002. Neural computation, Vol 14(8), pp. 1771--1800. MIT Press.
Sliced score matching: A scalable approach to density and score estimation
Song, Y., Garg, S., Shi, J. and Ermon, S., 2020. Uncertainty in Artificial Intelligence, pp. 574--584.
Correlation functions and computer simulations
Parisi, G., 1981. Nuclear Physics B, Vol 180(3), pp. 378--384. Elsevier.
Representations of knowledge in complex systems
Grenander, U. and Miller, M.I., 1994. Journal of the Royal Statistical Society: Series B (Methodological), Vol 56(4), pp. 549--581. Wiley Online Library.
Adversarial score matching and improved sampling for image generation
Jolicoeur-Martineau, A., Piche-Taillefer, R., Mitliagkas, I. and Combes, R.T.d., 2021. International Conference on Learning Representations.
Reverse-time diffusion equation models
Anderson, B.D., 1982. Stochastic Processes and their Applications, Vol 12(3), pp. 313--326. Elsevier.
On Maximum Likelihood Training of Score-Based Generative Models
Durkan, C. and Song, Y., 2021. arXiv preprint arXiv:2101.09258.
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. and Hochreiter, S., 2017. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 6626--6637.
Training Generative Adversarial Networks with Limited Data
Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J. and Aila, T., 2020. Proc. NeurIPS.
Neural Ordinary Differential Equations
Chen, T.Q., Rubanova, Y., Bettencourt, J. and Duvenaud, D., 2018. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6572--6583.
Scalable Reversible Generative Models with Free-form Continuous Dynamics
Grathwohl, W., Chen, R.T.Q., Bettencourt, J. and Duvenaud, D., 2019. International Conference on Learning Representations.
On linear identifiability of learned representations
Roeder, G., Metz, L. and Kingma, D.P., 2020. arXiv preprint arXiv:2007.00810.
Solving linear inverse problems using the prior implicit in a denoiser
Kadkhodaie, Z. and Simoncelli, E.P., 2020. arXiv preprint arXiv:2007.13640.
Deep unsupervised learning using nonequilibrium thermodynamics
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S., 2015. International Conference on Machine Learning, pp. 2256--2265.
Denoising Diffusion Implicit Models
Song, J., Meng, C. and Ermon, S., 2021. International Conference on Learning Representations.
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
Luhman, E. and Luhman, T., 2021. arXiv e-prints, pp. arXiv--2101.