By Jaewon Kim in ai — Jul 4, 2024

[CVPR 2022] Diffusion Tutorials (DDIM, Score-based Models)

본 글은 위의 영상 내용을 기반으로 공부한 것을 기록한 것입니다.

기존의 "CVPR 2022 Diffusion Tutorial" 에 관한 공식문서는 아래에서 확인할 수 있다.

1. Denoising Diffusion Implicit Models (DDIM)

DDIM에서는 DDPM에서의 Markovian Process를 Non-Markovian Process로 Modeling하는 시도를 수행했다.

이때, DDIM은 Pre-trained Diffusion Model을 사용하되 DDPM에서의 Loss도 그대로 사용하여 Sampling의 방식만 변경했다는 장점을 가지고 있다.

DDIM에서는 아래와 같이 식이 나오는데, 그 이유를 살펴보자.

$$q_\sigma(x_{1:T}|x_0) := q_\sigma(x_T|x_0)\prod_{t=2}^T q_\sigma(x_{t-1}|x_t, x_0)$$

DDPM에서는 아래와 같은 식이 성립한다는 것을 안다:

$$\quad q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0)=q_\sigma(\mathbf{x}_1|\mathbf{x}_{0}) \prod _{t=2}^T q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_{0})$$

$$= q(\mathbf{x}_1|\mathbf{x}_0)q(\mathbf{x}_2|\mathbf{x}_1, \mathbf{x}_0)q(\mathbf{x}_3|\mathbf{x}_2, \mathbf{x}_0)...q(\mathbf{x}_T|\mathbf{x}_{T-1}, \mathbf{x}_0)$$

위의 식에 아래의 식을 이용하여 Tele-Scoping을 진행한다.

$$q_\sigma(x_t|x_{t-1}, x_0) = \frac{q_\sigma(x_{t-1}|x_t, x_0)q_\sigma(x_t|x_0)}{q_\sigma(x_{t-1}|x_0)}$$

$$q_\sigma(x_{t-1}|x_t, x_0)$$

위의 항은 고정되고

$$\frac{q_\sigma(x_t|x_0)}{q_\sigma(x_{t-1}|x_0)}$$

항이 $t=2$부터 $t=T$까지 Telescoping 되어 사라진다.

왜 아래의 식이 성립하는가? $$q_\sigma(x_t|x_{t-1}, x_0) = \frac{q_\sigma(x_{t-1}|x_t, x_0)q_\sigma(x_t|x_0)}{q_\sigma(x_{t-1}|x_0)}$$

먼저, 베이즈 정리를 적용한다: $$q_\sigma(x_{t-1}|x_t, x_0) = \frac{q_\sigma(x_t, x_{t-1}, x_0)}{q_\sigma(x_t, x_0)}$$
분모와 분자 모두를 $q_\sigma(x_0)$로 나눈다: $$\frac{q_\sigma(x_t, x_{t-1}, x_0) / q_\sigma(x_0)}{q_\sigma(x_{t-1}| x_0)} $$
분자에 $q_\sigma(x_t, x_0)$를 곱하고 나눈다: $$\frac{\frac{q_\sigma(x_t, x_{t-1}, x_0)}{q_\sigma(x_t, x_0)} \frac {q_\sigma(x_t, x_0)}{q_\sigma(x_0)}}{q_\sigma(x_{t-1}| x_0)} $$
최종적으로 다음과 같이 정리된다: $$\frac{q_\sigma(x_{t-1}|x_t, x_0)q_\sigma(x_t|x_0)}{q_\sigma(x_{t-1}|x_0)}$$

$$= q_{\sigma}(\mathbf{x}_{1}|\mathbf{x}_{0}) \frac {q_{\sigma}(\mathbf{x}_{T}|\mathbf{x}_{0})} {q_{\sigma}(\mathbf{x}_{1}|\mathbf{x}_{0})} \prod_{t=2}^T q_\sigma(x_{t-1}|x_t, x_0) $$

$$= q_{\sigma}(\mathbf{x}_{T}|\mathbf{x}_{0}) \prod_{t=2}^T q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) $$

즉, DDIM 에서는 Diffusion Process를 새롭게 다시 정의내린 것이다.

$q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$을 Re-Parameterization을 이용하여 구하기 위해서는 $\mathbf{x}_{t-1}$를 $\mathbf{x}_t, \mathbf{x}_0$에 대한 식으로 나타내면 된다.

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}x_0 + \sqrt{1 - \bar{\alpha}_{t-1}}z_{t-1}$$

여기서, DDPM에서 사용했던 "Merge two Gaussians"의 역과정을 수행하여

$\sqrt{1 - \bar{\alpha}_{t-1}}z_{t-1} = \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}z_t + \sigma_t z$ 으로 역변환한다.

$$= \sqrt{\bar{\alpha}_{t-1}}x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}z_t + \sigma_t z$$

$z_t$를 $x_t$와 $x_0$에 관한 식으로 정리하면 아래와 같이 식을 쓸 수 있다:

$x_{t} = \sqrt{\bar{\alpha}_{t}}x_0 + \sqrt{1 - \bar{\alpha}_{t}}z_{t}$ 해당 식을 $z_t$에 관해 정리하면 된다.

$$= \sqrt{\alpha_{t-1}}x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \frac{x_t - \sqrt{\alpha_t}x_0}{\sqrt{1 - \alpha_t}} + \sigma_t z$$

따라서, $q_\sigma(x_{t-1}|x_t, x_0)$는 Mean과 Variance가 각각

$$\sqrt{\alpha_{t-1}}x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\frac{x_t - \sqrt{\alpha_t}x_0}{\sqrt{1 - \alpha_t}}, \sigma_t^2\mathbf{I}$$

인 Gaussian Distribution을 따른다.

$$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \sqrt{\alpha_{t-1}}x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\frac{x_t - \sqrt{\alpha_t}x_0}{\sqrt{1 - \alpha_t}}, \sigma_t^2\mathbf{I})$$

따라서, 아래와 같이 정리가능하다.

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t\mathbf{I})$$

$$\tilde{\beta}_t = \sigma_t^2 = \frac{1 - \alpha_{t-1}}{1 - \alpha_t} \cdot \beta_t$$

이때, $\mathbf{x}_0$ 역시 마찬가지로 $\mathbf{x}_{t} = \sqrt{\bar{\alpha}_{t}}x_0 + \sqrt{1 - \bar{\alpha}_{t}}\varepsilon$ 의 식을 통해 $\mathbf{x}_t$에 대한 식으로 정리할 수 있다.

1) $\mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_0 + \sqrt{(1 - \alpha_t)} \epsilon$
2) $\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\epsilon}{\sqrt{\alpha_t}}$
3) $\epsilon_\theta^{(t)}(\mathbf{x}_t) = \frac{\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_0}{\sqrt{1 - \alpha_t}}$

이므로 아래의 식의 $\mathbf{x}_0$에 2)를 대입하고 $\frac{\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_0}{\sqrt{1 - \alpha_t}}$에 3)을 대입하자.

$$q_\sigma(\mathbf{x}_{t-1}|x_t, x_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\frac{\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_0}{\sqrt{1 - \alpha_t}}, \sigma_t^2\mathbf{I})$$

따라서 위의 식은 아래의 식과 같이 변형된다.

$$\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}} \left(\frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\epsilon_\theta^{(t)}(\mathbf{x}_t)}{\sqrt{\alpha_t}}\right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_\theta^{(t)}(\mathbf{x}_t) + \sigma_t\epsilon_t$$

결과적으로,

DDPM에서는 $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$에 Markov Chain과 Baye's Rule을 이용해서 Posterior Distribution의 $q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_{0})$의 Mean, Variance를 구했으나

DDIM에서는 $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_{0})$에서 바로 이전 state와 최초 state를 이용하여 현재 state를 정의내림으로써, $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_{0})$를 바로 Gaussian Distribution으로 구할 수 있다.

만약 $\tilde{\epsilon}_{t} = 0, \forall t$ 라면 이는 Deterministic Generative Process가 되고, 오직 $t=T$에서 randomness를 가진다.

즉, randomness를 추가해주는 확률변수가 없기 때문 Noise - Image가 1:1 Matching되는 현상이 일어난다.

이러한 Deterministic Process는 Faster Sampling이 가능하도록 해준다.

2. Score-based Generative Modeling with Differential Equations

Score-Based Generative Modeling은 Diffusion Model과 수학적으로 같다.

해당 Gaussian Noise에서 Sampling 하는 식으로 표현을 해보면 아래와 같다:

$$q(x_t | x_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}) \Rightarrow \mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathcal{N}(0, \mathbf{I})$$

$$x_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathcal{N}(0,\mathbf{I})$$

$(\beta_t := \beta(t)\Delta t)$ Taylor expansion에 의해 위의 식을 다음과 같이 작성할 수 있다.

$$\mathbf{x}_t = \sqrt{1-\beta(t)\Delta t}\mathbf{x}_{t-1} + \sqrt{\beta(t)\Delta t}\mathcal{N}(0,\mathbf{I})$$

Taylor Expansion에 의해 1차항까지만 계산하면 아래와 같이 $\sqrt{ }$가 풀리게 된다.

$$\mathbf{x}_t \approx \mathbf{x}_{t-1} - \frac{\beta(t)\Delta t}{2}\mathbf{x}_{t-1} + \sqrt{\beta(t)\Delta t}\mathcal{N}(0,\mathbf{I})$$

$\mathbf{x}_t \approx \mathbf{x}_{t-1} - \frac{\beta(t)\Delta t}{2}\mathbf{x}_{t-1} + \sqrt{\beta(t)\Delta t}\mathcal{N}(0, \mathbf{I})$

이때, 위의 식은 아래와 같이 Stochastic Differential Equation (SDE)의 형태로 나타낼 수 있다.

Stochastic한 특성을 지닌 DE를 SDE라 하는데, 분자의 운동(Brown 운동)을 해석하기 위해 만들어진 식 중 하나이다.

$$d\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t dt + \sqrt{\beta(t)} d\omega_t$$

$ -\frac{1}{2}\beta(t)\mathbf{x}_t dt$: Drift Term (방향성)
$\sqrt{\beta(t)} d\omega_t$: Diffusion Term (Brown 운동이 가지는 무작위성)

Drift, Diffusion Term을 다음과 같이 $f(t)$와 $g(t)$로 정한다면 General한 SDE는 다음과 같이 표현가능하다.

$$d\mathbf{x}_t =f(t) \mathbf{x}_t dt + g(t) d\omega_t$$

DPPM과 같은 경우에는 위 SDE의 Special Case라 생각하면 된다.

아래와 같이 Forward SDE를 정의했다면, 해당 논문에서는 Reverse SDE를 어떻게 정의해야 하는지에 관해서 다룬다.

$d\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t dt + \sqrt{\beta(t)} d\omega_t$: Forward Diffusion SDE
$d\mathbf{x}_t = \left[-\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)\right] dt + \sqrt{\beta(t)} d\bar{\omega}_t$ : Reverse Generative Diffusion SDE

수식적으로 1:1 Matching이 되도록 정의했고, Drift Term 안에 $\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)$ Gradient가 곱해지는 형태가 된다.

결국 원하는 것은 Generative Image이기 때문에, Reverse DIffusion SDE를 풀어야 한다. 다만, 이 때 빨간색 부분에 해당하는 Score Function인 $\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)$를 어떻게 구해야 하는지에 관해 문제가 된다.

$\beta(t)$와 같은 Noise Injection Parameter는 사전에 미리 정의할 수 있기 때문에, Score Function만 알 수 있다면 Reverse Diffusion SDE를 확정할 수 있다.

Naive한 방법은 아래와 같이 Neural Network를 사용해서 Score Function $\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)$을 학습하는 방법인데, 이는 $\log q_t(\mathbf{x}_t)$가 untractable 하기에 구할 수 없다.

Naive Idea:

$$\min_{\theta} \mathbb{E}_{t\sim\mathcal{U}(0,T)}\mathbb{E}_{\mathbf{x}_t\sim q_t(\mathbf{x}_t)}|\mathbf{s}_\theta(\mathbf{x}t,t) - \nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)|_2^2$$

다만 특정 time step $t$에서의 image $\mathbf{x}_t$를 $\mathbf{x}_0$를 통해 구할 수 있기 때문에 $q_t(\mathbf{x}_t|\mathbf{x}_0)$는 tractable하다.

Denoising Score Matching:

$$\min_{\theta} \mathbb{E}_{t\sim\mathcal{U}(0,T)}\mathbb{E}_{\mathbf{x}_0\sim q_0(\mathbf{x}_0)}\mathbb{E}_{\mathbf{x}_t\sim q_t(\mathbf{x}_t|\mathbf{x}_0)}|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t|\mathbf{x}_0)|_2^2$$

따라서, Score Matching을 $q_t(\mathbf{x}_t|\mathbf{x}_0)$를 통해 진행할 수 있게 된다.

아래 3개의 식을 통해 정리하면, 다음과 같은 최종 식을 얻을 수 있다.

Variance-Preserving SDE

$d\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t dt + \sqrt{\beta(t)} d\omega_t$
$q_t(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \gamma_t\mathbf{x}_0, \sigma_t^2\mathbf{I})$
$\gamma_t = e^{-\frac{1}{2}\int_0^t \beta(s)ds}$
$\sigma_t^2 = 1 - e^{-\int_0^t \beta(s)ds}$

Re-parametrized Sampling

$$\mathbf{x}_t = \gamma_t\mathbf{x}_0 + \sigma_t\epsilon \quad \epsilon \sim \mathcal{N}(0,\mathbf{I})$$

Score Function

$$\nabla_{\mathbf{x}_t} \log q_t(\mathbf{x}_t|\mathbf{x}_0) = -\nabla{\mathbf{x}_t} \frac{(\mathbf{x}_t - \gamma_t\mathbf{x}_0)^2}{2\sigma_t^2} = -\frac{\mathbf{x}_t - \gamma_t\mathbf{x}_0}{\sigma_t^2} = -\frac{\gamma_t\mathbf{x}_0 + \sigma_t\epsilon - \gamma_t\mathbf{x}_0}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}$$

Neural Network Model

$$\mathbf{s}_\theta(\mathbf{x}_t,t) := -\frac{\epsilon\theta(\mathbf{x}_t,t)}{\sigma_t}$$

Final Equation:

$$\min_{\theta} \mathbb{E}_{t\sim\mathcal{U}(0,T)}\mathbb{E}_{\mathbf{x}_0\sim q_0(\mathbf{x}_0)}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\mathbf{I})}\frac{1}{\sigma_t^2}|\epsilon - \epsilon\theta(\mathbf{x}_t,t)|_2^2$$

결과적으로는 Score-based Model과 DDPM은 정의되어 있는 식들이 다르지만(둘 다 General SDE의 구조에 해당되기는 한다), Training 했던 Loss Function은 동일하다는 것을 알 수 있다.

이 때, DDPM에서 DDIM으로 넘어갈 때 Random을 제거하면서 Deterministic하게 만들 수 있는데, 이는 Score-based Model에서도 Probabilistic Flow ODE를 통해 동일하게 만들 수 있다.

Reverse Generative Diffusion ODE: $$d\mathbf{x}_t = \left[-\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)\right] dt + \sqrt{\beta(t)} d\bar{\omega}_t$$

Probability Flow ODE: $$d\mathbf{x}_t = \left[-\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla{\mathbf{x}_t} \log q_t(\mathbf{x}_t)\right] dt$$

이는 Stochastic한 값들이 들어있는 SDE를 Ordinary Differential Equation인 ODE로 변형할 수 있다는 뜻이다.

Generative Diffusion SDE
- SDE의 경우 Data Distribution이 Gaussian Noise로 가지만, 가는 과정에서 Stochastic한 process가 진행되는 것을 볼 수 있다.
- $d\mathbf{x}_t = -\frac{1}{2}\beta(t)[\mathbf{x}_t + 2s_\theta(\mathbf{x}_t, t)]dt + \sqrt{\beta(t)}d\bar{\omega}_t$

Sampling Method(Euler - Maruyama)

$$\mathbf{x}_{t-1} = \mathbf{x}_t + \frac{1}{2}\beta(t)[\mathbf{x}_t + 2s_\theta(\mathbf{x}_t, t)]\Delta t + \sqrt{\beta(t)\Delta t}\mathcal{N}(0, \mathbf{I})$$

Probability Flow ODE
- ODE의 경우 1:1 Matching이 되는 모습을 볼 수 있다.
- $d\mathbf{x}_t = -\frac{1}{2}\beta(t)[\mathbf{x}_t + s_\theta(\mathbf{x}_t, t)]dt$

Sampling Method (Euler's Method)

$$\mathbf{x}_{t-1} = \mathbf{x}_t + \frac{1}{2}\beta(t)[\mathbf{x}_t + s_\theta(\mathbf{x}_t, t)]\Delta t$$

따라서, SDE와 ODE를 어떻게 하면 더 잘 풀 수 있을지에 대해 여러 논문들이 있고, 이러한 논문들은 거의 DDPM과 동일하다고 볼 수 있다.

정리하자면, SDE의 Special Case로 DDPM과 Score-base Model이 있고 둘은 정의된 식 구조가 다를 뿐 근본적으로는 동일하다.

3. Conditional Diffusion Models

주어진 Image를 Generate할 때, (Text) Guide를 준다든가, 특정 Class Label만 뽑고 싶다든가, Inpainting을 하는 등의 Condition을 주어서 Sampling을 하고 싶을 수 있다.

Control Signal $y$를 입력으로 주어서 어떠한 Controllable한 Generation을 하고자 할 때,

Conditional Reverse SDE 식을 다음과 같이 쓸 수 있다:

$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g^2(t)\nabla_\mathbf{x} \log p_t(\mathbf{x} | \mathbf{y})]dt + g(t)d\mathbf{w}$$

그런데 이 때, $\nabla_\mathbf{x} \log p_t(\mathbf{x} | \mathbf{y})$ 부분이 untractable하기에 Bayes' Rule을 이용하여 다음과 같이 나누어서 학습할 수 있다.

$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g^2(t)\nabla_\mathbf{x} \log p_t(\mathbf{x}) - g^2(t)\nabla_\mathbf{x} \log p_t(\mathbf{y} | \mathbf{x})]dt + g(t)d\mathbf{w}$$

log function 이므로 위와 같이 2개의 항으로 분리가능하다.

$\nabla_\mathbf{x} \log p_t(\mathbf{x})$: Unconditional Score Function (Pre-trained Model)
$\nabla_\mathbf{x} \log p_t(\mathbf{y} | \mathbf{x})$: Score Function과 별개로 학습가능

$\nabla_\mathbf{x} \log p_t(\mathbf{y} | \mathbf{x})$은 $p_t(\mathbf{y} | \mathbf{x})$을 구해야 하므로, Time-Dependent Classifier으로 볼 수 있다.

이를 Classifier Guidance라 하는데, 수식적으로 표현을 해보면 아래와 같다.

Main Idea는 추가적으로 Classifier의 Gradient 값을 더해줘서 sampling을 하면 해당 class의 이미지를 뽑아낼 수 있다는 것이다.

Sample with a modified score:

$$\nabla_{\mathbf{x}_t}[\log p(\mathbf{x}_t|\mathbf{c}) + \omega \log p(\mathbf{c}|\mathbf{x}_t)] $$

Approximate samples from the distribution:

$$\tilde{p}(\mathbf{x}_t|\mathbf{c}) \propto p(\mathbf{x}_t|\mathbf{c})p(\mathbf{c}|\mathbf{x}_t)^\omega$$