I am new to Data Science SE, so I hope this is the right place to ask this rather theoretical question.
In diffusion models we usually have a time variable that determines the noise schedule (e.g. $t \in \{0, \dots, 4000\}$). For training we sample a random $t$ from this range, compute the noisy image at time $t$, and feed both (the noisy image and the time) to the neural network, which predicts the noise.
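For concreteness, here is a minimal NumPy sketch of the training setup as I understand it (the linear $\beta$ schedule is just an assumption for illustration, not taken from any particular paper):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative noise level for each t

def make_training_example(x0, rng=np.random.default_rng(0)):
    t = rng.integers(0, T)                # sample a random timestep
    eps = rng.standard_normal(x0.shape)   # Gaussian noise
    # closed-form forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, t, eps                    # the network is fed (x_t, t) and predicts eps
```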
So far so good… However, when reading deeper into this topic I saw that many people did not feed the time $t$ to the neural network; instead they conditioned the model directly on the noise level $$\bar\alpha_t = \prod_{t'=1}^{t} (1-\beta_{t'}).$$ This is illustrated, for example, in this example from the Keras website, and in more detail in this paper.
On the Keras page it is mentioned that
Diffusion models embed the index of the timestep of the diffusion process instead of the noise variance, while score-based models usually use some function of the noise level. I prefer the latter so that we can change the sampling schedule at inference time, without retraining the network
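As I understand it, the difference is only which quantity gets embedded as the conditioning input; roughly like this (the linear $\beta$ schedule and the choice of $\sqrt{1-\bar\alpha_t}$ as the "noise level" are my own assumptions, just to illustrate the idea):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def timestep_conditioning(t):
    # DDPM-style: embed the integer timestep index itself
    return t

def noise_level_conditioning(t):
    # Keras-example-style: embed some function of the noise level instead,
    # here sqrt(1 - alpha_bar_t) as one possible choice
    return np.sqrt(1.0 - alphas_bar[t])
```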
My question is:
I am confused by the sentence
so that we can change the sampling schedule at inference time
(i.e. we are not bound to go through all 4000 steps in reverse, but can use e.g. only 200 reverse steps).
Why is it only possible to change the sampling schedule at inference time when we condition directly on $\bar\alpha_t$? We could, for example, condition the model on $t \in \{0, \dots, 4000\}$ and at inference (during the backward process) simply skip, say, every second time step. Since there is a one-to-one correspondence between $t$ and $\bar\alpha_t$, shouldn't this procedure be equivalent? Of course, the quality of the resulting image may be reduced.
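To make my point concrete: since every $t$ maps to exactly one $\bar\alpha_t$, I would have thought a coarser reverse schedule is just a matter of indexing into the same table, roughly like this (again only a sketch with an assumed linear $\beta$ schedule):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

# Use only 200 of the 4000 reverse steps, e.g. an evenly spaced subset
coarse_ts = np.linspace(T - 1, 0, 200).round().astype(int)

# Every coarse timestep still has exactly one corresponding noise level,
# so (I thought) a t-conditioned model could be queried on this subset too
coarse_alphas_bar = alphas_bar[coarse_ts]
print(coarse_ts[:5], coarse_alphas_bar[:5])
```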
In short: I do not understand why we go through this extra step and condition the diffusion model on $\bar\alpha_t$ instead of just $t$.
Thank you for your answers!