I am new to Data Science SE, so I hope this is the right place to ask this rather theoretical question.
In diffusion models we usually have a time variable that determines the noise schedule (e.g. $t \in \{0, \dots, 4000\}$). For training we sample a random $t$ from this range, compute the noisy image at time $t$, and feed both (the noisy image and the time) to the neural network, which predicts the noise.
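For concreteness, here is a minimal NumPy sketch of the training setup as I understand it (the linear $\beta$ schedule is just an assumption for illustration, not taken from any particular paper):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative noise level for each t

def make_training_example(x0, rng=np.random.default_rng(0)):
    t = rng.integers(0, T)                # sample a random timestep
    eps = rng.standard_normal(x0.shape)   # Gaussian noise
    # closed-form forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, t, eps                    # the network is fed (x_t, t) and predicts eps
```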
So far so good… However, when reading deeper into this topic I saw that many people did not feed the time $t$ to the neural network; instead they conditioned the model directly on the noise level $$\bar\alpha_t = \prod_{t'=1}^{t} (1-\beta_{t'}).$$ This is illustrated, for example, in this example from the Keras website, and in more detail in this paper.
On the Keras page it is mentioned that
Diffusion models embed the index of the timestep of the diffusion process instead of the noise variance, while score-based models usually use some function of the noise level. I prefer the latter so that we can change the sampling schedule at inference time, without retraining the network
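As I understand it, the difference is only which quantity gets embedded as the conditioning input; roughly like this (the linear $\beta$ schedule and the choice of $\sqrt{1-\bar\alpha_t}$ as the "noise level" are my own assumptions, just to illustrate the idea):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def timestep_conditioning(t):
    # DDPM-style: embed the integer timestep index itself
    return t

def noise_level_conditioning(t):
    # Keras-example-style: embed some function of the noise level instead,
    # here sqrt(1 - alpha_bar_t) as one possible choice
    return np.sqrt(1.0 - alphas_bar[t])
```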
My question is:
I am confused by the sentence
so that we can change the sampling schedule at inference time
(i.e. we are not bound to go through all 4000 steps in reverse, but can use e.g. only 200 reverse steps).
Why is it only possible to change the sampling schedule at inference time when we condition directly on $\bar\alpha_t$? We could, for example, condition the model on $t \in \{0, \dots, 4000\}$ and at inference (during the backward process) simply skip, say, every second time step. Since there is a one-to-one correspondence between $t$ and $\bar\alpha_t$, shouldn't this procedure be equivalent? Of course, the quality of the resulting image may be reduced.
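To make my point concrete: since every $t$ maps to exactly one $\bar\alpha_t$, I would have thought a coarser reverse schedule is just a matter of indexing into the same table, roughly like this (again only a sketch with an assumed linear $\beta$ schedule):

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

# Use only 200 of the 4000 reverse steps, e.g. an evenly spaced subset
coarse_ts = np.linspace(T - 1, 0, 200).round().astype(int)

# Every coarse timestep still has exactly one corresponding noise level,
# so (I thought) a t-conditioned model could be queried on this subset too
coarse_alphas_bar = alphas_bar[coarse_ts]
print(coarse_ts[:5], coarse_alphas_bar[:5])
```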
In short: I do not understand why we go through this extra step and condition the diffusion model on $\bar\alpha_t$ instead of just $t$.
Thank you for your answers!