In the paper Video Diffusion Models, Section 3.1 states the following equation: $$ E_q[x^b|z_t,x^a] = E_q[x^b|z_t] + \frac{\sigma_t^2}{\alpha_t}\nabla_{z_t^b}\log q(x^a|z_t)$$ where $x^a, x^b$ are two video samples and $q$ is the forward noising process given by: $$q(z_t|x)=\mathcal{N}(z_t;\alpha_t x, \sigma_t^2 I)$$
Can someone explain how the expectation equation is derived?
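
For reference, here is a sketch of one route that seems to lead to the equation. It is not taken from the paper and rests on assumptions: that $z_t = (z_t^a, z_t^b)$ is the noised version of the joint sample $(x^a, x^b)$ with independent Gaussian noise on each part, and that Tweedie's formula applies to $q$ both with and without conditioning on $x^a$. Under those assumptions, Tweedie's formula gives the first two lines, and Bayes' rule, $q(z_t|x^a) = q(x^a|z_t)\,q(z_t)/q(x^a)$, gives the third, since $q(x^a)$ does not depend on $z_t$:

$$\begin{aligned}
E_q[x^b|z_t] &= \frac{1}{\alpha_t}\left(z_t^b + \sigma_t^2\,\nabla_{z_t^b}\log q(z_t)\right) \\
E_q[x^b|z_t,x^a] &= \frac{1}{\alpha_t}\left(z_t^b + \sigma_t^2\,\nabla_{z_t^b}\log q(z_t|x^a)\right) \\
\nabla_{z_t^b}\log q(z_t|x^a) &= \nabla_{z_t^b}\log q(z_t) + \nabla_{z_t^b}\log q(x^a|z_t)
\end{aligned}$$

Substituting the third line into the second and comparing with the first would give $E_q[x^b|z_t,x^a] = E_q[x^b|z_t] + \frac{\sigma_t^2}{\alpha_t}\nabla_{z_t^b}\log q(x^a|z_t)$. What I cannot confirm is whether this is the argument the authors intended.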