I have been trying to understand the proof of the adjoint sensitivity method for calculating the gradient $dJ/d\theta$ of a loss functional, \begin{align*} J(\theta) &= L(x(T))+\int_{0}^{T}\ell(x(t))\,dt \\ \text{s.t.}\quad &\dot{x}(t) = f(x(t),\theta), \quad x(0)=x_0. \end{align*} Using the adjoint method we get \begin{align*} \dfrac{dJ}{d\theta} &= \int_{0}^{T} \lambda^\top f_\theta\,dt,\\ \text{where }\,\dot{\lambda}^\top &= -\ell_x - \lambda^\top f_x,\quad \lambda(T)^\top=L'(x(T)). \end{align*}
The proof I usually encounter introduces the Lagrange multiplier $\lambda$, \begin{equation} \mathcal{L} = L\left(x(T)\right)+\int_{0}^{T}\ell(x(t))\,dt - \int_{0}^{T}\lambda(t)^\top\left(\dot{x}(t) - f(x(t),\theta)\right)dt, \end{equation} and then argues as follows: \begin{align*} \dfrac{dJ}{d\theta} &= \dfrac{d\mathcal{L}}{d\theta} \stackrel{(i)}{=} \dfrac{dL}{d\theta} - \dfrac{d}{d\theta}\left[\lambda^\top x\Big\rvert_{0}^{T}\right]+ \dfrac{d}{d\theta}\int_{0}^{T}\left[\ell+\dot{\lambda}^\top x + \lambda^\top f \right]dt\\ &\stackrel{(ii)}{=} \left[ L'(x(T))-\lambda(T)^\top\right]x_\theta(T)\\ &+ \int_{0}^{T}\left[\ell_x+\dot{\lambda}^\top + \lambda^\top f_x \right]x_\theta\, dt + \int_{0}^{T} \lambda^\top f_\theta \,dt. \end{align*} Here, in $(i)$ we integrated by parts, and in $(ii)$ we rearranged terms and used the fact that $x_\theta(0)=0$ (since $x(0)=x_0$ does not depend on $\theta$). Now, if $\lambda$ satisfies the adjoint equation, the bracketed terms vanish and the result follows.
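For reference, the stated formulas can at least be verified numerically. Below is a minimal sanity check on a scalar toy problem of my own choosing ($f(x,\theta)=\theta x$, $L(x)=\ell(x)=x^2/2$); the step sizes and helper names (`rk4_step`, `solve_forward`, `adjoint_grad`) are likewise my own, not from any library.

```python
import numpy as np

# Toy problem (my own choice, not from the derivation above):
#     dx/dt = f(x, theta) = theta * x,   x(0) = x0
#     J(theta) = L(x(T)) + int_0^T ell(x) dt,  with L(x) = ell(x) = x^2 / 2
# Scalar adjoint system:
#     dlambda/dt = -ell'(x) - f_x * lambda = -x - theta * lambda,
#     lambda(T)  = L'(x(T)) = x(T)
# Gradient:  dJ/dtheta = int_0^T lambda * f_theta dt = int_0^T lambda * x dt

x0, theta, T, n = 1.2, 0.7, 1.0, 4000
ts = np.linspace(0.0, T, n + 1)
h = T / n

def rk4_step(g, t, y, dt):
    """One classical Runge-Kutta step for dy/dt = g(t, y)."""
    k1 = g(t, y)
    k2 = g(t + dt / 2, y + dt / 2 * k1)
    k3 = g(t + dt / 2, y + dt / 2 * k2)
    k4 = g(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def trapezoid(vals):
    """Trapezoidal quadrature on the uniform grid ts."""
    return h * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1])

def solve_forward(th):
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = rk4_step(lambda t, y: th * y, ts[i], x[i], h)
    return x

def loss(th):
    x = solve_forward(th)
    return 0.5 * x[-1] ** 2 + trapezoid(0.5 * x ** 2)

def adjoint_grad(th):
    x = solve_forward(th)
    x_of = lambda t: np.interp(t, ts, x)        # x(t) for the backward pass
    lam = np.empty(n + 1)
    lam[-1] = x[-1]                             # lambda(T) = L'(x(T))
    rhs = lambda t, l: -x_of(t) - th * l        # adjoint ODE right-hand side
    for i in range(n, 0, -1):                   # integrate backward in time
        lam[i - 1] = rk4_step(rhs, ts[i], lam[i], -h)
    return trapezoid(lam * x)                   # int_0^T lambda * f_theta dt

eps = 1e-6
fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # central difference
ad = adjoint_grad(theta)
print(f"adjoint: {ad:.8f}  finite difference: {fd:.8f}")
```

The two numbers agree to several digits, which at least confirms the sign conventions in the equations above.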
My questions are:
- To me the introduction of $\lambda$ looks a bit "magic" here. I understand that at an optimum we may use Lagrangian multipliers and then have a condition $d\mathcal{L}/d\theta = 0$. But here we are not at an optimum in general. Is there a better explanation for this trick?
- Why is it OK to treat $\lambda$ as independent of $\theta$ when we take the derivative with respect to $\theta$?
- Is there a more intuitive proof using functional derivatives instead?