
I have been trying to understand the proof of the adjoint sensitivity method for calculating the gradient $dJ/d\theta$ of a loss functional, \begin{align*} J(\theta) &= L(x(T))+\int_{0}^{T}\ell(x(t))dt \\ \text{s.t.}\quad &\dot{x}(t) = f(x(t),\theta), \quad x(0)=x_0. \end{align*} Using the adjoint method we get, \begin{align*} \dfrac{dJ}{d\theta} &= \int_{0}^{T} \lambda^\top f_\theta\,dt\\ \text{where }\,\dot{\lambda}^\top &= -\ell_x - \lambda^\top f_x,\quad \lambda(T)^\top=L'(x(T)). \end{align*}
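
For concreteness, here is a minimal numerical sketch of the statement above on a toy scalar system of my own choosing ($f(x,\theta)=-\theta x$, $\ell(x)=L(x)=\tfrac12 x^2$), using SciPy's `solve_ivp` for the forward and the backward (adjoint) pass, with a central finite difference as a check:

```python
# Toy check of the adjoint formula above (my own example, not from a reference):
# f(x, theta) = -theta * x, l(x) = L(x) = x^2 / 2, so
# f_x = -theta, f_theta = -x, l_x = x, L'(x) = x.
from scipy.integrate import solve_ivp

theta, x0, T = 1.3, 2.0, 1.0

def forward(th):
    # State augmented with the running cost c(t) = int_0^t l(x) ds.
    rhs = lambda t, z: [-th * z[0], 0.5 * z[0] ** 2]
    return solve_ivp(rhs, (0.0, T), [x0, 0.0], dense_output=True,
                     rtol=1e-10, atol=1e-12)

sol = forward(theta)
xT = sol.y[0, -1]

def backward(t, z):
    lam, g = z
    x = sol.sol(t)[0]            # replay x(t) from the forward pass
    dlam = -x - (-theta) * lam   # lambda' = -l_x - f_x * lambda
    dg = lam * (-x)              # g'      = lambda * f_theta
    return [dlam, dg]

# Integrate from t = T down to t = 0 with lambda(T) = L'(x(T)) and g(T) = 0,
# so that g(0) = -int_0^T lambda * f_theta dt.
adj = solve_ivp(backward, (T, 0.0), [xT, 0.0], rtol=1e-10, atol=1e-12)
grad_adjoint = -adj.y[1, -1]

# Central finite-difference check of dJ/dtheta.
def J_of(th):
    s = forward(th)
    return 0.5 * s.y[0, -1] ** 2 + s.y[1, -1]

h = 1e-6
grad_fd = (J_of(theta + h) - J_of(theta - h)) / (2 * h)
print(grad_adjoint, grad_fd)  # should agree to ~1e-6 or better
```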

The proof I usually encounter introduces the Lagrange multiplier $\lambda$, \begin{equation} \mathcal{L} = L\left(x(T)\right)+\int_{0}^{T}\ell(x(t))dt - \int_{0}^{T}\lambda(t)^\top\left(\dot{x}(t) - f(x(t),\theta)\right)dt, \end{equation} and then argues as follows, \begin{align*} \dfrac{dJ}{d\theta} &= \dfrac{d\mathcal{L}}{d\theta} \stackrel{(i)}{=} \dfrac{dL}{d\theta} - \dfrac{d}{d\theta}\left[\lambda^\top x\Big\rvert_{0}^{T}\right]+ \dfrac{d}{d\theta}\int_{0}^{T}\left[\ell+\dot{\lambda}^\top x + \lambda^\top f \right]dt\\ &\stackrel{(ii)}{=} \left[ L'(x(T))-\lambda(T)^\top\right]x_\theta(T)\\ &+ \int_{0}^{T}\left[\ell_x+\dot{\lambda}^\top + \lambda^\top f_x \right]x_\theta\, dt + \int_{0}^{T} \lambda^\top f_\theta \,dt. \end{align*} Here, in $(i)$ we integrated by parts (spelled out below) and in $(ii)$ we rearranged terms and used the fact that $x_\theta(0)=0$. Now, if the adjoint equation is satisfied, the result holds.
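
Spelled out, step $(i)$ rests on integrating the term $\int_0^T \lambda^\top\dot{x}\,dt$ by parts, \begin{equation*} \int_{0}^{T}\lambda^\top \dot{x}\,dt = \lambda^\top x\Big\rvert_{0}^{T} - \int_{0}^{T}\dot{\lambda}^\top x\,dt, \end{equation*} which is where the boundary term and the $\dot{\lambda}^\top x$ inside the integral come from.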

My questions are:

  1. To me the introduction of $\lambda$ looks a bit "magic" here. I understand that at an optimum we may use Lagrange multipliers and then have a condition $d\mathcal{L}/d\theta = 0$. But here we are not at an optimum in general. Is there a better explanation for this trick?
  2. Why is it ok to assume that $\lambda$ doesn't depend on $\theta$ when we take the partial derivatives?
  3. Is there a more intuitive proof using functional derivatives instead?
andrschl

1 Answer

  1. I would say this is quite common. In static optimization you do the same thing with equality constraints. Effectively, the state dynamics are a constraint on the minimization of the cost functional, so to "get rid" of the equality constraint you adjoin it to the cost functional and derive the gradient from the augmented cost. The reason this also works away from the optimal solution is that the dynamic constraint evaluates to zero along every trajectory generated by the ODE, so the adjoined term vanishes identically, $$\mathcal{L}(\theta)-J(\theta)=-\int_{0}^{T}\lambda^\top\left(\dot{x}-f\right)dt=0\quad\text{for all }\theta,$$ and hence $$\frac{dJ}{d\theta}=\frac{d\mathcal{L}}{d\theta}$$ holds always.

  2. I would say this comes down to point 1. The total derivative of the Lagrange multiplier w.r.t. $\theta$ is in general not zero. But any $\theta$-dependence of $\lambda$ enters $d\mathcal{L}/d\theta$ multiplied by the constraint residual $\dot{x}-f$, which is identically zero along every trajectory of the dynamics, so that term vanishes. This is why we may treat $\lambda$ as independent of $\theta$ and work with partial derivatives.

  3. I don't know of a more intuitive derivation, but you can reach the same conclusion via the calculus of variations. Remember that your cost is a functional, i.e., a function of functions. Deriving the "gradient" (in this setting called the Gâteaux derivative) then involves varying all of your degrees of freedom, i.e., your state trajectory and your input trajectory. In other words, define the Gâteaux derivative as $$ \delta J(\boldsymbol{u},\boldsymbol{\omega})=\lim_{\eta\rightarrow 0}\frac{J(\boldsymbol{u}+\eta\boldsymbol{\omega})-J(\boldsymbol{u})}{\eta}, $$ where $\boldsymbol{\omega}$ is itself a function! You then use the same arguments as in your derivation: adjoin the state dynamics, integrate by parts, and set the difficult terms to zero by choosing the adjoint state $\boldsymbol{\lambda}$. In that process you need to evaluate carefully where the variations may differ from their nominal functions. For example, if $\boldsymbol{u}(t,\eta)=\boldsymbol{u}(t)+\eta\boldsymbol{\omega}(t)$, it is usually imposed that $\boldsymbol{\omega}(t_0)=\boldsymbol{0}$; whether $\boldsymbol{\omega}(t_f)$ is free depends on the problem, e.g., on whether it has terminal constraints. A small numerical illustration of this definition is sketched below.
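
For the sketch, take the toy functional $J[u]=\int_0^1 u(t)^2\,dt$ (my own choice, not tied to the question), whose first variation is $\delta J(u,\omega)=\int_0^1 2u\omega\,dt$; the difference quotient should converge to it as $\eta\to 0$:

```python
# A small numerical illustration of the Gateaux derivative defined above,
# on a toy functional: J[u] = int_0^1 u(t)^2 dt, whose first variation
# is dJ(u, w) = int_0^1 2*u*w dt.
import numpy as np

t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]

def integral(y):
    # Trapezoidal rule on the uniform grid t.
    return dt * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

u = np.sin(2 * np.pi * t)   # nominal trajectory
w = t * (1.0 - t)           # variation, with w(0) = w(1) = 0 as discussed

J = lambda v: integral(v ** 2)

eta = 1e-6
dJ_quotient = (J(u + eta * w) - J(u)) / eta   # difference quotient in eta
dJ_variation = integral(2 * u * w)            # analytic first variation

print(dJ_quotient, dJ_variation)  # agree up to O(eta)
```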

link