10

Prove that the Hessian matrix of a quadratic form $f(x)=x^TAx$ is $f^{\prime\prime}(x) = A + A^T$.


I am not even sure what the Jacobian looks like (I never did one for $x \in \Bbb R^n$). Please help.

Smajl
  • 716

7 Answers

9

Let's compute the first derivative. By definition we need to find a row vector $f'(x)$ (i.e. $f'\colon\mathbb R^n \to \mathbb R^{1\times n}$) such that $$ f(x+h) = f(x) + f'(x)h + o(h), \qquad h \to 0. $$ We have \begin{align*} f(x+h) &= (x+h)^tA(x+h)\\ &= x^tAx + h^tAx + x^tAh + h^tAh\\ &= f(x) + x^t(A + A^t)h + h^tAh, \end{align*} where we used $h^tAx = x^tA^th$ (both sides are scalars). As $|h^tAh|\le \|A\||h|^2 = o(h)$, we have $f'(x) = x^t(A + A^t)$ for each $x \in \mathbb R^n$.

Now we compute $f''$. We have \begin{align*} f'(x+h) &= x^t(A + A^t) + h^t(A + A^t)\\ &= f'(x) + h^t(A + A^t), \end{align*} so $f''(x) = A + A^t$.
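To see the $o(h)$ claim concretely, here is a small numerical sanity check (an illustrative sketch with arbitrary random data, not part of the original argument): the remainder $f(x+h)-f(x)-x^t(A+A^t)h$ equals $h^tAh$ and vanishes faster than $|h|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # deliberately non-symmetric
x = rng.standard_normal(n)
d = rng.standard_normal(n)        # fixed direction, h = t*d

def f(v):
    return v @ A @ v

for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    remainder = f(x + h) - f(x) - x @ (A + A.T) @ h
    # remainder equals h^T A h, so remainder/|h| -> 0 as h -> 0
    print(remainder, h @ A @ h, remainder / np.linalg.norm(h))
```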

martini
  • 86,011
  • I don't understand how in the last step we can make $x^t(A+A^t)=f(x)$? – user 6663629 Oct 05 '19 at 15:04
  • 1
    @user6663629 maybe it's supposed to be $f^{\prime}$ instead of $f$ – user36028 Nov 06 '20 at 07:55
  • Given that $h$ is also a vector of dimension $n$ like $x$, are we dividing by $h$ as in the scalar case, or are we supposed to multiply by its inverse ($h^{-1}$)? For example, how do we get rid of $h^{T}$ in the last step $h^{T}(A+A^T)$? – cangozpi Dec 19 '24 at 11:14
9

Intuitively, the gradient and Hessian of $f$ satisfy \begin{equation} f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac12 \Delta x^T Hf(x) \Delta x \end{equation} and the Hessian is symmetric.

In this problem, \begin{align*} f(x + \Delta x) &= (x + \Delta x)^T A (x + \Delta x) \\ &= x^T A x + \Delta x^T A x + x^T A \Delta x + \Delta x^T A \Delta x \\ &= x^T A x + \Delta x^T(A + A^T)x + \frac12 \Delta x^T(A + A^T) \Delta x. \end{align*}

Comparing this with the approximate equality above, we see that $\nabla f(x) = (A + A^T) x$ and $Hf(x) = A + A^T$.
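Since $f$ is quadratic, the expansion above is in fact exact (there are no higher-order terms), which is easy to confirm numerically. A minimal sketch, with arbitrarily chosen random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))      # non-symmetric on purpose
x = rng.standard_normal(n)
dx = rng.standard_normal(n)

def f(v):
    return v @ A @ v

grad = (A + A.T) @ x                 # claimed gradient
H = A + A.T                          # claimed Hessian

lhs = f(x + dx)
rhs = f(x) + grad @ dx + 0.5 * dx @ H @ dx
print(np.isclose(lhs, rhs))          # True: no higher-order terms remain
```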

littleO
  • 54,048
  • This is a really fast way to find the answer, but I have a question: where does the "intuitive" formula for the gradient come from, and is this approach always applicable for finding exact gradients and Hessians? – cangozpi Dec 19 '24 at 11:20
6

Write $f$ explicitly as a sum of second-degree monomials, $$f(x)=\sum_{i,j}A_{ij}x_ix_j.$$ The Hessian is the matrix $$H=(\partial_i\partial_jf(x)).$$
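Spelling this out (filling in the computation the answer leaves implicit): $$\partial_k f(x)=\sum_j A_{kj}x_j+\sum_i A_{ik}x_i, \qquad \partial_l\partial_k f(x)=A_{kl}+A_{lk},$$ so the Hessian is $H=A+A^{\top}$.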

4

For $f(x)=x^{\top}Ax$, where $f\colon\mathbb R^n \to \mathbb R$, the Jacobian $f'\colon\mathbb R^n \to \mathbb R^{1\times n}$ can be found from

$f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$

$f(x+h)=(x+h)^{\top}A(x+h)=(x^{\top}A+h^{\top}A)(x+h)=x^{\top}Ax+x^{\top}Ah+h^{\top}Ax+h^{\top}Ah$

$f(x+h)=f(x)+x^{\top}Ah+x^{\top}A^{\top}h+h^{\top}Ah=f(x)+x^{\top}(A+A^{\top})h+h^{\top}Ah$

$f'(x)=\lim_{h\to0}\frac{f(x)+x^{\top}(A+A^{\top})h+h^{\top}Ah-f(x)}{h}=\lim_{h\to0}\frac{(x^{\top}(A+A^{\top})+h^{\top}A)h}{h}$

$f'(x)=\lim_{h\to0}x^{\top}(A+A^{\top})+h^{\top}A=x^{\top}(A+A^{\top})$

Thus, the Hessian $f''\colon\mathbb R^n \to \mathbb R^{n\times n}$ can be found as

$f''(x)=\lim_{h\to0}\frac{f'(x+h)-f'(x)}{h}$

$f'(x+h)=(x+h)^{\top}(A+A^{\top})=x^{\top}(A+A^{\top})+h^{\top}(A+A^{\top})$

$f''(x)=\lim_{h\to0}\frac{x^{\top}(A+A^{\top})+h^{\top}(A+A^{\top})-x^{\top}(A+A^{\top})}{h}=\lim_{h\to0}A+A^{\top}$

Finally $f''(x)=A+A^{\top}$
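If you want to check this result without any symbolic manipulation, a mixed second difference of $f$ itself recovers $A+A^{\top}$ entrywise. A small numerical sketch (the data below are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
I = np.eye(n)

def f(v):
    return v @ A @ v

t = 1e-5
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # mixed second difference approximating d^2 f / (dx_i dx_j)
        H[i, j] = (f(x + t*I[i] + t*I[j]) - f(x + t*I[i])
                   - f(x + t*I[j]) + f(x)) / t**2

print(np.allclose(H, A + A.T, atol=1e-4))   # True
```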

  • Given that $h$ is a vector, in the last step how does $\dfrac{h^{T}(A+A^T)}{h}=A+A^T$ use $\dfrac{h^T}{h} = 1$? Even the dimensions of $h^T$ and $h$ do not match. – cangozpi Dec 19 '24 at 11:27
1

For all who wonder about the step that turns an expression with $h^{\top}$ into one with $h$:

$x^{\top}Ah+h^{\top}Ax = x^{\top}Ah+x^{\top}A^{\top}h = x^{\top}(A+A^{\top})h$

you can take a $2\times 2$ matrix and show by direct calculation that $h^{\top}Ax$ is the same as $x^{\top}A^{\top}h$:

$A=\left(\begin{matrix}A_{1,1}&A_{1,2}\\A_{2,1}&A_{2,2}\end{matrix}\right)$

$A^{\top}=\left(\begin{matrix}A_{1,1}&A_{2,1}\\A_{1,2}&A_{2,2}\end{matrix}\right)$

$h^{\top}Ax=\left(\begin{matrix}h_1&h_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}&A_{1,2}\\A_{2,1}&A_{2,2}\end{matrix}\right)\left(\begin{matrix}x_1\\x_2\end{matrix}\right)=\left(\begin{matrix}A_{1,1}h_1+A_{2,1}h_2&A_{1,2}h_1+A_{2,2}h_2\end{matrix}\right)\left(\begin{matrix}x_1\\x_2\end{matrix}\right)$

$=(A_{1,1}h_1+A_{2,1}h_2)x_1+(A_{1,2}h_1+A_{2,2}h_2)x_2$

$x^{\top}A^{\top}h=\left(\begin{matrix}x_1&x_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}&A_{2,1}\\A_{1,2}&A_{2,2}\end{matrix}\right)\left(\begin{matrix}h_1\\h_2\end{matrix}\right)=\left(\begin{matrix}x_1&x_2\end{matrix}\right)\left(\begin{matrix}A_{1,1}h_1+A_{2,1}h_2\\A_{1,2}h_1+A_{2,2}h_2\end{matrix}\right)$

$=(A_{1,1}h_1+A_{2,1}h_2)x_1+(A_{1,2}h_1+A_{2,2}h_2)x_2$

BUT there is another, unrelated problem with the formula for the gradient. A gradient is a column vector, and $x^{\top}(A+A^{\top})$ produces a row vector. AND how do you divide by a vector $h$? It doesn't really work, and I think this is the reason why we get a row vector instead of a column vector. One probably needs to use a directional derivative to be proper. But what is the gradient written as a directional derivative?

This has some practical application for constructing gradients and Hessians of quadratic forms of Laplacians. And if you use Newton optimization, you cannot plug in a row vector.

(For comparison, as suggested above, see https://en.wikipedia.org/wiki/Taylor_series, section "Taylor series in several variables".)

$T(\mathbf{x}) = f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\mathsf{T} D f(\mathbf{a}) + \frac{1}{2!} (\mathbf{x} - \mathbf{a})^\mathsf{T} \left \{D^2 f(\mathbf{a}) \right \} (\mathbf{x} - \mathbf{a}) + \cdots, $

It is not really ideal to compare the results here with the Taylor formula from Wikipedia, because there you'd multiply the gradient and the Hessian by $x$ and $x^{\top}$, whereas here we are interested in the gradient and the Hessian themselves. Still, you can see that the gradient needs to be a column vector.

I think that the gradient of $f(x)=x^{\top}Ax$ will be $\nabla f(x)=(A+A^{\top})x$, but a proof is missing (a short argument is sketched below).

Edit: $Df(x)$ is apparently the transpose of the gradient, so it should be $Df(x)=x^{\top}(A+A^{\top})$ (see the comments below).
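One way to supply that missing proof: for any direction $h$, the differential acts as $$Df(x)h = x^{\top}(A+A^{\top})h = \bigl((A+A^{\top})x\bigr)^{\top}h = \bigl\langle (A+A^{\top})x,\,h\bigr\rangle,$$ using that $A+A^{\top}$ is symmetric. Since the gradient is defined by $\langle\nabla f(x),h\rangle = Df(x)h$ for all $h$, it follows that $\nabla f(x)=(A+A^{\top})x$, which is a column vector as required.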

  • There is a subtle technical difference between the gradient $\nabla f$ and the first derivative (or differential) $D f$ of a function $f: \mathbb{R}^n \to \mathbb{R}$, namely that they are transposes of each other (see wiki: gradient). – MSDG Dec 29 '19 at 12:38
  • Aah, thanks! I didn't know that! Yes, that is really important to know! Then the Taylor series formula above is technically speaking problematic? Confusing! I mean $x$ must be a column vector, right? (Otherwise you could not multiply $x^{\top}$ with the Hessian from the left.) And if you try to multiply $x^{\top}=(x_1\ x_2\ ...\ x_n)$ with $Df$ when it is a row vector ($Df=(f_{x_1}\ f_{x_2}\ ...\ f_{x_n})$), that doesn't work out... :-oo – Sönke Schmachtel Dec 29 '19 at 13:06
  • https://en.wikipedia.org/wiki/Gradient#Derivative – Sönke Schmachtel Dec 29 '19 at 13:15
  • It is not problematic, one just needs to be aware of what the notation represents. In the Wikipedia article on Taylor series expansions it is clearly stated below the formula that $Df$ denotes the gradient, not the differential (so it is a column vector, and the multiplication that you find problematic is well-defined). There are several conventions for denoting these things. Personally I like to denote the differential by $\mathrm df$, and the gradient by $\nabla f$ (or $\text{grad } f$). – MSDG Dec 29 '19 at 13:31
  • 1
    Yes :-) Thumbs up! "where $Df(a)$ is the gradient of $f$ evaluated at $x = a$"; I should have read it better. – Sönke Schmachtel Dec 29 '19 at 13:37
0

Digging further into quadratic forms, I came across the fact that a quadratic form of a nonsymmetric matrix $A$ can always be rewritten as a symmetric quadratic form via $x^{\top}Ax=x^{\top}\frac{A+A^{\top}}{2}x=x^{\top}A_{sym}x$

https://math.stackexchange.com/a/3203658/738033

and especially wonderful is that $A_{sym}=Q^{\top}\Lambda Q$, where $Q$ is an orthogonal matrix. Also, if $A$ is positive definite (in particular if it is a Laplacian) you could use the Cholesky decomposition $A_{sym}=LL^{\top}$, or the related LDL factorization for linearly constrained (positive semidefinite) problems :-)
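A minimal numerical illustration of the symmetrization and of the orthogonal factorization (the matrices below are arbitrary; note that NumPy's `eigh` returns the factorization in the form $Q\Lambda Q^{\top}$, which is the same statement with the roles of $Q$ and $Q^{\top}$ swapped):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))              # non-symmetric
A_sym = (A + A.T) / 2
x = rng.standard_normal(n)

# the quadratic form only "sees" the symmetric part of A
print(np.isclose(x @ A @ x, x @ A_sym @ x))              # True

# spectral decomposition of the symmetric part: A_sym = Q diag(lam) Q^T
lam, Q = np.linalg.eigh(A_sym)
print(np.allclose(A_sym, Q @ np.diag(lam) @ Q.T))        # True
```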

0

If you happen to work in machine learning and have read Christopher M. Bishop's classic text "Pattern Recognition and Machine Learning", this is like doing univariate calculus:

First order derivative (gradient): $$\nabla f({\bf x})=\frac{\partial\,{\bf x}^T{\bf A}{\bf x}}{\partial{\bf x}}=\frac{\partial\operatorname{Tr}({\bf x}^T{\bf A}{\bf x})}{\partial{\bf x}}=\bigl({\bf x}^T({\bf A}+{\bf A}^T)\bigr)^T=({\bf A}+{\bf A}^T){\bf x}=2{\bf A}{\bf x}$$ by equation (C.27); the last equality uses the symmetry of $\bf A$. The Jacobian is the row form of the gradient, i.e., the transpose of the gradient.

Further, the second order derivative (Hessian): $${\bf H}=\frac{\partial\nabla f({\bf x})}{\partial{\bf x}}=2\frac{\partial}{\partial{\bf x}}({\bf A}{\bf x})=2\left(\frac{\partial{\bf A}}{\partial{\bf x}}{\bf x}+{\bf A}\frac{\partial{\bf x}}{\partial{\bf x}}\right)=2({\bf0}+{\bf A}{\bf I})=2{\bf A}={\bf A}+{\bf A}^T$$ by equation (C.20), again using the symmetry of $\bf A$ in the last step. So, if you are familiar with the rules and notation of matrix derivatives in Bishop's text, it is very easy and intuitive.
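For readers who prefer to see it symbolically rather than via Bishop's identities (C.20) and (C.27), here is a small SymPy sketch (a $3\times3$ illustration with my own symbol names) that recovers $H=A+A^{\top}$ for a general, not necessarily symmetric, $\bf A$:

```python
import sympy as sp

n = 3
x = sp.Matrix(sp.symbols('x1:4'))                            # column vector (x1, x2, x3)
A = sp.Matrix(n, n, lambda i, j: sp.Symbol(f'a{i+1}{j+1}'))  # generic (non-symmetric) 3x3

f = (x.T * A * x)[0, 0]          # scalar quadratic form x^T A x
H = sp.hessian(f, list(x))       # matrix of second partials d^2 f / (dx_i dx_j)

print(sp.simplify(H - (A + A.T)))   # prints the zero matrix, so H = A + A^T
```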

zzzhhh
  • 163