5

Why does the Hessian matrix $$\left( {\begin{array}{cc} \frac{\partial^2f}{\partial x^2} & \frac{\partial^2f}{\partial x \partial y} \\ \frac{\partial^2f}{\partial y \partial x} & \frac{\partial^2f}{\partial y^2} \\ \end{array} } \right)$$ work and where does it come from?

I just recently came across this in a multivariable calculus course. It was used to determine whether a critical point of a function of two variables is a maximum, a minimum, or a "saddle point". Can anyone explain why it pops up here and how it helps one understand the properties of such a point?

John Doe
  • 5
    A proof of what? A derivation of what? Asking for someone to "derive the Hessian matrix" is as nonsensical as asking for someone to "prove the number $2$." What fact about the Hessian matrix are you interested in, exactly? – Ben Grossmann Aug 12 '20 at 12:03
  • 1
    @BenGrossmann I want to know how the formula of the Hessian matrix is derived. Where does it come from? Why does it look the way it does? – John Doe Aug 12 '20 at 12:11
  • 2
    Again, your question is not well formulated. Usually, the "Hessian matrix" of $f$ is defined to be a matrix containing the second order partial derivatives of $f$. So the only reasonable answer to "why does it look the way it does" is "because that's the definition." The Hessian matrix is not "derived," so it does not make sense to ask how this is done. "Where does the Hessian matrix come from," however, is the start of a reasonable question that you could perhaps elaborate on. – Ben Grossmann Aug 12 '20 at 12:16
  • @BenGrossmann I see, thank you for your feedback. I just want to know why it shows up in the context it does, or as you put it: "Where it comes from". – John Doe Aug 12 '20 at 12:18
  • 1
    Ok, That's another step in the right direction. Could you be more specific about the context in which you encountered the Hessian matrix? Perhaps you recently learned about the usual procedure for finding the local minima/maxima of a multivariate function. – Ben Grossmann Aug 12 '20 at 12:19
  • 3
    You might find this post and this post about the Hessian matrix to be illuminating. If neither gives you the answer you're looking for, perhaps you could use these as examples of how you might narrow down your question. – Ben Grossmann Aug 12 '20 at 12:23
  • @BenGrossmann Yes! This is my first encounter with it, and it is in the context of determining whether an extremum is a maximum, minimum or "saddle point" in a function of two variables. – John Doe Aug 12 '20 at 12:24
  • In some sense it's the gradient of the gradient. – Cameron L. Williams Aug 12 '20 at 12:25
  • @user271418 Great, that clears things up a lot. It would be helpful if you could edit your question to reflect what you have just told me. – Ben Grossmann Aug 12 '20 at 12:29
  • @BenGrossmann I will. Thank you. – John Doe Aug 12 '20 at 12:30
  • In general, the Taylor approximation of a smooth function at a point is linear up to an error term, so one knows the local behavior; but at a critical point, where the linear coefficients are zero, the Taylor polynomial becomes a quadratic $ax^2+2bxy+cy^2$ up to error, and by completing the square one can classify the point whenever the discriminant $4(b^2-ac)$ is nonzero. The Hessian is precisely the matrix that gives $a=\frac{\partial^2f}{\partial x^2},b=\frac{\partial^2f}{\partial x \partial y},c=\frac{\partial^2f}{\partial y^2}$ – Conrad Aug 12 '20 at 12:33
  • @BenGrossmann: I disagree with your assessment. The Hessian is the matrix representing the second total derivative of a twice totally differentiable function. Its entries can then be derived to be the second partial derivatives. A matrix containing these partial derivatives of a function which isn't twice totally differentiable doesn't deserve the name Hessian, just as a matrix containing the first partial derivatives of a not totally differentiable function doesn't deserve the name Jacobian or gradient. – Vercassivelaunos Aug 12 '20 at 12:39
  • @Vercassivelaunos That would be a valid way to define the Hessian, but it is not particularly common. Also, I strongly suspect that the asker has not encountered the notion of the "total derivative," which means that your explanation, while insightful, is likely inaccessible to the intended audience. – Ben Grossmann Aug 12 '20 at 12:42
  • @BenGrossmann is the multivariable chain rule implied by the total derivative? – John Doe Aug 12 '20 at 12:44
  • @user That's a tricky question to untangle. The technically correct answer to your question is that the multivariate chain rule is a consequence of the definition of "differentiability," and that if $Df$ denotes the total derivative of $f$, then all incarnations of the chain rule can be expressed in the form $$ D(f \circ g)(x) = (Df)(g(x)) \cdot (Dg)(x). $$ The answer to the question I suspect you mean to ask is that yes: the "multivariable chain rule" that (I think) you are used to is a consequence of the chain rule for the total derivative. – Ben Grossmann Aug 12 '20 at 12:48

4 Answers

5

The Fundamental Strategy of Calculus is to take a nonlinear function (difficult) and approximate it locally by a linear function (easy). If $f:\mathbb R^n \to \mathbb R$ is differentiable at $x_0$, then our local linear approximation for $f$ is $$ f(x) \approx f(x_0) + \nabla f(x_0)^T(x - x_0). $$ But why not approximate $f$ instead by a quadratic function? The best quadratic approximation to a smooth function $f:\mathbb R^n \to \mathbb R$ near $x_0$ is $$ f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac12 (x - x_0)^T Hf(x_0)(x - x_0) $$ where $Hf(x_0)$ is the Hessian of $f$ at $x_0$.
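
As a concrete illustration (my own addition, not part of the original answer), here is a minimal numerical sketch comparing the linear and quadratic approximations; the test function $f(x,y)=\sin x\cos y$, the base point, and the displacement are arbitrary choices.

```python
import numpy as np

# Test function f(x, y) = sin(x)*cos(y), chosen arbitrarily for illustration.
def f(p):
    x, y = p
    return np.sin(x) * np.cos(y)

def grad_f(p):
    x, y = p
    return np.array([np.cos(x) * np.cos(y), -np.sin(x) * np.sin(y)])

def hess_f(p):
    x, y = p
    return np.array([[-np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)],
                     [-np.cos(x) * np.sin(y), -np.sin(x) * np.cos(y)]])

x0 = np.array([0.5, 0.3])
dx = np.array([0.1, -0.05])          # small displacement x - x0

linear    = f(x0) + grad_f(x0) @ dx
quadratic = linear + 0.5 * dx @ hess_f(x0) @ dx

print("true value :", f(x0 + dx))
print("linear     :", linear)        # error O(|dx|^2)
print("quadratic  :", quadratic)     # error O(|dx|^3): visibly closer
```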

littleO
  • So the only reason the Hessian pops up here is because it was decided that a quadratic approximation would be used instead of a linear? If so, why? Is the quadratic approximation better somehow? If precision is the goal, why not then use a cubic approximation or an even more precise approximation? – John Doe Aug 12 '20 at 12:51
  • 2
    @user271418 You revealed in your question that the Hessian helps you determine whether a critical point is a maximum or minimum or "saddle point". That is one possible answer to "why look at a quadratic approximation" – Mark S. Aug 12 '20 at 12:57
  • 1
    I understand now. Thank you very much for your answer. – John Doe Aug 12 '20 at 12:58
5

Let's suppose for simplicity that the critical point we are trying to analyze is $p=(0,0)$.

Take some direction $u$ and consider the single-variable function $t \mapsto f(tu)$. Computing its second derivative at $t=0$, we are analyzing, by single-variable calculus, the concavity of the restriction of $f$ to the plane spanned by $u$ and the vertical axis. For example, if this value is positive for every direction $u$, then $f$ has a local minimum at $p$.

Computing this second derivative, you arrive at $\langle \mathrm{Hess}f(p)\, u, u\rangle$. This alone tells us how the Hessian appears when analyzing whether a critical point is a local minimum, a saddle or a local maximum. But let's understand why the determinant is relevant in the two-dimensional case.
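
To make this concrete, here is a small sympy sketch (my own addition; the polynomial $f$ is an arbitrary example with a critical point at the origin) verifying that the second derivative of $t \mapsto f(tu)$ at $t=0$ equals $\langle \mathrm{Hess}f(0)\,u, u\rangle$:

```python
import sympy as sp

x, y, t, u1, u2 = sp.symbols('x y t u1 u2')
f = x**2 - 3*x*y + 2*y**4          # arbitrary example with a critical point at (0, 0)

# Restrict f to the line t -> t*u through the origin and differentiate twice in t.
restricted = f.subs({x: t*u1, y: t*u2})
second_along_u = sp.diff(restricted, t, 2).subs(t, 0)

# Quadratic form of the Hessian at the origin.
H = sp.hessian(f, (x, y)).subs({x: 0, y: 0})
u = sp.Matrix([u1, u2])
quad_form = (u.T * H * u)[0]

print(sp.simplify(second_along_u - quad_form))   # prints 0: the two expressions agree
```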

It is known that if $A$ is a symmetric matrix, the function \begin{align} g:\mathbb{R}^n &\to \mathbb{R} \\ x &\mapsto \langle Ax,x \rangle, \end{align} when restricted to the sphere $S^{n-1}$, achieves its maximum and minimum values at eigenvectors of $A$. (You can prove this using Lagrange multipliers, for example.) Note that if $v$ is a unit eigenvector with eigenvalue $\lambda$, then $g(v)=\langle Av ,v \rangle=\langle \lambda v,v \rangle=\lambda$. So if all eigenvalues are positive, then $g$ is positive and $p$ is a local minimum; if there is one positive eigenvalue and one negative one, then $p$ is a saddle; and if all eigenvalues are negative, then $p$ is a local maximum.

Since the determinant is the product of the eigenvalues, in two dimensions it suffices to determine the signs of the eigenvalues, provided the Hessian is non-degenerate. If the determinant is positive, then the eigenvalues are either both positive or both negative (hence a local minimum or a local maximum; we then look at, for example, the sign of $\partial_1^2f=\langle \mathrm{Hess}f(p)e_1,e_1 \rangle$ to decide which case). If the determinant is negative, the eigenvalues have opposite signs, so the point is a saddle.
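
A short sketch of the resulting second-derivative test (my own illustration; the example Hessians correspond to $x^2+y^2$, $x^2-y^2$ and $-x^2-y^2$ at the origin):

```python
import numpy as np

def classify(H):
    """Classify a critical point from its 2x2 Hessian (non-degenerate case)."""
    det = np.linalg.det(H)
    if det < 0:
        return "saddle"                      # eigenvalues of opposite sign
    if det > 0:
        return "local min" if H[0, 0] > 0 else "local max"
    return "degenerate: test is inconclusive"

print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))    # local min
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle
print(classify(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # local max
```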

Aloizio Macedo
4

The Hessian is an essential part of the multidimensional Taylor expansion of a sufficiently smooth function. Total differentiability of a function $f:U\to\mathbb R$ in $x_0\in U$ for an open subset $U\subseteq \mathbb R^n$ means that there is a linear map $L:\mathbb R^n\to \mathbb R$ such that

$$\lim_{x\to x_0}\frac{f(x)-[f(x_0)+L(x-x_0)]}{\Vert x-x_0\Vert}=0.$$

That's the definition of total differentiability. The term in $[\,]$ is then the first-order Taylor approximation of $f$ around $x_0$, and we call $L$ the gradient. The equation essentially tells us that as $x$ approaches $x_0$, the difference between $f$ and its Taylor approximation becomes small faster than $\Vert x - x_0\Vert$. We could also derive that the gradient's matrix representation is $\nabla f(x_0)$, but I'll skip this.

Now if $f$ is twice totally differentiable, this means that additionally there is a bilinear form $B:\mathbb R^n\times\mathbb R^n\to\mathbb R$ such that

$$\lim_{x\to x_0}\frac{f(x)-[f(x_0)+L(x-x_0)+\frac{1}{2}B(x-x_0,x-x_0)]}{\Vert x-x_0\Vert^2}=0.$$

This is not a definition, but the statement of one of the several versions of Taylor's theorem. The term in $[]$ is now the second order Taylor approximation, and we call $B$ (or rather its matrix representation) the Hessian of $f$, and we get $B(v,w)=w^T \mathrm Hf(x_0) v$. It also happens to be the total differential of the function $x\mapsto \nabla f(x)$, which would allow us to derive its components, but again, I'll skip that.

With this, the Taylor approximation of a twice totally differentiable function becomes

$$f(x)\approx f(x_0)+\nabla f(x_0)\cdot(x-x_0)+\frac{1}{2}(x-x_0)^T \cdot\mathrm Hf(x_0)\cdot(x-x_0).$$

From here it might be intuitively clear why the Hessian tells us about the type of critical point. If $\nabla f(x_0)=0$, then the Taylor approximation is just a constant plus the Hessian term. If the Hessian is positive or negative definite, this term can only increase (positive definite) or only decrease (negative definite) as $x-x_0$ moves away from $0$ (and thus $x$ moves away from $x_0$), so we must be at a minimum or maximum, respectively. If the Hessian is indefinite, however, the Hessian term increases as $x$ moves away from $x_0$ in some direction and decreases in another, so we must be at a saddle point.
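
As a hedged illustration of this (mine, not the answerer's), one can compute the Hessian symbolically and inspect its definiteness at each critical point; the test function below is an arbitrary choice.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 3*x + y**2              # arbitrary test function

grad = [sp.diff(f, v) for v in (x, y)]
critical_points = sp.solve(grad, (x, y), dict=True)   # solve gradient = 0

for p in critical_points:
    H = sp.hessian(f, (x, y)).subs(p)
    if H.is_positive_definite:
        kind = "local minimum"
    elif H.is_negative_definite:
        kind = "local maximum"
    elif H.is_indefinite:
        kind = "saddle point"
    else:
        kind = "inconclusive"
    print(p, kind)
```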

0

Let ${ g(z) }$ be a real-valued, thrice differentiable function with inputs ${ z \in \mathbb{R} ^N . }$ Let

$${ f(t) := g(z + t \Delta z) . }$$

We have the heuristic

$${ {\begin{aligned} \Delta f \approx &\, \sum _{i = 1} ^N \frac{\partial g}{\partial z _i} (z + t \Delta z) \, (\Delta (t \Delta z )) _i \\ = &\, \Delta t \sum _{i = 1} ^N \frac{\partial g}{\partial z _i} (z + t \Delta z) \, \Delta z _i \end{aligned}} }$$

that is

$${ \frac{df(t)}{dt} = \sum _{i = 1} ^N \frac{\partial g}{\partial z _i} (z + t \Delta z) \, \Delta z _i . }$$

Differentiating once again, we have

$${ {\begin{align} \frac{d ^2 f(t)}{dt ^2} = &\, \sum _{i = 1} ^N \frac{d}{dt} \left( \frac{\partial g}{\partial z _i} (z + t \Delta z) \right) \, \Delta z _i \\ = &\, \sum _{i = 1} ^N \left( \sum _{ j = 1} ^N \frac{\partial}{\partial z _j} \frac{\partial g}{\partial z _i} ( z + t \Delta z) \, \Delta z _j \right) \, \Delta z _i \\ = &\, \sum _{i, j = 1} ^N \frac{\partial ^2 g}{\partial z _j \partial z _i}(z + t \Delta z) \, \Delta z _j \Delta z _i . \end{align}} }$$

Differentiating once again, we have

$${ {\begin{align} \frac{d ^3 f(t)}{d t ^3 } = &\, \sum _{i, j = 1} ^N \frac{d}{dt} \left( \frac{\partial ^2 g}{\partial z _j \partial z _i}(z + t \Delta z) \right) \, \Delta z _j \Delta z _i \\ = &\, \sum _{i, j = 1} ^N \left( \sum _{k = 1} ^N \frac{\partial}{\partial z _k} \frac{\partial ^2 g}{\partial z _j \partial z _i } (z + t \Delta z) \, \Delta z _k \right) \, \Delta z _j \Delta z _i \\ = &\, \sum _{ i, j, k = 1} ^N \frac{\partial ^3 g}{\partial z _k \partial z _j \partial z _i} (z + t \Delta z) \, \Delta z _k \Delta z _j \Delta z _i \end{align}} }$$

and so on.

Substituting these into the Taylor expansion of ${ f }$ near ${ 0 }$, namely

$${ f(t) \approx f(0) + f ^{'} (0) \, t + \frac{f ^{(2)} (0) }{2!} t ^2 + \frac{f ^{(3)} (0)}{3!} t ^3 + \ldots }$$

we have

$${ {\begin{align} &\, g(z + t \Delta z) \\ \approx &\, g(z) + t \sum _{i = 1} ^N \frac{\partial g}{\partial z _i} (z) \, \Delta z _i + \frac{t ^2}{2!} \sum _{i, j = 1} ^N \frac{\partial ^2 g}{\partial z _j \partial z _i}(z) \, \Delta z _j \Delta z _i + \frac{t ^3}{3!} \sum _{ i, j, k = 1} ^N \frac{\partial ^3 g}{\partial z _k \partial z _j \partial z _i} (z) \, \Delta z _k \Delta z _j \Delta z _i + \ldots \end{align}} }$$

that is, setting ${ t = 1 }$,

$${ \boxed{{\begin{align} &\, g(z + \Delta z) \\ \approx &\, g(z) + \sum _{i = 1} ^N \frac{\partial g}{\partial z _i} (z) \, \Delta z _i + \frac{1}{2!} \sum _{i, j = 1} ^N \frac{\partial ^2 g}{\partial z _j \partial z _i}(z) \, \Delta z _j \Delta z _i + \frac{1}{3!} \sum _{ i, j, k = 1} ^N \frac{\partial ^3 g}{\partial z _k \partial z _j \partial z _i} (z) \, \Delta z _k \Delta z _j \Delta z _i + \ldots \end{align}}} }$$

Defining the gradient and the Hessian as

$${ \nabla _z g(z) := \left( \frac{\partial g(z)}{\partial z _1}, \, \cdots \, , \frac{\partial g(z)}{\partial z _N} \right) }$$

and

$${ Hg(z) := \left( \frac{\partial ^2 g (z)}{\partial z _i \partial z _j}\right) _{i , j = 1} ^{N} }$$

the second order Taylor expansion can be written as

$${ g(z + \Delta z) \approx g(z) + \nabla _z g (z) \, \Delta z + \frac{1}{2!} (\Delta z) ^T Hg(z) \, (\Delta z) . }$$
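
A quick numerical check of this second-order expansion (an addition of mine; the choice of ${ g }$, ${ z }$ and ${ \Delta z }$ is arbitrary), using a small but finite ${ \Delta z }$:

```python
import numpy as np

# Arbitrary smooth test function g: R^3 -> R with its gradient and Hessian.
def g(z):
    return np.exp(z[0]) * np.sin(z[1]) + z[2]**2

def grad_g(z):
    return np.array([np.exp(z[0]) * np.sin(z[1]),
                     np.exp(z[0]) * np.cos(z[1]),
                     2 * z[2]])

def hess_g(z):
    return np.array([[np.exp(z[0]) * np.sin(z[1]),  np.exp(z[0]) * np.cos(z[1]), 0.0],
                     [np.exp(z[0]) * np.cos(z[1]), -np.exp(z[0]) * np.sin(z[1]), 0.0],
                     [0.0,                          0.0,                         2.0]])

z  = np.array([0.2, 0.4, -0.1])
dz = np.array([1e-2, -2e-2, 1.5e-2])

second_order = g(z) + grad_g(z) @ dz + 0.5 * dz @ hess_g(z) @ dz
print("exact        :", g(z + dz))
print("second order :", second_order)    # agrees up to O(|dz|^3)
```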