My question is: when is the approximation of the Hessian matrix, $H \approx J^T J$, reasonable?
One well-known case is that it is reasonable to approximate the Hessian using only first-order derivatives (the Jacobian), i.e. $H \approx J^T J$, when solving a non-linear least-squares problem; this is the Gauss-Newton method. In other words, this is the case when the cost function (energy function) is a sum of squared (possibly non-linear) residuals. The approximation can be derived from Newton's method. See the wiki: https://en.wikipedia.org/wiki/Gauss–Newton_algorithm
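For concreteness, here is the standard derivation I have in mind (in my notation, $r(x)$ is the residual vector and $J = \partial r / \partial x$ its Jacobian):

$$f(x) = \tfrac{1}{2}\, r(x)^T r(x), \qquad \nabla f(x) = J^T r(x), \qquad \nabla^2 f(x) = J^T J + \sum_i r_i(x)\, \nabla^2 r_i(x).$$

Gauss-Newton keeps only the $J^T J$ term and drops $\sum_i r_i \nabla^2 r_i$, which is justified when the residuals are small near the optimum and/or the residual functions are close to linear.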
But are there any other cases in which the approximation $H \approx J^T J$ is reasonable? For example, for some general non-linear optimization problems?
I have noticed that some papers (in the field of Computer Vision) use the Gauss-Newton or Levenberg–Marquardt (L-M) algorithm to solve non-linear, non-least-squares (i.e. general non-linear) optimization problems, which in fact amounts to using the approximation $H \approx J^T J$. But none of them actually explain why this is reasonable.
I have used this strategy in my own research too, and my experiments showed it to be efficient. But I still don't know how to justify the Hessian approximation mathematically. (I was asked about this by a reviewer on a recent journal submission.)
So again, are there any hints on how to justify the approximation of the Hessian matrix $H \approx J^T J$ for some general non-linear optimization problems?
Thank you very much for your kind help!

