
Currently I am studying logistic regression.

I read online that Gradient Descent is a 1st-order optimisation algorithm, and that Newton's Method is a 2nd order optimisation algorithm.

Does that mean that Gradient Descent cannot be used for multivariate optimisation and that Newton's Method cannot be used for univariate optimisation? Or can Newton's Method be done using a 1st order Taylor polynomial and still be different from Gradient Descent?

These sites are causing me to question:

  • You can certainly use both gradient descent and Newton's method for either univariate or multivariate optimization. – littleO Nov 27 '19 at 11:36
  • This means that gradient descent depends only on the derivative of a function, whereas Newton's method depends on the first and the second derivative. This terminology (1st order / 2nd order) has nothing to do with univariate or multivariate optimisation. – Slup Nov 27 '19 at 11:37

3 Answers

  • Newton's method aims to find the roots of a function $g$, i.e., the $x$ values satisfying $g(x)=0$. The iterative update step is $x_{n+1} = x_n - J_g (x_n)^{-1}g(x_n)$, where $J_g$ is the Jacobian matrix.

  • Gradient descent aims at finding a (local) minimum of a function $f$. A necessary condition for a differentiable function $f$ to have a minimum at $x$ is $\nabla f(x)=0$ (note that this is not sufficient, since the critical point can also be a maximum or a saddle point). The iterative update step is $x_{n+1} = x_n - \eta\nabla f(x_n)$, where $\eta$ is the learning rate.

  • Hence, the minimization problem $\min_{x} f(x)$ can be recast as the root-finding problem $\nabla f(x) = 0$, assuming that a local minimum exists in some neighborhood: let $g = \nabla f$ and solve with Newton's method, with iterative update step $x_{n+1} = x_n - H_f (x_n)^{-1}\nabla f(x_n)$, where $H_f$ is the Hessian matrix of $f$ (the Jacobian of $\nabla f$). This is why it becomes a second-order method.

  • Newton's method often converges much faster, but each iteration is computationally more expensive.

  • The next couple of animations demonstrate the convergence to the (global) minimum at $(0,0)$ of the (convex) function $f(x,y)=x^2+y^2$ with gradient descent (learning rate $\eta = 0.05$; takes around $50$ iterations to converge, as shown below) and with Newton's method (converges in a single iteration), starting from the initial value $(3,3)$. A short numerical sketch reproducing this comparison follows the figures.

  • $f(x,y) = x^2 + y^2$, $\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix}$ and $H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$

  • Gradient descent update:

$\quad \quad \begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \eta \begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix} = (1-2\eta)\begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix}$

  • As can be seen from above, in this case, gradient descent converges to the minimum in a single step only when $\eta=0.5$.

  • Newton's method update:

$\quad \quad \begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}^{-1}\begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix}$

$\quad \quad = \begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix} - \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{bmatrix}\begin{bmatrix} 2x_{n} \\ 2y_{n} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

[Gradient Descent animations with $\eta=0.05$, $\eta=0.25$, and $\eta=0.5$, followed by the Newton's method animation; images not reproduced here.]
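The following is a minimal numerical sketch of the comparison above (my own code, not the answerer's animation script). The exact gradient-descent iteration count depends on the stopping tolerance; with the loose tolerance used here it is on the order of the $\sim 50$ iterations quoted.

```python
import numpy as np

def grad_f(v):
    # gradient of f(x, y) = x^2 + y^2
    return 2.0 * v

H_f = np.array([[2.0, 0.0], [0.0, 2.0]])  # constant Hessian of this quadratic

# Gradient descent with eta = 0.05, starting at (3, 3)
eta = 0.05
v = np.array([3.0, 3.0])
steps = 0
while np.linalg.norm(v) > 0.02:        # stop once close to the known minimum (0, 0)
    v = v - eta * grad_f(v)
    steps += 1
print(f"gradient descent: {steps} steps, x = {v}")   # roughly 50 steps for this tolerance

# Newton's method: a single step reaches the exact minimum of this quadratic
v = np.array([3.0, 3.0])
v = v - np.linalg.solve(H_f, grad_f(v))
print(f"Newton's method:  1 step,  x = {v}")
```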

– Sandipan Dey

As stated in the comments, gradient descent and Newton's method are both optimization methods, regardless of whether the problem is univariate or multivariate. Gradient descent uses only the first derivative, which can make it less efficient in multidimensional problems. Newton's method uses the curvature of the function (the second derivative), which generally leads to a solution faster when the second derivative is easy to compute (although in multidimensional problems it can be attracted to saddle points). So both can be used for multivariate and univariate optimization, but their performance will generally differ.
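To make this concrete, here is a minimal univariate sketch (my own illustrative example, not part of the original answer) that minimizes $f(x) = e^x - 2x$, whose minimum is at $x = \ln 2 \approx 0.6931$, with both methods:

```python
import math

def f_prime(x):
    return math.exp(x) - 2.0      # first derivative of f(x) = exp(x) - 2x

def f_second(x):
    return math.exp(x)            # second derivative

# Gradient descent: uses only f'(x)
x, eta = 0.0, 0.1
for _ in range(100):
    x = x - eta * f_prime(x)
print("gradient descent:", x)

# Newton's method: uses f'(x) and f''(x), converges in a handful of steps
x = 0.0
for _ in range(6):
    x = x - f_prime(x) / f_second(x)
print("Newton's method: ", x)
```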

  • So does that mean that when you use the 1st-order Taylor polynomial it is no longer Newton's method, that it becomes Gradient Descent? – Hugh Pearse Nov 27 '19 at 12:03
  • In essence, the (directional) second derivative tells us how well we can expect a gradient descent step to perform, and this analysis therefore uses a second-order Taylor approximation. Gradient descent just moves in the direction of the negative gradient, where the gradient refers to the derivative with respect to a vector. – Steven01123581321 Nov 27 '19 at 12:20
  • @HughPearse You are confusing two related but different goals. With Newton's method you solve an equation by using the 1st derivative. When you optimize, however, you are not looking for a zero of a function, but for a zero of its jacobian. You then need one more derivative, and that's the hessian. See Newton's method in optimization. – Jean-Claude Arbaut Nov 27 '19 at 12:36
  • @Jean-ClaudeArbaut I found this site which shows a Newton's Method on a univariate function using 1st order Taylor polynomial. There is no Hessian or Jacobian until they need to perform multivariate optimisation. Are they using the wrong terminology? – Hugh Pearse Nov 27 '19 at 12:49
  • @Steven31415 For Newton's method to converge, you need both the second derivative to exist, and the initial point must usually be quite close to the solution (the Newton-Kantorovich theorem only guarantees local convergence). Gradient descent is much slower but also more robust. And often computing the hessian is too costly (if there are $n$ variables, the gradient has size $n$, but the hessian has size $n^2$). So it's not all black and white. – Jean-Claude Arbaut Nov 27 '19 at 12:51
  • @HughPearse Yes. But to see why, you have to consider the Newton-Raphson method for solving a system of nonlinear equations in several variables, which uses the derivative (i.e. the Jacobian matrix). When you apply this method to an optimization problem, you are given a function $f$ of several variables, and you look for critical points, so the system to solve is $\frac{\partial f}{\partial x_i}=0$, $i=1\dots n$. Thus when you apply Newton's method to this system of equations, you need the second derivative. – Jean-Claude Arbaut Nov 27 '19 at 12:55
  • Fully agree with your comments @Jean-ClaudeArbaut. The overall conclusion, though, is that both optimization methods can be used in multivariate and univariate optimization, but they differ in their way of finding solutions. – Steven01123581321 Nov 27 '19 at 13:01
  • @Jean-ClaudeArbaut So, just to be clear: in a Taylor polynomial with 3 terms (a constant, a linear term and a quadratic term), the first term (the constant) is always written as $f(x)$. Say we have a nonlinear cost function $J(x)$; in the Taylor polynomial, is it $f(x) = J^{\prime}(x),\; f^{\prime}(x) = J^{\prime\prime}(x)$, or is it $f(x) = J(x),\; f^{\prime}(x) = J^{\prime}(x)$? – Hugh Pearse Nov 27 '19 at 13:18
  • @HughPearse See Taylor's theorem for multivariate functions. Not as friendly as the one variable case... – Jean-Claude Arbaut Nov 27 '19 at 14:13
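To illustrate the point made in the comments above (Newton's root-finder applied to $g = \nabla f$ needs the second derivative of $f$), here is a minimal univariate sketch; the function and helper names are my own illustrative choices:

```python
# Newton-Raphson as a root-finder for g(x) = 0 uses g'(x).
# Taking g = f' in order to optimize f means that g' is exactly f''.

def newton_root(g, g_prime, x0, steps=20):
    """Plain Newton-Raphson for solving g(x) = 0 in one variable."""
    x = x0
    for _ in range(steps):
        x = x - g(x) / g_prime(x)
    return x

# Minimize f(x) = x**4 - 3*x**2 + x by finding a root of f'(x).
def f_prime(x):
    return 4 * x**3 - 6 * x + 1   # g  = f'

def f_second(x):
    return 12 * x**2 - 6          # g' = f''  (this is where the second derivative enters)

x_crit = newton_root(f_prime, f_second, x0=1.0)
print(x_crit, f_prime(x_crit))    # a critical point; f''(x_crit) > 0 here, so it is a local minimum
```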

Let us assume that we are trying to minimize $f(x)$ w.r.t. $x$, with $f$ twice differentiable.

Let $\nabla f(x) = f'(x)$ be the first derivative of $f(x)$ w.r.t. $x$, and $\nabla^2 f(x) = f''(x)$ be the second derivative of $f(x)$ w.r.t. $x$.

To minimize $f(x)$ w.r.t. $x$, we can use either the Gradient Descent method or Newton's Method; both are iterative methods.

$\textbf{Gradient Descent method}$ has the form:

$$ x^{t+1} = x^{t} - \eta \,\nabla f(x^{t})$$

where $\eta$ is a step size (learning rate) parameter. Here we use only the first derivative; thus, it is called a 1st-order optimization method.

$\textbf{Newton's Method}$ has the form:

$$ x^{t+1} = x^{t} - \frac{\nabla f(x^{t})}{\nabla^2 f(x^{t})} $$

Here we use both the first derivative and the second derivative; thus, this method is called a 2nd-order optimization method.
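Since the question arose while studying logistic regression, here is a small sketch (my own, with a made-up toy dataset) that applies both update rules above to the negative log-likelihood of a one-parameter logistic regression:

```python
import numpy as np

# Toy one-parameter logistic regression: data invented purely for illustration.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])   # two "noisy" labels keep the optimum finite

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w):
    # f'(w): derivative of the negative log-likelihood w.r.t. the single weight w
    return np.sum((sigmoid(w * x) - y) * x)

def hess(w):
    # f''(w): second derivative
    p = sigmoid(w * x)
    return np.sum(p * (1.0 - p) * x**2)

# Gradient descent: w_{t+1} = w_t - eta * f'(w_t)
w, eta = 0.0, 0.1
for _ in range(500):
    w = w - eta * grad(w)
print("gradient descent:", w)

# Newton's method: w_{t+1} = w_t - f'(w_t) / f''(w_t)
w = 0.0
for _ in range(10):
    w = w - grad(w) / hess(w)
print("Newton's method: ", w)
```

Both loops reach the same minimizer; Newton's method needs far fewer iterations, at the cost of computing the second derivative.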

  • Is a learning rate necessary in Newton's Method? Commonly the iteration is given without $\eta$. – Antonios Sarikas Oct 09 '24 at 13:36
  • No, it is not needed. In this case, the learning rate (step size) is adaptively determined based on the function's second derivative. It was a typo. I updated the formula. – MathLearner Oct 13 '24 at 07:47