12

This is one of those questions where I know I am wrong, but I don't know how.

I understand that when training a neural network, we calculate the derivatives of the loss function with respect to the parameters. I also understand that these derivatives indicate the instantaneous rate of change in the loss when the parameters are modified. However, after this, we update the parameters by taking a small finite step, determined by the learning rate.

My question, then, is: what is the point of calculating the instantaneous rate of change if we are actually interested in the change in loss from a finite step? Couldn't we calculate the derivative with respect to the finite step instead?

Leo Juhlin
  • 123
  • 5

4 Answers

11

Good question! Your reasoning makes sense, and your suggestion is not unreasonable. However, there are two ways in which standard gradient descent is better than what you suggest:

  1. Taking a fixed-length step (with the length determined by the learning rate) is not optimal. Conceptually, a small gradient often indicates that we are close to the optimum, and a large gradient suggests that we are far from it. So it makes sense for the length of the step to be proportional to the magnitude of the gradient: when we're close to the optimum, we want to take a small step (to avoid overshooting), and when we're far away, we want to take a large step (to converge more quickly). That is exactly what standard gradient descent does: the length of the step it takes is the product of the learning rate and the magnitude of the gradient, so we take large steps when we're far from the optimum and small steps when we're close (see the sketch after this list).

  2. There is no efficient way to find the optimal direction to go for a given finite step size. In other words, if we want to find a change $d$ to the weights that minimizes $\text{Loss}(\Theta+d)$, subject to $\|d\|=c$, there is no obvious algorithm for computing such a $d$. In comparison, the gradient can be computed efficiently, and in some sense, gradient descent is a good approximation to the kind of thing you are proposing.
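
As a concrete illustration of point 1, here is a minimal sketch of how the step length automatically shrinks as we approach the optimum. The toy quadratic loss and all names in it are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy quadratic loss with its minimum at theta = [3, -2].
target = np.array([3.0, -2.0])

def loss(theta):
    return 0.5 * np.sum((theta - target) ** 2)

def grad(theta):
    return theta - target

theta = np.zeros(2)
lr = 0.1
for _ in range(100):
    g = grad(theta)
    # The step length is ||lr * g||: it shrinks automatically as we approach
    # the optimum, because the gradient itself shrinks there.
    theta = theta - lr * g

print(theta, loss(theta))  # theta ends up very close to [3, -2]
```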

In short, gradient descent makes sense because it's a good approximation to the optimal direction, and one that can be computed efficiently. Or, to put it another way, if you look in the close vicinity of a point $\Theta$, the loss function is approximately linear, so the gradient (which speaks about infinitesimal change) is approximately the same as what you'd get if you considered a small fixed-size change.

Finally, a last benefit of gradient descent is that, thanks to the magic of backpropagation, computing the gradient is very efficient, and a single computation tells you how the function changes in every direction at once. This is much more efficient than probing each direction separately to find the best one. In particular, you can compute the gradient in $O(n)$ time, whereas computing something separately for each dimension would require $O(n^2)$ time (where $n$ is the number of dimensions in $\Theta$, i.e., the number of parameters in the model). Since current models have millions or even billions of parameters, this makes a huge difference. Training current models wouldn't be possible without gradient descent.
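
To make the efficiency point concrete, here is a minimal sketch contrasting one gradient computation with probing each parameter separately via finite differences. The quadratic loss is a hypothetical stand-in for a network; only the counting of evaluations matters:

```python
import numpy as np

# Hypothetical toy loss on n parameters; only the number of evaluations matters.
n = 200
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
A = A + A.T  # symmetric, so the gradient has a simple closed form

def loss(theta):
    return 0.5 * theta @ A @ theta

theta = rng.normal(size=n)

# One analytic gradient computation tells us how the loss changes in every
# direction at once (for a neural network, this is one backpropagation pass).
g_analytic = A @ theta

# Probing each direction separately (finite differences) needs one extra loss
# evaluation per parameter: n + 1 evaluations in total.
eps = 1e-6
base = loss(theta)
g_probed = np.array([(loss(theta + eps * np.eye(n)[i]) - base) / eps
                     for i in range(n)])

print(np.allclose(g_analytic, g_probed, atol=1e-3))  # the two agree
```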

D.W.
  • 3,651
  • 18
  • 43
10

I think the source of your confusion is this statement: "we update the parameters by taking a small finite step, determined by the learning rate".

This is not correct. We take a small finite step determined by the gradient. The learning rate simply scales the magnitude of the step along that gradient.

Say we have model weights θ (keep in mind θ here is a vector). We want to minimize Loss(θ). We need to determine relative update magnitudes/directions for each element in θ. The best way we know how to do this is by taking the gradient ∇Loss(θ). This allows us to update the model via θ' = θ - ∇Loss(θ). Empirically we find this converges poorly, so we add a learning rate α to scale the magnitude of the update: θ' = θ - α∇Loss(θ). Note that the direction of the update is determined by the gradient; α simply scales its magnitude.

You can think of this as a first-order Taylor series approximation. We can estimate Loss(θ + Δθ) ≈ Loss(θ) + ∇Loss(θ) · Δθ.
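
A quick numerical check of this first-order approximation, using a hypothetical toy loss (the function and values below are illustrative, not from the answer):

```python
import numpy as np

# Hypothetical smooth loss, just to check the first-order Taylor estimate.
def loss(theta):
    return np.sum(np.sin(theta) + theta ** 2)

def grad(theta):
    return np.cos(theta) + 2 * theta

theta = np.array([0.5, -1.2, 2.0])
delta = np.array([1e-3, -2e-3, 1.5e-3])   # a small step Δθ

exact = loss(theta + delta)
taylor = loss(theta) + grad(theta) @ delta
print(exact, taylor)   # nearly identical for small Δθ; the gap grows with ||Δθ||
```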

You ask if we can calculate the derivative with respect to the step itself - actually we can, using a second-order Taylor series approximation. We can estimate Loss(θ + Δθ) ≈ Loss(θ) + ∇Loss(θ) · Δθ + ½ Δθᵀ ∇²Loss(θ) Δθ.

However we tend to run into practical issues doing this. Say our vector θ has n elements. Then ∇Loss(θ), the vector of first derivatives (the gradient), also has n elements. Unfortunately ∇²Loss(θ), the matrix of second derivatives (the Hessian), has n × n elements. The increase in computational cost makes second-order methods impractical for large models.

Second order methods are generally better than first order methods, but first order methods are more widely used due to tradeoffs between optimization performance and computational complexity.
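
Here is a minimal sketch of that tradeoff on a hypothetical 3-parameter quadratic (nothing below refers to a real model): a first-order step versus a Newton-style second-order step.

```python
import numpy as np

# Hypothetical toy quadratic loss 0.5 * θᵀHθ with a known Hessian H.
H = np.diag([4.0, 1.0, 0.1])

def loss(theta):
    return 0.5 * theta @ H @ theta

theta = np.ones(3)
g = H @ theta        # gradient: n numbers
hessian = H          # Hessian: n x n numbers -- this is what becomes prohibitive

# First-order step: cheap, but needs a learning rate and many iterations.
theta_gd = theta - 0.1 * g

# Second-order (Newton) step: solves the local quadratic model exactly,
# but requires forming and solving with the n x n Hessian.
theta_newton = theta - np.linalg.solve(hessian, g)

print(loss(theta_gd), loss(theta_newton))  # the Newton step lands on the minimum here
```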

Karl
  • 1,176
  • 5
  • 7
8

You are right that if we have some step size $\eta>0$ in mind, the gradient will not be the optimal direction to move in. Say that $l$ is our loss and $w$ are our neural network weights. Then there is in fact some optimal direction: $$ \delta^* = \underset{\delta\in\mathbb{R}^P,\Vert\delta\Vert=1}{\textrm{argmin}} \, l(w- \eta\delta) \,, $$ and this will give better loss than the gradient descent direction with step size $\eta$, that is, $l(w-\eta\delta^*)\leq l(w-\eta\frac{\nabla_w l(w)}{\Vert\nabla_w l(w)\Vert})$. If $\eta$ is sizable, it might be much better!
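
In two dimensions we can actually approximate $\delta^*$ by brute force and compare it with the normalized gradient direction. A minimal sketch, using the Rosenbrock function as a hypothetical curved toy loss in place of a neural network:

```python
import numpy as np

# Hypothetical 2D toy loss (Rosenbrock), curved enough that the best
# finite-step direction need not coincide with the gradient direction.
def l(w):
    return (1 - w[0]) ** 2 + 100 * (w[1] - w[0] ** 2) ** 2

def grad(w):
    return np.array([
        -2 * (1 - w[0]) - 400 * w[0] * (w[1] - w[0] ** 2),
        200 * (w[1] - w[0] ** 2),
    ])

w = np.array([-0.5, 0.5])
eta = 0.25

# Brute-force search over unit directions delta -- only feasible here because
# the parameter space is 2-dimensional.
angles = np.linspace(0, 2 * np.pi, 2000)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
losses = np.array([l(w - eta * d) for d in dirs])
delta_star = dirs[np.argmin(losses)]

g_unit = grad(w) / np.linalg.norm(grad(w))
# delta_star does at least as well as the gradient direction (up to the grid).
print(l(w - eta * delta_star), l(w - eta * g_unit))
```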

So why do we use the gradient instead of using $\delta^*$? It's because finding $\delta^*$ is about as hard as finding the optimal neural network parameters $w$ in the first place: to find $\delta^*$ we would still need to do an optimization over $l$, which is what we are currently trying to solve!

We use the gradient not because it is the optimal direction to move in for a fixed step size $\eta$: it isn't. We use the gradient because it can be computed easily. In many realistic problems, the gradient actually does a bad job of telling us what direction to move in for larger $\eta$; this is where the idea of "preconditioning" comes from, and indeed this is (in part) what motivates the cottage industry of alternative first-order optimization methods that has proliferated in ML research, such as Adam, RMSProp, etc.
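
For instance, Adam keeps cheap per-coordinate running statistics of the gradient and uses them as a diagonal preconditioner on top of the plain gradient direction. A minimal sketch of that update (the function name and the assumed `grad` callable are illustrative, not taken from any particular library):

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update; `grad` is any function returning dLoss/dtheta."""
    m, v, t = state
    g = grad(theta)
    t += 1
    m = b1 * m + (1 - b1) * g          # running average of gradients
    v = b2 * v + (1 - b2) * g ** 2     # running average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    # Each coordinate is rescaled by 1 / sqrt(v_hat): a cheap, diagonal form
    # of preconditioning applied on top of the gradient direction.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Usage: start with state = (np.zeros_like(theta), np.zeros_like(theta), 0).
```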

By the way, why is it that the gradient is easily computed but this optimal step direction isn't? This is a manifestation of a general phenomenon about complicated high-dimensional functions: only "local" information about such functions is tractable. The gradient depends only on the behavior of the function in an arbitrarily small neighborhood of a point, i.e. it is a local quantity. On the other hand, finding $\delta^*$ depends on the behavior of the function on a sphere of fixed, nontrivial radius, and hence it is intractable. For similar reasons, the integral of a complicated high-dimensional function is computationally intractable.

7

The ambiguity may come from what you call a "finite step". If you have a function f that depends on a variable x and you want to find its minimum, you would compute its derivative with respect to x (the abscissa on a 2D plot, and what you might call the axis/dimension along which a finite step is taken).
In deep learning, the function f to minimize is the loss, and the variable x is the parameter vector theta (usually a vector rather than a scalar), so the "step" is taken along the theta axes (scaled by the learning rate).

If we imagine that the loss is a function giving the altitude on a map as a function of two directions, south-north (theta1) and east-west (theta2), then the opposite of the gradient points along the steepest slope around us, i.e. the direction to walk to reach the valley or low point.
Since the gradient magnitude can be large, we use a scaling factor called the learning rate so that we don't overshoot, and we iteratively take small steps (-lr * gradient(Loss)) toward the lowest point. Hopefully the lowest point: this method can end up in a local minimum rather than the global one, depending on the map/loss function and on where the random initialization makes us start on the map.
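
A minimal sketch of this picture, with a hypothetical two-valley "altitude" function standing in for the loss (all names below are illustrative only):

```python
import numpy as np

# Hypothetical "altitude map" with two valleys (low points), one near
# theta1 = +1 and one near theta1 = -1.
def altitude(theta):
    t1, t2 = theta
    return (t1 ** 2 - 1) ** 2 + t2 ** 2

def gradient(theta):
    t1, t2 = theta
    return np.array([4 * t1 * (t1 ** 2 - 1), 2 * t2])

lr = 0.05
for start in (np.array([0.8, 1.0]), np.array([-0.8, 1.0])):
    theta = start.copy()
    for _ in range(200):
        theta = theta - lr * gradient(theta)   # small step: -lr * gradient(Loss)
    # Different starting points walk down into different valleys.
    print(start, "->", np.round(theta, 3))
```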

rehaqds
  • 1,801
  • 4
  • 13