I am only just getting familiar with gradient descent through learning logistic regression. I understand that the directional component of the gradient vector is correct information, derived from the slope of the cost with respect to the weights. But what about the magnitude? Why isn't it arbitrary? If this magnitude is not even roughly proportional to the distance from the minimum, then it seems arbitrary to me. As for the direction, the magnitude isn't needed: the vector itself already tells us the direction.
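To make the question concrete, this is the update I have in mind (a plain gradient-descent step, with weights $w$, learning rate $\eta$, and cost $J$):

$$w \leftarrow w - \eta \, \nabla_w J(w)$$

My question is about what information the factor $\|\nabla_w J(w)\|$ contributes to the size of this step.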
I understand that this magnitude is the derivative of the cost function, which reflects the rate of change of the cost at a given point. What information does this rate of change give us, then? Is it about the distance to the minimum? It can't be, because local steepness (which is what the gradient magnitude represents) tells us nothing about the distance to the minimum: the slope can be very steep at one point while the minimum is right next to it. On the other hand, if the magnitude were arbitrary, we wouldn't multiply it by the learning rate in the update. So I don't understand what information we are actually getting from this value.
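To illustrate the "steep slope right next to the minimum" case I mean, here is a minimal Python sketch I put together. The two toy 1-D cost functions are hypothetical examples of my own (not from logistic regression): for $f_1(x) = x^2$ the slope magnitude is proportional to the distance from the minimum at $x = 0$, while for $f_2(x) = \sqrt{|x|}$ the slope actually grows as you get closer to the minimum.

```python
import numpy as np

# Two toy 1-D cost functions, both with their minimum at x = 0.
# f1(x) = x^2:       |grad| = 2|x|, shrinks as we approach the minimum.
# f2(x) = sqrt(|x|): |grad| = 0.5/sqrt(|x|), grows as we approach the minimum,
#                    so the gradient can be very steep right next to it.

def grad_f1(x):
    return 2.0 * x

def grad_f2(x):
    return np.sign(x) * 0.5 / np.sqrt(abs(x))

def gradient_descent(grad, x0, lr=0.1, steps=5):
    x = x0
    for i in range(steps):
        x = x - lr * grad(x)  # plain update: step size = lr * |gradient|
        print(f"step {i + 1}: x = {x:.4f}, |grad| = {abs(grad(x)):.4f}")
    return x

print("f(x) = x^2 (slope proportional to distance):")
gradient_descent(grad_f1, x0=1.0)

print("f(x) = sqrt(|x|) (slope steepest next to the minimum):")
gradient_descent(grad_f2, x0=1.0)
```

Running this shows the gradient magnitude shrinking toward the minimum for $f_1$ but growing for $f_2$, which is why I don't see how the magnitude can encode distance to the minimum in general.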
(I asked ChatGPT, but it just gave circular explanations, so it wasn't much help.)
