
If I did loss = loss/10 before calculating the gradient, would that change the amount of change applied to the model parameters during backpropagation?

Or does the amount of change depend only on the direction of the gradient and the learning rate?

I'm especially interested in how this works in PyTorch.

1 Answer


By the chain rule, scaling the loss by a scalar value c, i.e. loss = c*loss, causes every gradient computed via backprop to be scaled by c as well: loss -> c*loss implies grad -> c*grad.

Scaling the gradients by c changes the magnitude of the gradient vectors but not their direction.
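
Here is a minimal PyTorch sketch to check this; the linear model, the random data, and the factor c = 0.1 are just illustrative choices, not anything from the question:

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
c = 0.1

# gradients of the unscaled loss
torch.nn.functional.mse_loss(model(x), y).backward()
grad_unscaled = model.weight.grad.clone()

# gradients of the scaled loss
model.zero_grad()
(c * torch.nn.functional.mse_loss(model(x), y)).backward()
grad_scaled = model.weight.grad.clone()

# same direction, magnitude scaled by c
print(torch.allclose(grad_scaled, c * grad_unscaled))  # True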

In a gradient descent context, scaling the gradients by c is equivalent to scaling the learning rate by c, i.e.:

loss = ...
w_new = w_old - lr * grad

becomes

loss = c*loss
w_new = w_old - lr * c * grad # scaling loss by c -> scaling grad by c
w_new = w_old - lr_scaled * grad # lr_scaled = lr * c
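
As a quick PyTorch check of that equivalence (this assumes a plain SGD update with no momentum, matching the gradient descent setting above; the model, data, lr, and c values are arbitrary illustrations):

import copy
import torch

torch.manual_seed(0)
# model_a takes one step on the scaled loss with lr;
# model_b takes one step on the unscaled loss with lr * c.
model_a = torch.nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
lr, c = 0.1, 0.5

opt_a = torch.optim.SGD(model_a.parameters(), lr=lr)
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr * c)

(c * torch.nn.functional.mse_loss(model_a(x), y)).backward()
opt_a.step()

torch.nn.functional.mse_loss(model_b(x), y).backward()
opt_b.step()

# the resulting weights match up to floating point tolerance
print(torch.allclose(model_a.weight, model_b.weight))  # True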
Karl