
I don't understand the following statement:

The choice of the learning rate $m$ does not matter (for the Perceptron) because it just changes the scaling of $w$ (the weights).

This statement comes from the site that Wikipedia cites.

The Perceptron update rule is $w \pm x$ for misclassified cases. If the update to the weight vector is an addition or a subtraction, how can the learning rate only scale the weights?
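To make sure I'm reading the rule correctly, here is a minimal sketch of what I understand the mistake-driven update to be (the function name and use of NumPy are just for illustration):

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    """One mistake-driven Perceptron update on example (x, y) with y in {+1, -1}."""
    if y * np.dot(w, x) <= 0:   # the example is misclassified (or on the boundary)
        w = w + lr * y * x      # this is the "w ± x" update, scaled by the learning rate
    return w
```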

2 Answers


After some study, I figured out the answer and want to share it in case someone else also finds it helpful. The loss function of the Perceptron is a hinge-type loss (the perceptron criterion),

$J(w) = \max(0, -y w^T x)$.

Multiplying by a positive constant $\alpha$ does not change the sign of $y w^T x$, so it does not change which examples count as mistakes; it only scales the loss. In other words,

$J_2(w) = \max(0, -\alpha y w^T x) = \alpha J(w)$.

If we do gradient descent using $J_2$, we have

$\dfrac{\partial J_2}{\partial w} = \begin{cases} 0, & \text{if } J_2 = 0; \\ -\alpha y x, & \text{otherwise.} \end{cases}$

So the gradient-descent update on a mistake is

$w_{new} = w_{old} + \alpha y x$, i.e. $w_{new} = w_{old} \pm \alpha x$.

Starting from $w = 0$, the weight vector obtained with learning rate $\alpha$ is therefore exactly $\alpha$ times the one obtained with learning rate $1$ at every step, and scaling $w$ by a positive constant does not change the sign of $w^T x$. So as long as $\alpha > 0$, the learning rate does not change the Perceptron's decision at any step. This is why, for the Perceptron, you can simply set the learning rate to $1$.

To answer the question specifically: when people say "the learning rate only scales $w$", they are referring to $J_2(w) = \max(0, -\alpha y w^T x)$ rather than to $w_{new} = w_{old} \pm \alpha x$.
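Here is a minimal sketch of this argument in code (the toy data, function name, and epoch count are my own, not from any of the cited sources): starting from $w = 0$, training with learning rate $\alpha$ produces a weight vector that is exactly $\alpha$ times the one produced with learning rate $1$, and the two make identical decisions.

```python
import numpy as np

def train_perceptron(X, y, lr, epochs=20):
    """Mistake-driven Perceptron training starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # mistake: update, scaled by the learning rate
                w += lr * yi * xi
    return w

# Toy linearly separable data; the last column is a constant 1 acting as the bias.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w1 = train_perceptron(X, y, lr=1.0)
w2 = train_perceptron(X, y, lr=0.5)

print(w1, w2)                                             # w1 is exactly 2 * w2
print(np.array_equal(np.sign(X @ w1), np.sign(X @ w2)))   # identical decisions: True
```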

A related question I found very helpful is "Normalizing the final weights vector in the upper bound on the Perceptron's convergence".


The learning rate regulates the amount by which the weights will change at every step $t$ of the gradient descent algorithm.

It is not true that the learning rate can be set to an arbitrary value. A learning rate that is too small will make convergence to a minimum impractically slow, and a learning rate that is too large will cause the model parameters to oscillate around a minimum, or even diverge.

I explained in more detail how the basic gradient descent algorithm is affected by the choice of the learning rate, and proposed some better alternative methods, in my answer to Do adaptive learning optimizers follow the steepest decent?.

More details

More accurately, the update rule for gradient descent is

$w^{t+1} = w^t - m\nabla J(w^t)$

where $m$ is the learning rate and

$J(w) = \frac{1}{2}\sum(y-\hat{y})^2$

is the cost function. Thus, the learning rate does not scale the weights themselves in a multiplicative sense; however, it does scale the correction that the weights receive at each step of the algorithm.
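As a quick illustration (the variable names and toy data are my own), a single gradient-descent step on the squared-error cost shows that the learning rate multiplies the correction, not the weights:

```python
import numpy as np

def gd_step(w, X, y, m):
    """One gradient-descent step on J(w) = 0.5 * sum((y - Xw)^2)."""
    grad = -X.T @ (y - X @ w)   # dJ/dw for the squared-error cost
    return w - m * grad         # the learning rate m scales the correction, not w itself

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

step_small = gd_step(w0, X, y, m=0.01) - w0
step_large = gd_step(w0, X, y, m=0.10) - w0
print(step_large / step_small)  # elementwise ratio is 10: the step scales with m
```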
