I want to implement the momentum algorithm to train a neural network, but I'm uncertain about where the regularization term should be incorporated. For ridge regularization, one option is to have:

$$ m \leftarrow \beta m - \eta \nabla f(W) $$
$$ W \leftarrow W + m - \lambda W $$

and the other option is:

$$ m \leftarrow \beta m - \eta \nabla f(W) - \lambda W $$
$$ W \leftarrow W + m $$

Here, $\beta \in [0,1)$ is the momentum parameter, $\eta > 0$ is the learning rate, $\lambda \geq 0$ is the regularization parameter, $f$ denotes the loss function of the neural network, and $W$ refers to the parameters being updated.
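
To make the difference concrete, the two options can be sketched in NumPy as follows. This is only a minimal sketch: the function names `step_option1` and `step_option2`, the toy quadratic loss, and the hyperparameter values are assumptions chosen for illustration and do not come from any library.

```python
import numpy as np


def step_option1(W, m, grad, beta=0.9, eta=0.1, lam=0.01):
    """Option 1: the decay term lam * W is applied directly to W."""
    m = beta * m - eta * grad      # momentum sees only the loss gradient
    W = W + m - lam * W            # decay bypasses the momentum buffer
    return W, m


def step_option2(W, m, grad, beta=0.9, eta=0.1, lam=0.01):
    """Option 2: the decay term lam * W is accumulated in the momentum."""
    m = beta * m - eta * grad - lam * W   # decay enters the momentum buffer
    W = W + m
    return W, m


# Hypothetical toy loss f(W) = 0.5 * ||W - target||^2, so grad f(W) = W - target.
target = np.array([1.0, -2.0])


def grad_f(W):
    return W - target


W1, m1 = np.zeros(2), np.zeros(2)
W2, m2 = np.zeros(2), np.zeros(2)
for _ in range(200):
    W1, m1 = step_option1(W1, m1, grad_f(W1))
    W2, m2 = step_option2(W2, m2, grad_f(W2))

print("option 1:", W1)
print("option 2:", W2)
```

With $\lambda = 0$ the two functions perform the same update. With $\lambda > 0$ the only difference is whether the term $\lambda W$ is carried forward in $m$ to later steps (option 2) or applied once per step outside of it (option 1).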

Which option makes more sense? I believe option 2 is the more reasonable choice, but I'm not entirely sure.
