I want to implement gradient descent with momentum to train a neural network, but I'm unsure where the regularization term should go. For ridge ($L_2$) regularization, one option is:
$$ m \leftarrow \beta m - \eta \nabla f(W) $$ $$ W \leftarrow W + m - \lambda W $$
and the other option is:
$$ m \leftarrow \beta m - \eta \nabla f(W) - \lambda W $$ $$ W \leftarrow W + m $$
Here, $\beta \in [0,1)$ is the momentum parameter, $\eta > 0$ is the learning rate, $\lambda \geq 0$ is the regularization parameter, $f$ denotes the objective being minimized (the network's loss as a function of its parameters), and $W$ refers to the parameters to be updated.
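To make the two options concrete, here is a minimal NumPy sketch of both update rules. It is only an illustration: the toy quadratic loss in `grad_f`, the function names `step_option1`, `step_option2` and `run`, and the default hyperparameter values are all assumptions of mine, not part of any library or of an actual network.

```python
import numpy as np


def grad_f(W):
    # Hypothetical toy objective f(W) = 0.5 * ||W - 1||^2, so grad f(W) = W - 1.
    # A real network would supply its own loss gradient here.
    return W - 1.0


def step_option1(W, m, beta, eta, lam):
    # Option 1: the decay term -lam * W is applied directly to the weights,
    # outside the momentum buffer.
    m = beta * m - eta * grad_f(W)
    W = W + m - lam * W
    return W, m


def step_option2(W, m, beta, eta, lam):
    # Option 2: the decay term -lam * W is accumulated in the momentum buffer
    # together with the gradient step.
    m = beta * m - eta * grad_f(W) - lam * W
    W = W + m
    return W, m


def run(step, n_steps=200, beta=0.9, eta=0.1, lam=0.01):
    # Iterate one of the two update rules from W = 0, m = 0 (assumed start).
    W = np.zeros(3)
    m = np.zeros(3)
    for _ in range(n_steps):
        W, m = step(W, m, beta, eta, lam)
    return W


W_option1 = run(step_option1)
W_option2 = run(step_option2)
print("option 1:", W_option1)
print("option 2:", W_option2)
```

With $\beta = 0$ the two rules reduce to the same update, $W \leftarrow W - \eta \nabla f(W) - \lambda W$. For $\beta > 0$ they differ, because in option 2 the term $-\lambda W$ is carried forward by the momentum buffer while in option 1 it is not.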
Which option makes more sense? I believe option 2 is the more reasonable choice, but I'm not entirely sure.