
It seems that the Adaptive Moment Estimation (Adam) optimizer nearly always works better (converging faster and more reliably toward a global minimum) when minimising the cost function while training neural nets.

Why not always use Adam? Why even bother using RMSProp or momentum optimizers?
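For concreteness, here is a minimal sketch of the three update rules I mean, written in plain NumPy just to fix notation (the variable names and the commonly quoted default hyperparameters are mine, not taken from any particular library):

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially decaying sum of past gradients.
    v = beta * v + grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    # RMSProp: scale each step by a running average of squared gradients.
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: combine both ideas, with bias correction for the running averages.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # t is the 1-based step count
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Written side by side like this, Adam is essentially RMSProp's per-parameter scaling applied to a momentum-style smoothed gradient, so the question is really whether that combination is always the right choice.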

PyRsquared

2 Answers


Here’s a blog post reviewing a paper which argues that SGD generalizes better than Adam.

There is often value in using more than one method (an ensemble), because every method has a weakness.
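If you want to test the generalization claim (and the value of an ensemble) on your own data, a minimal sketch, assuming a toy synthetic dataset and arbitrary hyperparameters of my own choosing, would be to train identical models with SGD and with Adam, compare their held-out loss, and then average their predictions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 20)
y = (X[:, :5].sum(dim=1, keepdim=True) > 0).float()   # synthetic binary labels
X_tr, y_tr, X_va, y_va = X[:800], y[:800], X[800:], y[800:]
loss_fn = nn.BCEWithLogitsLoss()

def train(make_optimizer, epochs=200):
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = make_optimizer(model.parameters())
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    return model

sgd_model = train(lambda p: torch.optim.SGD(p, lr=0.1, momentum=0.9))
adam_model = train(lambda p: torch.optim.Adam(p, lr=1e-3))

with torch.no_grad():
    for name, m in [("SGD", sgd_model), ("Adam", adam_model)]:
        print(name, "validation loss:", loss_fn(m(X_va), y_va).item())
    # Simple two-member ensemble: average the two models' logits.
    ens_logits = (sgd_model(X_va) + adam_model(X_va)) / 2
    print("Ensemble validation loss:", loss_fn(ens_logits, y_va).item())
```

On a real problem you would of course repeat this over several seeds and tune the learning rates before drawing any conclusion.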

Zephyr

You should also take a look at this post comparing different gradient descent optimizers. As the animations there show, Adam is clearly not the best optimizer for every task; on some loss surfaces several of the other methods converge faster.
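If the figures from that post aren't enough, a quick way to run the same kind of comparison yourself is the sketch below; the Rosenbrock surface, learning rates, and step counts are arbitrary choices of mine, just for illustration:

```python
import torch

def rosenbrock(p):
    # Toy 2-D test surface with a global minimum at (1, 1).
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def run(opt_cls, steps=2000, **kwargs):
    p = torch.tensor([-1.5, 2.0], requires_grad=True)
    opt = opt_cls([p], **kwargs)
    for _ in range(steps):
        opt.zero_grad()
        rosenbrock(p).backward()
        opt.step()
    return p.detach(), rosenbrock(p.detach()).item()

for name, cls, kw in [
    ("SGD+momentum", torch.optim.SGD, dict(lr=1e-4, momentum=0.9)),
    ("RMSprop", torch.optim.RMSprop, dict(lr=1e-2)),
    ("Adam", torch.optim.Adam, dict(lr=1e-2)),
]:
    point, loss = run(cls, **kw)
    print(f"{name:14s} final point {point.numpy()}, loss {loss:.4f}")
```

Depending on the surface and the learning rates, the ranking changes, which is exactly why it is worth trying more than one optimizer rather than defaulting to Adam.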