I'm a beginner in machine learning and I'm looking for an optimizer for gradient descent. I've read many topics about this and put together a review of the state of the art of these optimizers. I have just one problem that I can't figure out. Don't judge me please, but I would like to know:

Are we using the Adam optimizer alone, or are we obliged to combine it with SGD? I don't understand whether it works on its own, or whether it is there to optimize NOT the neural network but the SGD of the neural network.

2 Answers

Adam optimization is an extension of stochastic gradient descent (SGD) optimization.

SGD maintains a single learning rate for all weight updates, and that learning rate does not change during training (unless you add a schedule yourself).

Adam maintains an effective learning rate for each individual weight and adapts those learning rates during training.
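
To make that concrete, here is a minimal sketch (assuming PyTorch and a made-up toy model and data): Adam is used on its own, in exactly the spot where plain SGD would otherwise go; you pick one or the other.

```python
import torch

model = torch.nn.Linear(10, 1)    # hypothetical tiny model
inputs = torch.randn(32, 10)      # hypothetical batch of data
targets = torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain SGD
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # Adam, used on its own in SGD's place

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()                  # Adam adapts a per-parameter step size internally
```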

You cannot use Adam entirely on its own: you have to combine Adam (the optimizer) with SGD (Stochastic Gradient Descent), BGD (Batch Gradient Descent) or MBGD (Mini-Batch Gradient Descent), which describe how data is taken from the dataset to do gradient descent with optimizers such as Adam, RMSprop, Adadelta, Adagrad, etc.

Stochastic Gradient Descent (SGD):

  • is a way of taking data from the dataset to do gradient descent with optimizers such as Adam, RMSprop, Adadelta, Adagrad, etc.
  • does gradient descent with every single sample of the whole dataset, one sample at a time, so it takes as many steps in one epoch as there are samples in the dataset. For example, if the whole dataset has 100 samples (1x100), then gradient descent happens 100 times in one epoch, which means the model's parameters are updated 100 times per epoch (see the sketch after this list).
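
A minimal sketch of that idea (assuming PyTorch and a made-up dataset of 100 samples): the batch size controls how data is taken from the dataset, while the update rule (Adam here) stays the same.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(100, 10), torch.randn(100, 1)   # hypothetical dataset of 100 samples
dataset = TensorDataset(X, y)

loader = DataLoader(dataset, batch_size=1)       # SGD-style: 1 sample per step -> 100 updates per epoch
# loader = DataLoader(dataset, batch_size=100)   # BGD-style: whole dataset per step -> 1 update per epoch
# loader = DataLoader(dataset, batch_size=10)    # MBGD-style: mini-batches -> 10 updates per epoch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # the update rule stays Adam either way
loss_fn = torch.nn.MSELoss()

for xb, yb in loader:                            # one epoch
    optimizer.zero_grad()
    loss_fn(model(xb), yb).backward()
    optimizer.step()
```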

Adam (Adaptive Moment Estimation) (2014):

  • is the optimizer which does gradient descent by automatically adapting the learning rate to each parameter, considering past and current gradients and giving much more importance to newer gradients than Momentum (1964) does, using EWA to accelerate convergence by mitigating fluctuation (see the sketch after this list).

    Memos:

    • The learning rate is not fixed.
    • EWA (Exponentially Weighted Average) is an algorithm that smooths a trend (i.e. mitigates its fluctuation) by considering past and current values, giving more importance to newer values.
    • EWA is also called EWMA (Exponentially Weighted Moving Average).
  • is the combination of Momentum (1964) and RMSProp (2012).

  • uses Momentum (1964)'s EWA of the gradients as the update direction, instead of RMSProp (2012)'s raw current gradient.
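
For reference, a rough sketch of one Adam update step (plain NumPy, with made-up values for the parameters, the gradient and the optimizer state), showing the two EWAs and the per-parameter step size:

```python
import numpy as np

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8   # common default hyperparameters

theta = np.array([0.5, -1.2])    # hypothetical parameters
g = np.array([0.1, -0.3])        # hypothetical current gradient
m = np.zeros_like(theta)         # EWA of gradients (the Momentum part)
v = np.zeros_like(theta)         # EWA of squared gradients (the RMSProp part)
t = 1                            # step counter

m = beta1 * m + (1 - beta1) * g        # newer gradients get more weight
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)             # bias correction for the zero-initialized EWAs
v_hat = v / (1 - beta2**t)
theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective step size
```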