You cannot use Adam alone; you have to use Adam(the optimizer) together with SGD(Stochastic Gradient Descent), BGD(Batch Gradient Descent) or MBGD(Mini-Batch Gradient Descent), which are the ways of taking data from a dataset to do gradient descent with optimizers such as Adam, RMSprop, Adadelta, Adagrad, etc.
Stochastic Gradient Descent(SGD):
- is the way of taking data from a dataset to do gradient descent with optimizers such as Adam, RMSprop, Adadelta, Adagrad, etc.
- can do gradient descent with every single sample of a whole dataset, one sample at a time, taking as many steps in one epoch as there are samples in the dataset. For example, if a whole dataset has 100 samples(1x100), then gradient descent happens 100 times in one epoch, which means the model's parameters are updated 100 times in one epoch, as sketched below.
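Below is a minimal sketch of this one-sample-at-a-time setup in PyTorch, assuming a hypothetical 100-sample dataset and a toy linear model(both invented for illustration); with `batch_size=1`, the Adam optimizer performs 100 parameter updates per epoch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 samples, 1 feature each (the 1x100 example above).
x = torch.randn(100, 1)
y = 3.0 * x + 0.5

dataset = TensorDataset(x, y)
# batch_size=1 -> SGD-style sampling: one sample per gradient descent step.
loader = DataLoader(dataset, batch_size=1, shuffle=True)

model = nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam as the optimizer
loss_fn = nn.MSELoss()

for epoch in range(1):                    # one epoch
    for sample_x, sample_y in loader:     # 100 iterations -> 100 parameter updates
        optimizer.zero_grad()
        loss = loss_fn(model(sample_x), sample_y)
        loss.backward()
        optimizer.step()                  # gradient descent with Adam on this single sample
```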
Adam(Adaptive Moment Estimation)(2014):
- is the optimizer which can do gradient descent by automatically adapting the learning rate to each parameter and considering the past and current gradients with EWA, giving much more importance to newer gradients than Momentum(1964) does, to accelerate convergence by mitigating fluctuation.
*Memos:
- The learning rate is not fixed.
- EWA(Exponentially Weighted Average) is the algorithm to smooth a trend(to mitigate the fluctuation of a trend) by considering the past and current values, giving more importance to newer values.
- EWA is also called EWMA(Exponentially Weighted Moving Average).
- is the combination of Momentum(1964) and RMSProp(2012).
- uses Momentum(1964)'s EWA of past gradients together with RMSProp(2012)'s EWA of past squared gradients, as sketched below.
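Below is a minimal sketch of EWA in plain Python, assuming a hypothetical noisy series and a hypothetical decay factor `beta=0.9`; newer values get the weight `(1 - beta)` while older values decay geometrically, which smooths the trend:

```python
# Exponentially Weighted Average (EWA / EWMA) of a noisy series.
# Larger beta -> smoother trend, but slower to react to new values.
def ewa(values, beta=0.9):
    smoothed = []
    v = 0.0
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x          # older values decay, newer values weigh more
        smoothed.append(v / (1 - beta ** t))   # bias correction for the early steps (Adam uses the same fix)
    return smoothed

noisy = [1.0, 3.0, 2.0, 4.0, 3.5, 5.0]
print(ewa(noisy))  # smoother trend than the raw values
```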
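And below is a minimal sketch of a single Adam update for one scalar parameter, assuming the default hyperparameters from the 2014 paper; the first moment `m` is a Momentum-style EWA of past gradients and the second moment `v` is an RMSProp-style EWA of past squared gradients, which is what adapts the effective learning rate per parameter:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: EWA of gradients (Momentum-style).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: EWA of squared gradients (RMSProp-style).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early steps (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step: the effective learning rate is scaled by 1 / sqrt(v_hat).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# One update for a single scalar parameter with gradient 0.5.
p, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p, m, v)
```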