I want to understand the theory behind stochastic gradient descent. That is, questions such as: when and why does the algorithm converge with high probability? At what rate does it converge? I am specifically interested in the method as it applies to typical neural network training, where one divides the dataset into batches and computes the gradient on each batch. How does the variance of the gradient estimate depend on the batch size, and how does this affect training?
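To fix notation (this is just the standard setup I have in mind, nothing specific to any one reference): write the training objective as an average of per-example losses and the minibatch gradient as

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta), \qquad \hat g_B(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta),$$

where $\mathcal{B}$ is a batch of size $B$. If the batch is sampled uniformly with replacement, $\hat g_B$ is an unbiased estimator of $\nabla L(\theta)$ and its covariance scales like $1/B$; it is precisely this kind of statement, and how it enters the convergence rates, that I would like to see treated rigorously.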
There are many recent variants of stochastic gradient descent, such as Adam or the addition of momentum. There are also many questions that I think are still not completely understood, such as when the algorithm gets stuck in a local minimum and how this affects performance.
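For concreteness, the variants I have in mind are, schematically (notation mine, in one common presentation):

$$\text{momentum:}\quad v_{t+1} = \beta v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1},$$

$$\text{Adam:}\quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2,\quad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where $g_t$ is the stochastic gradient and $\hat m_t, \hat v_t$ are the bias-corrected averages. I mention them only to indicate the kind of algorithms whose theory I would like to understand, not to ask about any one of them specifically.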
In any case, I would like to grasp the theory as it is understood at present. Ideally I'd begin by reading a textbook on the subject, even if it is not up to date.
So which textbooks or papers can you suggest for this purpose? What keywords would you use to search for the relevant literature on Google Scholar?