I want to understand the theory behind stochastic gradient descent. That is, questions such as: when and why does the algorithm converge with high probability? At what rate does it converge? I am specifically interested in the method as it applies to typical neural network training, where one divides the dataset into batches and computes the gradient on each batch. How does the variance of the gradient estimate depend on the batch size, and how does this affect training?
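To fix notation (this is just the standard setup I have in mind, nothing specific to any one reference): write the training objective as an average of per-example losses and the minibatch gradient as

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta), \qquad \hat g_B(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\theta),$$

where $\mathcal{B}$ is a batch of size $B$. If the batch is sampled uniformly with replacement, $\hat g_B$ is an unbiased estimator of $\nabla L(\theta)$ and its covariance scales like $1/B$; it is precisely this kind of statement, and how it enters the convergence rates, that I would like to see treated rigorously.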
There are many recent variants of stochastic gradient descent, such as Adam or the addition of momentum. There are also many questions that I think are still not completely understood, such as when the algorithm gets stuck in a local minimum and how this affects performance.
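For concreteness, the variants I have in mind are, schematically (notation mine, in one common presentation):

$$\text{momentum:}\quad v_{t+1} = \beta v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1},$$

$$\text{Adam:}\quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2,\quad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where $g_t$ is the stochastic gradient and $\hat m_t, \hat v_t$ are the bias-corrected averages. I mention them only to indicate the kind of algorithms whose theory I would like to understand, not to ask about any one of them specifically.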
In any case, I would like to grasp the theory as it is understood at present. Ideally I'd begin by reading a textbook on the subject, even if it is not up to date.
So which textbooks or papers can you suggest for this purpose? What keywords would you use to search for the relevant literature on Google Scholar?