
Setting:


Fix positive integers $m_1,\dots,m_{n+1}$, let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be a $C^1$ activation function with Lipschitz derivative, and let $NN$ denote the set of all feed-forward neural networks of the form $$ f_{\theta}(x):= A_n\circ \sigma \circ \dots \circ \sigma \circ A_1(x), $$ where $A_i(x)=M_ix+b_i$, with $M_i$ an $m_{i+1}\times m_{i}$ matrix and $b_i$ a vector in $\mathbb{R}^{m_{i+1}}$, and $\theta=(M_1,\dots,M_n,b_1,\dots,b_n) \in \mathbb{R}^L$.

Note: $\sigma$ is applied component-wise.

Let $\ell:\mathbb{R}^{m_{n+1}}\rightarrow [0,\infty)$ be a $C^1$ function with Lipschitz gradient, let $\{x_i\}_{i=1}^N$ be elements of $\mathbb{R}^{m_1}$, and let $$ L(\theta)\triangleq \frac1{N}\sum_{i=1}^N \ell(f_{\theta}(x_i)) . $$
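For concreteness, here is a minimal NumPy sketch of this setup. All names (`sigma`, `f_theta`, `L`) are illustrative, not from the question; softplus is used only as one example of a $C^1$ activation with Lipschitz derivative.

```python
import numpy as np

def sigma(z):
    # example C^1 activation with Lipschitz derivative (softplus),
    # applied component-wise as in the question
    return np.logaddexp(0.0, z)

def f_theta(x, weights, biases):
    # weights[i] has shape (m_{i+2}, m_{i+1}); sigma is applied between
    # layers but not after the final affine map A_n
    a = x
    for M, b in zip(weights[:-1], biases[:-1]):
        a = sigma(M @ a + b)
    return weights[-1] @ a + biases[-1]

def L(weights, biases, xs, loss=lambda y: np.sum(y ** 2)):
    # empirical risk: average of ell(f_theta(x_i)) over the N data points;
    # the squared norm stands in for a generic nonnegative C^1 loss
    return np.mean([loss(f_theta(x, weights, biases)) for x in xs])
```

This is only a sketch of the objects being optimized, not an SGD implementation.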


Question:

Suppose we want to optimize $L(\theta)$ over $\mathbb{R}^L$ using stochastic gradient descent (SGD), or its mini-batch variant. Then I have the following questions:

  1. What is the complexity of computing a single gradient update? This is probably a function of the data size, no?

  2. Is the convergence rate of the (resp. mini-batch) SGD a function of $N$ (and/or the batch-size $b$)?

Intuitively, at least one of these two must depend on the data size (and the mini-batch size).

I'm looking for non-empirical results, and I don't want to assume that $\sigma$ is an affine map (as in the classical MLP).

  • See this post for some results in the nonconvex setting – ProAmateur Mar 18 '20 at 05:16
  • Thanks, but I still don't believe the post contains anything about data size. –  Mar 18 '20 at 15:02
  • I seem to have misunderstood. Can you clarify what you mean by dependence on the data size? I presume you are talking about solving an empirical risk minimization problem where the empirical risk is calculated using $N$ data points. In that case, the number of SGD steps needed to converge to a solution of the ERM (when each step uses only one point to compute the stochastic gradient) can be determined by looking at the results in the papers in the above post – ProAmateur Mar 18 '20 at 15:17
  • In particular, looking at this paper, note that the only term that changes with the number of data points used for the stochastic gradient estimate is the variance of the stochastic gradient $\sigma^2$. You can compute its dependence on the number of data points $N$ – ProAmateur Mar 18 '20 at 15:22
  • I added a lot of details to the problem. I hope this helps. –  Mar 25 '20 at 16:24
  • I think the dimension of $M_i$ must be $m_{i+1} \times m_i$. – ProAmateur Mar 25 '20 at 20:36

1 Answer

  1. Each SGD iteration involves computing $\nabla_{\theta} \ell(f_{\theta}(x_j)) = d\ell(f_{\theta}(x_j)) \nabla_{\theta}f_{\theta}(x_j)$ for some (randomly chosen) index $j \in \{1,\cdots,N\}$. Therefore, the cost of this step equals the cost of evaluating $f_{\theta}(x_j)$ and its gradient. The cost of evaluating $f_{\theta}(x_j)$ is $O(m_1m_2 + m_2m_3 + \cdots + m_{n}m_{n+1})$. To see this, note that evaluating $\sigma \circ A_1(x)$ involves multiplying an $m_2 \times m_1$ matrix with a vector of dimension $m_1$, adding two vectors of dimension $m_2$, and applying $\sigma$ to each of the $m_2$ components. During this computation, we can store the intermediate values $\sigma \circ A_1(x)$, $\sigma \circ A_2(\sigma \circ A_1(x))$, and so on. From automatic differentiation, we know that the cost of the gradient evaluation is a constant factor times this quantity; see Theorem 4.1 in this reference, for example. If the minibatch size is $B$, then the per-iteration cost of minibatch SGD is $B$ times this quantity (assuming that gradients and function values are not reused even if duplicate choices of $j$ appear in the minibatch; even otherwise, the cost is roughly the same on average).
  2. Let $g_i(\theta) := \nabla_{\theta} \ell(f_{\theta}(x_i))$. Then the gradient at $\theta$ equals $\frac{1}{N}\sum_{i=1}^{N} g_i(\theta)$. From Corollary 2.2 of this paper, the "level of suboptimality" (measured in terms of the norm of the gradient for nonconvex problems) after iteration $T$ is $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{T}}\right)$, where $\sigma > 0$ is the standard deviation of the stochastic gradient (sorry for the clash in notation). From basic statistics, we know that the variance of $g_j(\theta)$ for $j$ chosen uniformly at random from $\{1,\cdots,N\}$ does not depend on the number of samples $N$. In other words, if we use a single random draw from a probability distribution to estimate its mean, then the variance of this draw-based estimate does not depend on the size of the population. Therefore, the rate of convergence of the stochastic gradient method does not depend on the sample size $N$. If we use a minibatch of $B$ i.i.d. draws, then the standard deviation shrinks by a factor of $\sqrt{B}$ (again from basic statistics on the sample mean), so the rate of convergence becomes $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{BT}}\right)$. Note that a full gradient evaluation gives $\sigma = 0$ and the faster rate $O\left(\dfrac{1}{T}\right)$. In some cases we don't even have this gap between stochastic and deterministic methods; see the analyses in Sections 3.3 and 3.4 here, for instance.
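The variance claim in point 2 can be checked numerically. The sketch below (all names are illustrative) treats an array `g` as a stand-in for the per-example gradients $g_1,\dots,g_N$ and verifies that the size-$B$ minibatch average has variance $\sigma^2/B$, where $\sigma^2$ is the variance of a single uniform draw, independent of $N$.

```python
import numpy as np

def minibatch_mean_variance(g, B, trials=20000, seed=0):
    # empirical variance of the size-B minibatch average, where each
    # minibatch entry is a uniform i.i.d. draw from {g_1, ..., g_N}
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(g), size=(trials, B))
    return g[idx].mean(axis=1).var()

g = np.arange(10.0)   # stand-in for per-example gradients g_i
single = g.var()      # sigma^2: variance of one uniform draw from g
est = minibatch_mean_variance(g, B=4)
# est should be close to single / 4, matching the sqrt(B) shrinkage
# of the standard deviation discussed above
```

Scaling `g` to more entries with the same spread leaves `single` essentially unchanged, which is the sense in which the rate does not depend on $N$.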
ProAmateur
  • But what is the cost of a single update of $f_{\theta}$? For example, if $\tilde{m}_i = 2m_i$ (in the definition of our architecture), then clearly the "cost" of performing one single update should be at least doubled? –  Mar 25 '20 at 20:19
  • @ProbablyAHuman To evaluate the cost of computing $\nabla_{\theta}f_{\theta}(x_j)$ in terms of elementary operations, use the chain rule to expand this gradient out and count the additions. If the dimensions are increased, the number of additions will also increase, as you mention. I will add more details soon – ProAmateur Mar 25 '20 at 20:23