3

Sorry in advance if this question sounds too broad or a little bit too obvious.

I know for sure that gradient descent, i.e., the update equation

$x_{k+1} = x_k - \epsilon_k \nabla f(x_k)$

converges to the unique minimizer of $f$ with domain $\text{dom}(f) = \mathbb{R}^n$ whenever $f$ is strictly or strongly convex.
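To make sure we are talking about the same iteration, here is the update written out as a tiny sketch (the strongly convex quadratic and the constant step size below are just placeholders I picked for illustration):

```python
# Minimal sketch of x_{k+1} = x_k - eps_k * grad f(x_k).
# The quadratic objective and the step-size rule are placeholders.
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters=1000):
    """Iterate x_{k+1} = x_k - step_size(k) * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        x = x - step_size(k) * grad_f(x)
    return x

# f(x) = 0.5 * ||x||^2 is strongly convex with unique minimizer 0.
print(gradient_descent(lambda x: x, x0=[5.0, -3.0], step_size=lambda k: 0.1))
```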

However, I cannot remember whether it converges to a minimizer for general convex functions, nor how it achieves this convergence.

What is bothering me is that

  1. I've seen some conflicting results where, instead of $x_k$, an averaged sequence $\hat x_{K} = \frac{1}{K} \sum_{k=1}^{K} x_k$ converges.

  2. I've also seen conflicting results where the step size is decreasing (e.g., $o(1/k)$) versus constant.

  3. There is also the issue of weak vs strong convergence. I'm not sure what this means exactly.

I know some results, but they are for quadratic functions, not for convex functions in general.

Can someone chime in on what this basic result in optimization looks like?

  • Regarding (3): I think that weak convergence only makes a difference if you are working in an infinite-dimensional vector space, but here you specified $\mathrm{dom}(f) = \mathbb{R}^n$. – VHarisop Jan 19 '21 at 23:05
  • 2
    For convergence of gradient descent, one typically assumes that $\nabla f$ is Lipschitz continuous. Without this assumption, you might fail to have convergence (imho). – gerw Jan 20 '21 at 06:32
  • @gerw smoothness is enough to have $\nabla f(x_k) \to 0$. – Red shoes Mar 09 '24 at 00:09

1 Answer

2

In https://stanford.edu/~boyd/papers/pdf/monotone_primer.pdf, section 5, subsection "Gradient method" or "Forward step method" shows a proof for functions that are strongly convex with Lipschitz gradient, using constant step sizes.
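As a numerical companion to that reference (only a sketch, not the proof; the matrix, starting point, and iteration count are arbitrary choices of mine), constant-step gradient descent on a strongly convex quadratic with Lipschitz gradient behaves as advertised:

```python
# Sketch: f(x) = 0.5 * x^T A x with A positive definite, so f is strongly
# convex, grad f(x) = A x, and grad f is Lipschitz with L = lambda_max(A).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite
L = np.linalg.eigvalsh(A).max()       # Lipschitz constant of grad f
x = np.array([10.0, -7.0])

for k in range(200):
    x = x - (1.0 / L) * (A @ x)       # constant step size 1/L

print(x)                              # essentially the unique minimizer x* = 0
```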

If the function is convex with a Lipschitz gradient, then https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf shows a proof of convergence with constant step sizes.
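Again just a sketch of mine (the function and constants are not from those notes): $f(x) = x_1^2$ on $\mathbb{R}^2$ is convex but not strongly convex, its gradient is Lipschitz with $L = 2$, and every point with $x_1 = 0$ is a minimizer. A constant step size of $1/L$ still converges, but to a minimizer rather than a unique one:

```python
# Sketch: f(x) = x_1^2 on R^2 is convex (not strongly convex),
# grad f(x) = (2*x_1, 0) is Lipschitz with L = 2, and the whole line
# x_1 = 0 is the set of minimizers.
import numpy as np

L = 2.0
x = np.array([4.0, 3.0])
for k in range(100):
    grad = np.array([2.0 * x[0], 0.0])
    x = x - (1.0 / L) * grad          # constant step size 1/L

print(x)  # [0., 3.]: a minimizer, but which one you reach depends on x0
```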

For strictly convex functions whose gradient is not (globally) Lipschitz, gradient descent will not converge with constant step sizes. Try this very simple example: let $f(x) = x^{4}$. You will see that there is no constant step size for which gradient descent converges to $0$ from every initial condition. In this case, people use diminishing step sizes. I've never seen results about convergence of the average, but it might be what happens when you use a constant step size on functions that are strictly convex and don't have a Lipschitz gradient.
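Here is a quick sketch of that $x^4$ example so you can see both behaviours; the diminishing schedule $\epsilon_k = \epsilon/\sqrt{k+1}$ is just one choice of mine, not the only one that works:

```python
# f(x) = x**4, grad f(x) = 4 * x**3.  With a constant step eps, the starting
# point x0 = 1/sqrt(2*eps) just flips sign every iteration and never converges.
# A diminishing schedule from the same point does approach the minimizer 0,
# although slowly, because the quartic is very flat near 0.
import math

eps = 0.5
x0 = 1.0 / math.sqrt(2.0 * eps)
grad = lambda x: 4.0 * x ** 3

x = x0
for k in range(10):
    x = x - eps * grad(x)                          # constant step
print("constant step:   ", x)                      # still at +/- x0, no progress

x = x0
for k in range(200000):
    x = x - (eps / math.sqrt(k + 1)) * grad(x)     # diminishing step
print("diminishing step:", x)                      # slowly approaching 0
```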

Unless $x$ lives in an infinite-dimensional space, weak and strong convergence are the same, and I wouldn't worry about it. Here is some additional reading if you are interested: https://people.maths.bris.ac.uk/~mazag/fa17/lecture9.pdf

dgadjov
  • 1,349
  • 1
    I should have been more precise: gradient descent with a constant step size will not converge from every initial condition. Given $\epsilon$, if the initial condition is set to $x_{0} = \frac{1}{\sqrt{2\epsilon}}$, you will see that $x_k$ just alternates between $\frac{1}{\sqrt{2\epsilon}}$ and $\frac{-1}{\sqrt{2\epsilon}}$. – dgadjov Mar 16 '21 at 05:31
  • The third paragraph is very misleading. If there is no global gradient Lipschitz constant, then of course one cannot expect one stepsize to work for an arbitrary initial point. A more interesting question is: for each initial point, can we prove there exists $\alpha$ such that if $\epsilon_k\equiv\epsilon\leq \alpha$, gradient descent converges? – William Jun 04 '22 at 05:14
  • Not sure how the third paragraph is misleading. The statement is true, and it highlights the edge cases that cause some algorithms to fail. It would be misleading if I said those algorithms work and ignored the edge cases. Not really sure how your question is more interesting. If you know that there exists an $ – dgadjov Jun 04 '22 at 16:24
  • 1
    Yes, your statement is true if you want to say GD may not converge for all initial points with a single stepsize, but it does not mean GD fails on convex functions with locally Lipschitz gradient. It only implies that for different initial points you need different stepsizes (but the stepsize is still a constant given an initial point). In practice, you choose an initial point and then take a small enough constant stepsize, and if I can find an upper bound on the stepsize so that GD converges for all stepsizes no larger than this bound, then you cannot say GD fails. – William Jun 04 '22 at 19:26
  • 1
    "but it does not mean GD fails on convex function with locally Lipschitz gradient", who said that it does. I said that GD fails for (the class of all) strictly convex functions that aren't Lipschitz. You are talking about the class of convex functions that are locally Lipschitz. You are comparing apples with oranges. I can say GD fails cause I can show you examples where it doesn't work. I could have picked $f(x) = \frac{3}{2}|x|^{\frac{3}{2}}$ and it would fail with constant step size for any size choosen. – dgadjov Jun 04 '22 at 23:41
  • Thanks for your clarification. I misunderstood your point, indeed, if you consider function whose gradient is even not locally Lipschitz, then yes, there is a counterexample. – William Jun 05 '22 at 13:29