Here is a minimization procedure I've "dreamed up." I'm hoping to gain a better understanding of its mathematical properties and practical efficiency.
Given a (locally) convex function $f(x):{\mathbb{R}}^n \to \mathbb{R}$, an initial point $x_1$, an initial step size $a_1$, and a tolerance $\delta$, set $k=1$ and repeat:
1. If $\lVert\nabla f(x_k )\rVert<\delta$, return $x_k$; otherwise:
2. Pick step direction $d_k \equiv -\nabla f(x_k )/\lVert\nabla f(x_k )\rVert$.
3. Pick step size $a_k$.
4. Let $x_{k+1} \equiv x_k +a_k d_k$.
5. Let $a_{k+1} \equiv a_1 /(k+1)$.
6. Let $k\equiv k+1$ and return to step 1.
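In MATLAB, the general $n$-dimensional loop would look something like this (a minimal sketch; the name minimize_nd is just illustrative, and it assumes the gradient is available analytically as a function handle rather than computed numerically):

function xk = minimize_nd(gradf, x1, a1, delta)
% Normalized gradient steps with diminishing step size a_k = a1/k.
xk = x1;
k = 1;
g = gradf(xk);
while norm(g) >= delta
    xk = xk + (a1/k) * (-g / norm(g));  % unit direction, step size a1/k
    k = k + 1;
    g = gradf(xk);
end
end

% e.g., for f(x) = x'*x, whose gradient is 2x:
% minimize_nd(@(x) 2*x, [3; -4], 1, 1e-3)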
Most optimization procedures require some kind of line search after the step direction is chosen, but this algorithm avoids that computation by simply choosing an arbitrary $a_1$ and letting the step size decrease as the iteration proceeds. Since
$$a_k =\frac{a_1 }{k},$$
the step size approaches $0$ in the limit $k\to \infty$, so successive iterates move arbitrarily little (which suggests, though does not by itself prove, that the sequence of iterates $\left\{ x_k \right\}$ is convergent). On the other hand, since the sum
$$\sum_{k=1}^{\infty } a_k =a_1 \sum_{k=1}^{\infty } \frac{1}{k}$$
is divergent, the cumulative sum of the step sizes is infinite, so, assuming convexity, we will never get "stuck" at a point $x$ far from the minimizer $x^*$. (I am unsure of how to prove this formally.)
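(A sketch of what I think the formal argument would look like, borrowed from standard subgradient-method analysis: expanding the squared distance to a minimizer $x^*$ gives
$$\lVert x_{k+1} -x^* \rVert^2 =\lVert x_k -x^* \rVert^2 -\frac{2a_k }{\lVert \nabla f(x_k )\rVert}\nabla f(x_k )^{\top} (x_k -x^* )+a_k^2 ,$$
and convexity gives $\nabla f(x_k )^{\top} (x_k -x^* )\ge f(x_k )-f(x^* )\ge 0$, so the distance to $x^*$ shrinks whenever the middle term dominates $a_k^2$; combining this with $a_k \to 0$ and $\sum_k a_k =\infty$ appears to be the usual route to a proof.)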
These two properties (vanishing step size, divergent sum) also hold for a more general algorithm where, in step 5, we let $a_{k+1} \equiv a_1 /(k+1)^t$ with $t\in (0,1]$.
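(Both properties reduce to the standard $p$-series facts:
$$\sum_{k=1}^{\infty } \frac{1}{k^t} =\infty \;\text{ for } t\le 1,\qquad \sum_{k=1}^{\infty } \frac{1}{k^t} <\infty \;\text{ for } t>1,$$
while $a_1 /k^t \to 0$ for every $t>0$, so $t\in (0,1]$ is exactly the range on which the step sizes vanish and their sum still diverges.)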
Is there a name for this optimization procedure? What are its convergence properties? How should one select the initial values $x_1$ and $a_1$ in the general case?
Here is a proof-of-concept implementation in MATLAB. Since the gradient has to be computed numerically, I have it evaluate the objective on a 3-point-per-axis grid of half-width Nsize centered on $x_k$ and take the numerical gradient of those samples. Nsize is initialized to 0.01 and shrinks by a factor of $k$ at each iteration, which prevents cycling.
[x, y] = minimize2d(@obj, -1.34, 1.79, 1, 0.01, 10e-15);
x_star = x(end)
y_star = y(end)
f_star = obj(x_star, y_star)
[x_plot, y_plot] = meshgrid(linspace(-1.6, 0.3, 51),linspace(.9, 1.9, 51));
z_plot = obj(x_plot, y_plot);
contour(x_plot, y_plot, z_plot, 10)
hold on
plot(x, y, "-k")
scatter(x_star, y_star)
hold off
function f = obj(x, y)
f = 4*x.^2 + exp(1.5*y) + exp(-y) - 10*y;
end
function [x, y] = minimize2d(fun, x0, y0, a0, Nsize, tol)
x = x0; y = y0; a = a0;
grad_magnitude = tol + 1;  % ensure the loop body runs at least once
i = 1;
while grad_magnitude > tol
    a = a0 / i;          % diminishing step size a_k = a_1/k
    Nsize = Nsize / i;   % shrink the sampling neighborhood each iteration
    % Sample the objective on a 3x3 grid centered on the current iterate.
    [xN, yN] = meshgrid(linspace(x(i)-Nsize, x(i)+Nsize, 3), ...
        linspace(y(i)-Nsize, y(i)+Nsize, 3));
    f = fun(xN, yN);
    % Numerical gradient at the grid center (2,2). gradient() assumes unit
    % spacing, so px and py are the partial derivatives scaled by Nsize;
    % the scaling cancels in the step direction but does make the stopping
    % test stricter as Nsize shrinks.
    [px, py] = gradient(f);
    g = [px(2,2), py(2,2)];
    grad_magnitude = norm(g);
    step = -a * g / grad_magnitude;  % unit direction scaled by a_k
    x(i+1) = x(i) + step(1);
    y(i+1) = y(i) + step(2);
    i = i + 1;
end
nit = i  % display the iteration count
end
Output:
nit = 16
x_star = -7.5968e-06
y_star = 1.2651
f_star = -5.6986
