I was reading the MSE question "Why is the gradient the direction of steepest ascent", and I came across Jonathan's answer, which is copied here:
Consider a Taylor expansion of this function, $$f({\bf r}+{\bf\delta r})=f({\bf r})+(\nabla f)\cdot{\bf\delta r}+\ldots$$ The linear correction term $(\nabla f)\cdot{\bf\delta r}$ is maximized when ${\bf\delta r}$ is in the direction of $\nabla f$.
I understand that, from here, we would maximize $(\nabla f)\cdot{\bf\delta r}$ in the standard way, using the Cauchy-Schwarz inequality. However, I don't understand the first part of the argument. Why would maximizing the linear term maximize the directional derivative? The only justification I can think of is a heuristic one, along the lines of "if we want to maximize the directional derivative, we want to maximize $f({\bf r}+{\bf\delta r})$, and since the linear term has more impact than any of the others, we should focus on maximizing it." However, this doesn't really make sense to me. Yes, the linear term has the biggest impact when compared to each of the other terms individually. But it's not clear at all that it has the biggest impact when compared to the sum of all of the other terms.
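
For concreteness, here is the Cauchy-Schwarz step that I do follow, written out as a rough sketch. The fixed step length $\|{\bf\delta r}\|=\epsilon$ and the condition $\nabla f\neq{\bf 0}$ are my own assumptions about what is being held constant; they are not stated in the quoted answer.

$$(\nabla f)\cdot{\bf\delta r}\;\le\;\|\nabla f\|\,\|{\bf\delta r}\|\;=\;\epsilon\,\|\nabla f\|,\qquad\text{with equality exactly when}\quad{\bf\delta r}=\epsilon\,\frac{\nabla f}{\|\nabla f\|}.$$

So among all steps of length $\epsilon$, the linear term alone is largest in the direction of $\nabla f$. My question is only about why it is enough to look at the linear term.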