You are right that if we have some step size $\eta>0$ in mind, the gradient will not in general be the optimal direction to move in. Say that $l$ is our loss and $w\in\mathbb{R}^P$ are our neural network weights, with $P$ the number of parameters. Then there is in fact some optimal direction:
$$
\delta^* = \underset{\delta\in\mathbb{R}^P,\Vert\delta\Vert=1}{\textrm{argmin}} \, l(w- \eta\delta) \,,
$$
and this will give a loss at least as good as the (normalized) gradient descent direction with step size $\eta$, that is, $l(w-\eta\delta^*)\leq l\!\left(w-\eta\frac{\nabla_w l(w)}{\Vert\nabla_w l(w)\Vert}\right)$. If $\eta$ is sizable, it can be much better!
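To make this concrete, here is a small numerical sketch (my own construction, not part of the argument above): an ill-conditioned two-dimensional quadratic where, for a sizable $\eta$, a brute-force search over unit directions finds a step that beats the normalized gradient by a wide margin.

```python
import numpy as np

# Toy example (illustrative only): an ill-conditioned quadratic loss
# l(w) = 0.5 * w^T A w in two dimensions, where for a sizable eta the best
# unit-norm direction delta* is much better than the normalized gradient.
A = np.diag([1.0, 100.0])                 # curvatures 1 and 100: condition number 100
l = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = np.array([10.0, 1.0])
eta = 2.0                                 # a deliberately large step size

# Loss after stepping along the normalized gradient.
g = grad(w)
loss_gd = l(w - eta * g / np.linalg.norm(g))

# Brute-force delta* over a fine grid of angles -- feasible only because
# this example lives in 2 dimensions.
thetas = np.linspace(0.0, 2.0 * np.pi, 10_000)
deltas = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit vectors
losses = np.array([l(w - eta * d) for d in deltas])

print(f"loss before any step:                {l(w):.2f}")
print(f"loss after normalized-gradient step: {loss_gd:.2f}")
print(f"loss after best unit-direction step: {losses.min():.2f}")  # much lower here
```

Here the normalized gradient points mostly along the steep coordinate and overshoots, while the best unit direction puts more of its length along the shallow coordinate.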
So why do we use the gradient instead of $\delta^*$? Because finding $\delta^*$ is about as hard as finding the optimal neural network parameters $w$ in the first place: to find $\delta^*$ we would still need to solve an optimization problem involving $l$ (a minimization over the unit sphere), which is essentially the kind of problem we are already trying to solve!
We use the gradient not because it is the optimal direction to move in for a fixed step size $\eta$ (it isn't), but because it can be computed cheaply. In many realistic problems, the gradient actually does a poor job of telling us which direction to move in for larger $\eta$; this is where the idea of "preconditioning" comes from, and it is (in part) what motivates the cottage industry of alternative first-order optimization methods that has proliferated in ML research, such as Adam, RMSprop, etc.
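To make "preconditioning" a little more concrete, here is a minimal RMSprop-style sketch (my own, not taken from any particular library; the names `rmsprop_step` and `grad_fn` are just for illustration): each coordinate of the gradient is rescaled by a running estimate of its typical squared magnitude, so that steep and shallow directions get comparable effective step sizes.

```python
import numpy as np

def rmsprop_step(w, grad_fn, state, eta=1e-3, beta=0.9, eps=1e-8):
    g = grad_fn(w)
    # Exponential moving average of squared gradients (per coordinate).
    state = beta * state + (1 - beta) * g**2
    # Preconditioned update: divide each coordinate by sqrt of its running average.
    w_new = w - eta * g / (np.sqrt(state) + eps)
    return w_new, state

# Usage on the ill-conditioned quadratic from the earlier sketch.
A = np.diag([1.0, 100.0])
grad_fn = lambda w: A @ w
w, state = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, state = rmsprop_step(w, grad_fn, state, eta=0.05)
print(w)  # both coordinates end up near 0; plain gradient descent with
          # eta=0.05 would diverge along the steep coordinate (1 - 0.05*100 = -4)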
By the way, why is it that the gradient is easily computed but this optimal step direction isn't? This is a manifestation of a general phenomenon about high-dimensional, complicated functions: only "local" information about such functions is tractable. The gradient depends only on the behavior of a function in an arbitrarily small neighborhood of a point, i.e. it is a local quantity. On the other hand, finding $\delta^*$ depends on the behavior of the function over an entire sphere of radius $\eta$ around $w$, and hence is generally intractable. For similar reasons, the integral of a complicated high-dimensional function is computationally intractable.
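As a rough numerical illustration of why that nonlocal search is hopeless in high dimensions (again, my own sketch): a random unit direction in $\mathbb{R}^P$ is nearly orthogonal to any fixed direction such as the gradient, with typical $|\cos|$ of order $1/\sqrt{P}$, so naively sampling the sphere tells you almost nothing per sample.

```python
import numpy as np

# In dimension P, a random unit vector has cosine similarity ~ 1/sqrt(P)
# with any fixed direction, e.g. the gradient direction.
rng = np.random.default_rng(0)
for P in [10, 1_000, 100_000]:
    g = rng.normal(size=P)
    g /= np.linalg.norm(g)                # stand-in for the gradient direction
    d = rng.normal(size=P)
    d /= np.linalg.norm(d)                # a random candidate unit direction
    print(P, abs(g @ d), 1 / np.sqrt(P))  # |cosine| is on the order of 1/sqrt(P)
```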