Why does this help? Part of the reason lies in the properties of the activation function.
Most activation functions have their most interesting behavior around 0. For instance, the ReLU activation function switches from $f(x)=0$ to $f(x)=x$ at $x=0$. The sigmoid activation function has most of its interesting behavior near $x=0$, and plateaus for large positive and large negative values of $x$. Therefore, it's useful to have the inputs to the activation function centered at 0: transforming the inputs to have mean 0 ensures that most values land near the interesting part of the activation function.
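To make that plateau concrete, here's a small illustrative sketch (using NumPy; not from the original answer) that evaluates the sigmoid and its gradient at a few points. Near 0 the gradient is substantial; far from 0 it's essentially zero, so learning signals barely propagate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Near x = 0 the sigmoid is steep (gradient 0.25); far from 0 it is essentially flat.
for x in [0.0, 2.0, 10.0, 100.0]:
    print(f"x = {x:6.1f}   sigmoid(x) = {sigmoid(x):.6f}   gradient = {sigmoid_grad(x):.2e}")
```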
Suppose you didn't transform the inputs, and the inputs had values in the range [100, 101]. Then with reasonable initial values of the weights, the input to the activation function would be some huge value (something in the vicinity of 100 times the number of inputs to that neuron). As a result, the activation function would be operating in a part of the input space where it does nothing interesting (it's either flat or it's linear), and you wouldn't be taking advantage of the power of the neural network -- it'd be like trying to classify using a neural network where you've omitted the activation function entirely. That's not going to work well: the activation function is an essential part of the neural network.
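Here's a hypothetical sketch of that effect (the weight initialization and sizes are assumptions for illustration, not something specified above): with unscaled inputs the pre-activation value is typically far from 0, while simply centering the same inputs keeps it small.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 50
x_unscaled = rng.uniform(100.0, 101.0, size=n_inputs)   # raw inputs in [100, 101]
x_centered = x_unscaled - x_unscaled.mean()             # same inputs, shifted to mean 0

# An assumed "reasonable" random initialization: small weights around 0.
w = rng.normal(0.0, 0.1, size=n_inputs)

print("pre-activation, unscaled inputs:", w @ x_unscaled)   # typically far from 0 (tens or more)
print("pre-activation, centered inputs:", w @ x_centered)   # typically close to 0
```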
Now with sufficient learning, the neural network could adapt to the lack of scaling by choosing suitable weights that basically do the rescaling for you. But initially the network is going to do very poorly, because the initial weights will be poorly suited to the unscaled inputs, and the optimization method (stochastic gradient descent) will have a hard time figuring out how to choose better weights: you're so far away from a good part of the weight space that any small change to the weights makes little improvement to overall performance.
So, rescaling is a heuristic that tends to make stochastic gradient descent more effective, and helps training converge more quickly and more reliably.
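One common way to do this rescaling is to standardize each input feature to mean 0 and unit variance. Here's a minimal sketch (the helper name `standardize` is mine, not from the text); note that the mean and standard deviation are computed on the training set and then reused for any later data:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Rescale each feature (column) to mean 0 and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# Fit the scaling on training data...
X_train = np.random.default_rng(0).uniform(100.0, 101.0, size=(1000, 5))
X_train_scaled, mean, std = standardize(X_train)

# ...then apply the *same* transform to new data before feeding it to the network.
X_test = np.random.default_rng(1).uniform(100.0, 101.0, size=(200, 5))
X_test_scaled = (X_test - mean) / (std + 1e-8)
```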