In neural networks, back-propagation is an implementation of the chain rule. However, the chain rule applies only to differentiable functions, and there is no chain rule that works in general for non-differentiable ones. It therefore seems that back-propagation is invalid when we use a non-differentiable activation function (e.g. ReLU).
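For concreteness, the function I have in mind is $\mathrm{ReLU}(x) = \max(0, x)$, whose derivative is

$$\frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases}$$

and is undefined at $x = 0$ (any value in $[0, 1]$ is a valid subgradient there).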
The usual justification offered for this apparent problem is that "the chance of hitting a non-differentiable point during learning is practically 0". It's not clear to me, though, that we actually need to land exactly on a non-differentiable point during learning for the chain rule to break down.
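To illustrate what actually happens at the kink, here is a minimal sketch using PyTorch (my own example, not part of the claim above): when the input lands exactly on the non-differentiable point, autograd does not fail; it just reports some fixed value, which in current PyTorch versions appears to be the subgradient 0.

```python
import torch

# Evaluate d/dx ReLU(x) via autograd at a point left of, exactly at, and right of the kink.
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.]) -- at x = 0 autograd reports 0, one valid subgradient
```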
Is there some reason why we should expect back-propagation to yield an estimate of the (sub)gradient? If not, why does training a neural network usually work?