
Here the answer refers to the vanishing and exploding gradients that occur with sigmoid-like activation functions, but I think ReLU has a disadvantage of its own: its expected value. There is no upper bound on the output of ReLU, so its expected value is not zero. I remember the time before ReLU became popular, when tanh was preferred over sigmoid among machine learning experts. The reason was that the expected value of tanh is zero, which helped learning in deeper layers proceed more rapidly in a neural net. ReLU does not have this characteristic, so why does it work so well if we set its derivative advantage aside? Moreover, I suspect the derivative may also be affected, because the activations (the outputs of ReLU) are involved in calculating the update rules.
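
To make the expected-value point concrete, here is a quick NumPy check (a toy sketch, not tied to any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)  # zero-mean inputs

print(np.tanh(x).mean())          # close to 0: tanh outputs are zero-centered
print(np.maximum(0.0, x).mean())  # about 0.4: ReLU outputs are not zero-centered
```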

Green Falcon

2 Answers


The biggest advantage of ReLU is indeed the non-saturation of its gradient, which greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid / tanh functions (see the paper by Krizhevsky et al).
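
For concreteness, here is a minimal NumPy sketch of the saturation difference (the input values are arbitrary, just for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Sigmoid's gradient saturates: it goes to 0 for large |x|.
print(sigmoid(x) * (1.0 - sigmoid(x)))   # ~[5e-5, 0.10, 0.25, 0.10, 5e-5]

# ReLU's gradient does not saturate: it is 1 for every positive input.
print((x > 0).astype(float))             # [0, 0, 0, 1, 1]
```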

But it's not the only advantage. Here is a discussion of the sparsity effects of ReLU activations and the regularization they induce. Another nice property is that, compared to tanh / sigmoid neurons that involve expensive operations (exponentials, etc.), ReLU can be implemented by simply thresholding a matrix of activations at zero.
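
To illustrate the "simple thresholding" point, a minimal NumPy sketch (the matrix here is just random data):

```python
import numpy as np

activations = np.random.default_rng(1).normal(size=(4, 4))

# ReLU: just threshold the matrix of activations at zero -- no exponentials needed.
relu_out = np.maximum(activations, 0.0)

# Compare with sigmoid / tanh, which both require an exponential per element.
sigmoid_out = 1.0 / (1.0 + np.exp(-activations))
tanh_out = np.tanh(activations)
```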

But I'm not convinced that the great success of modern neural networks is due to ReLU alone. New initialization techniques such as Xavier initialization, dropout, and (later) batchnorm also played a very important role. For example, the famous AlexNet used ReLU and dropout.

So to answer your question: ReLU has very nice properties, though it is not ideal. But it truly proves itself when combined with other great techniques, which, by the way, solve the non-zero-centered problem that you've mentioned.

UPD: ReLU output is indeed not zero-centered, and this does hurt NN performance. But this particular issue can be tackled by other techniques, e.g. batchnorm, which normalizes the signal before the activation:

We add the BN transform immediately before the nonlinearity, by normalizing $x = Wu + b$. ... normalizing it is likely to produce activations with a stable distribution.
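
As a sketch of that placement (assuming PyTorch, with arbitrary layer sizes that are not from the quoted paper):

```python
import torch.nn as nn

# BatchNorm is inserted between the affine transform (Wu + b) and the ReLU,
# so the nonlinearity sees a normalized, roughly zero-centered signal.
block = nn.Sequential(
    nn.Linear(128, 64),   # x = Wu + b
    nn.BatchNorm1d(64),   # normalize x before the activation
    nn.ReLU(),
)
```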

Maxim

I think normalization issues are minor in numerical computation in general, and in machine learning in particular, as long as you do not encounter significant numerical overflow or underflow.

The expected-value issue is one of these normalization issues.

I know many people disagree with this claim. That is fine if you have the time to rescale all values to be close to 1 so that you get the best numerical accuracy.

I usually use ReLU and ignore all these normalization issues.

Youjun Hu