I am unable to understand when to use ReLU, Leaky ReLU and ELU. How do they compare to other activation functions (like the sigmoid and the tanh), and what are their pros and cons?
2 Answers
Look at this ML glossary:
ELU
ELU is very similar to ReLU except for negative inputs: both are the identity function for non-negative inputs. For negative inputs, ELU smoothly saturates until its output equals $-\alpha$, whereas ReLU is sharply cut off at zero. A minimal sketch comparing the two follows the pros/cons lists below.
Pros
- ELU smoothly saturates to $-\alpha$ for negative inputs, whereas ReLU has a sharp kink at zero.
- ELU is a strong alternative to ReLU.
- Unlike ReLU, ELU can produce negative outputs.
Cons
- For $x > 0$, it can blow up the activation, since the output range is $[0, \infty)$.
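A minimal NumPy sketch of these two definitions (the sample inputs and the default $\alpha = 1$ are illustrative choices, not from the glossary):

```python
import numpy as np

def relu(x):
    # Identity for x >= 0, zero for x < 0.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x >= 0; for x < 0 it smoothly saturates towards -alpha.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))  # [0. 0. 0. 1. 5.]
print(elu(x))   # approximately [-0.993 -0.632  0.     1.     5.   ]
```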
ReLU
Pros
- It avoids and rectifies the vanishing gradient problem.
- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
Cons
- One of its limitations is that it should only be used within hidden layers of a neural network model.
- Some gradients can be fragile during training and can die: a weight update can push a unit into a state where it never activates on any data point again. In other words, ReLU can result in dead neurons.
- Put differently, for activations in the region $x < 0$ the gradient of ReLU is 0, so the corresponding weights are not adjusted during gradient descent. Neurons that fall into this state stop responding to variations in error or input (since the gradient is 0, nothing changes). This is called the dying ReLU problem (see the sketch after this list).
- The range of ReLU is $[0,\infty)$, which means it can blow up the activation.
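To make the zero-gradient region concrete, here is a small illustrative sketch (the pre-activation values are made up for the example):

```python
import numpy as np

def relu_grad(x):
    # ReLU gradient: 1 for x > 0, 0 for x < 0 (frameworks pick a
    # convention such as 0 for the kink at x = 0).
    return (x > 0).astype(float)

# A unit whose pre-activations are always negative receives zero gradient,
# so its incoming weights are never updated: the "dying ReLU" problem.
pre_activations = np.array([-3.2, -0.7, -1.5])
print(relu_grad(pre_activations))  # [0. 0. 0.]
```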
Leaky ReLU
Leaky ReLU is a variant of ReLU. Instead of being 0 when $z < 0$, a leaky ReLU allows a small, non-zero, constant gradient $\alpha$ (normally $\alpha = 0.01$). However, the consistency of the benefit across tasks is presently unclear. [1] A minimal sketch follows the pros/cons lists below.
Pros
- Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
Cons
- As it possesses linearity, it can't be used for complex classification; it lags behind sigmoid and tanh for some use cases.
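A minimal sketch of the leaky variant with the commonly used $\alpha = 0.01$ (sample inputs are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for x >= 0; a small linear slope alpha for x < 0,
    # so the gradient never becomes exactly zero on the negative side.
    return np.where(x >= 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
```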
Further reading
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He et al. (2015)
OmG
<ReLU>
Pros:
- It mitigates the Vanishing Gradient Problem.
Cons:
- It causes the Dying ReLU Problem.
- It's non-differentiable at $x = 0$.
<Leaky ReLU>
Pros:
- It mitigates the Vanishing Gradient Problem.
- It mitigates the Dying ReLU Problem. *0 is still produced for an input of 0, so the Dying ReLU Problem is not completely avoided.
Cons:
- It's non-differentiable at $x = 0$.
<ELU>
Pros:
- It produces negative outputs that push mean activations closer to zero, so convergence with negative input values is more stable.
- It mitigates the Vanishing Gradient Problem.
- It mitigates the Dying ReLU Problem. *0 is still produced for an input of 0, so the Dying ReLU Problem is not completely avoided.
Cons:
- It's computationally expensive because of the exponential operation.
- It's non-differentiable at $x = 0$ if $\alpha$ is not 1 (see the gradient sketch below).
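A small sketch of the (sub)gradients of the three activations, which is where the differentiability and dying-ReLU points above come from (the $\alpha$ values and sample inputs are illustrative):

```python
import numpy as np

def relu_grad(x):
    # Zero gradient for the whole negative side: dead units stay dead.
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    # Small but non-zero gradient for x < 0.
    return np.where(x > 0, 1.0, alpha)

def elu_grad(x, alpha=1.0):
    # d/dx of alpha * (exp(x) - 1) is alpha * exp(x) for x < 0;
    # the two one-sided derivatives at x = 0 match only when alpha = 1.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
print(elu_grad(x))         # approximately [0.135 0.607 1.    1.   ]
```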
Super Kai - Kazuya Ito