I am unable to understand when to use ReLU, Leaky ReLU and ELU. How do they compare to other activation functions (like the sigmoid and the tanh), and what are their pros and cons?
2 Answers
Look at this ML glossary:
ELU
ELU is very similar to ReLU except for negative inputs: both are the identity function for non-negative inputs. For negative inputs, ELU smoothly saturates until its output equals $-\alpha$, whereas ReLU is sharply cut off at zero. A minimal sketch comparing the two follows the pros/cons lists below.
Pros
- ELU smoothly saturates to $-\alpha$ for negative inputs, whereas ReLU has a sharp kink at zero.
- ELU is a strong alternative to ReLU.
- Unlike ReLU, ELU can produce negative outputs.
Cons
- For $x > 0$, it can blow up the activation, since the output range is $[0, \infty)$.
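A minimal NumPy sketch of these two definitions (the sample inputs and the default $\alpha = 1$ are illustrative choices, not from the glossary):

```python
import numpy as np

def relu(x):
    # Identity for x >= 0, zero for x < 0.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for x >= 0; for x < 0 it smoothly saturates towards -alpha.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))  # [0. 0. 0. 1. 5.]
print(elu(x))   # approximately [-0.993 -0.632  0.     1.     5.   ]
```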
ReLU
Pros
- It avoids and rectifies the vanishing gradient problem.
- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
Cons
- One of its limitations is that it should only be used within hidden layers of a neural network model.
- Some gradients can be fragile during training and can die: a weight update can push a unit into a state where it never activates on any data point again. In other words, ReLU can result in dead neurons.
- Put differently, for activations in the region $x < 0$ the gradient of ReLU is 0, so the corresponding weights are not adjusted during gradient descent. Neurons that fall into this state stop responding to variations in error or input (since the gradient is 0, nothing changes). This is called the dying ReLU problem (see the sketch after this list).
- The range of ReLU is $[0,\infty)$, which means it can blow up the activation.
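To make the zero-gradient region concrete, here is a small illustrative sketch (the pre-activation values are made up for the example):

```python
import numpy as np

def relu_grad(x):
    # ReLU gradient: 1 for x > 0, 0 for x < 0 (frameworks pick a
    # convention such as 0 for the kink at x = 0).
    return (x > 0).astype(float)

# A unit whose pre-activations are always negative receives zero gradient,
# so its incoming weights are never updated: the "dying ReLU" problem.
pre_activations = np.array([-3.2, -0.7, -1.5])
print(relu_grad(pre_activations))  # [0. 0. 0.]
```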
Leaky ReLU
Leaky ReLU is a variant of ReLU. Instead of being 0 when $z < 0$, a leaky ReLU allows a small, non-zero, constant gradient $\alpha$ (normally $\alpha = 0.01$). However, the consistency of the benefit across tasks is presently unclear. [1] A minimal sketch follows the pros/cons lists below.
Pros
- Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
Cons
- As it possesses linearity, it can't be used for complex classification; it lags behind sigmoid and tanh for some use cases.
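A minimal sketch of the leaky variant with the commonly used $\alpha = 0.01$ (sample inputs are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for x >= 0; a small linear slope alpha for x < 0,
    # so the gradient never becomes exactly zero on the negative side.
    return np.where(x >= 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
```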
Further reading
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He et al. (2015)
OmG
<ReLU>
Pros:
- It mitigates the Vanishing Gradient Problem.
Cons:
- It causes the Dying ReLU Problem.
- It's non-differentiable at $x = 0$.
<Leaky ReLU>
Pros:
- It mitigates the Vanishing Gradient Problem.
- It mitigates the Dying ReLU Problem. *0 is still produced for an input of 0, so the Dying ReLU Problem is not completely avoided.
Cons:
- It's non-differentiable at $x = 0$.
<ELU>
Pros:
- It produces negative outputs that push mean activations closer to zero, so convergence with negative input values is more stable.
- It mitigates the Vanishing Gradient Problem.
- It mitigates the Dying ReLU Problem. *0 is still produced for an input of 0, so the Dying ReLU Problem is not completely avoided.
Cons:
- It's computationally expensive because of the exponential operation.
- It's non-differentiable at $x = 0$ if $\alpha$ is not 1 (see the gradient sketch below).
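A small sketch of the (sub)gradients of the three activations, which is where the differentiability and dying-ReLU points above come from (the $\alpha$ values and sample inputs are illustrative):

```python
import numpy as np

def relu_grad(x):
    # Zero gradient for the whole negative side: dead units stay dead.
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    # Small but non-zero gradient for x < 0.
    return np.where(x > 0, 1.0, alpha)

def elu_grad(x, alpha=1.0):
    # d/dx of alpha * (exp(x) - 1) is alpha * exp(x) for x < 0;
    # the two one-sided derivatives at x = 0 match only when alpha = 1.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
print(elu_grad(x))         # approximately [0.135 0.607 1.    1.   ]
```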
Super Kai - Kazuya Ito