I am trying to understand the SELU activation function, and I was wondering: why do deep learning practitioners keep using ReLU, with all its issues, instead of SELU, which enables a neural network to converge faster and internally normalizes each layer?
2 Answers
ReLU is quick to compute, and also easy to understand and explain. But I think people mainly use ReLU because everyone else does. The activation function doesn't make that much of a difference, and proving or disproving that requires adding yet another dimension of hyperparameter combinations to try.
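To make the "quick to compute, easy to explain" point concrete, here is a minimal NumPy sketch (not tied to any particular framework) of ReLU and its gradient:

```python
import numpy as np

def relu(x):
    # ReLU is just an element-wise threshold at zero: max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # The gradient is a 0/1 mask, which is part of why ReLU is so cheap in backprop
    return (x > 0).astype(float)
```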
If the research is for a paper, there is another consideration: you will want to stick with what your benchmarks use and what everyone else is doing, unless the research is specifically about activation functions.
(As an aside, I see practically no research on the pros or cons of using different activation functions at different layers. I suspect this is also because of the hyperparameter combinatorial explosion, combined with the expectation of it not making much difference.)
The SELU function is a hard sell in a couple of ways. First, it requires reading a long paper to understand it, and to accept the couple of magic numbers it comes with. But a bigger factor might be that it does internal normalization, meaning you don't need your batch or layer normalization any more. Or do you? Suddenly this is not a simple swap-in for ReLU, but something that affects other parts of the architecture.
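For reference, here is a minimal NumPy sketch of SELU with the two fixed constants derived in the original paper (Klambauer et al., 2017); note that the self-normalizing property also assumes things like LeCun-normal weight initialization, which is part of why it is not a pure drop-in swap:

```python
import numpy as np

# The two "magic numbers" derived in the SELU paper (Klambauer et al., 2017)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    x = np.asarray(x, dtype=float)
    # Scaled exponential linear unit:
    #   lambda * x                    for x > 0
    #   lambda * alpha * (exp(x)-1)   for x <= 0
    neg = ALPHA * np.expm1(np.minimum(x, 0.0))  # clamp to avoid overflow in exp
    return LAMBDA * np.where(x > 0, x, neg)
```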
This is a good article on a large selection of alternative activation functions: https://mlfromscratch.com/activation-functions-explained/ The con they list for SELU is that there are not yet enough comparative research papers on it across different architectures.
I want to answer this question from my experience with scientific papers. The point is that when practitioners propose new ideas, they need an ablation study in their work: they have to convince readers and reviewers that the claimed improvements are real, and they want to concentrate on the novelty of their own contribution. This means that in papers, and probably in their implementations, you usually do not see state-of-the-art modules everywhere else; they rely on simple, well-understood modules and try to show the effectiveness of their own newly designed or modified module(s). This is why, even though everyone knows there are many optimisation approaches that are better than Adam, numerous novel papers still use Adam for optimisation.
On the other hand, if a scientist uses a typical state-of-the-art approach alongside their novel approach, they have to prove that the improvement is not solely due to the state-of-the-art components that are separate from their original work. This means the ablation study has to be longer, and students usually avoid that.