
This may seem like a very simple and obvious question, but I haven't actually been able to find a direct answer.

Today, in a video explaining deep neural networks, I came across the term Squashing function. This is a term that I have never heard or used. Our professor always used the term Activation function instead. Given the definitions I've been able to find, the two seem to be interchangeable terms.

Are they really synonymous or is there a difference?

Mate de Vita

4 Answers

Activation functions like the sigmoid and the hyperbolic tangent are also called squashing functions because they squash their input into a small range: the sigmoid's output lies in (0, 1) and tanh's lies in (-1, 1). You cannot call ReLU a squashing function, though, because for any positive input it returns that value unchanged, so its output is unbounded.
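To illustrate the difference numerically, here is a minimal NumPy sketch (my addition, not part of the original answer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output always in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # unbounded above: relu(100) == 100

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigmoid(x))   # squashed towards 0 and 1
print(np.tanh(x))   # squashed towards -1 and 1
print(relu(x))      # large positive inputs pass through unchanged
```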

Rajat Gupta

An activation function

This is the name given to the function that is applied to a neuron's weighted input to produce its output. It can refer to any of the well-known activation functions, such as the Rectified Linear Unit (ReLU), the hyperbolic tangent function (tanh), or even the identity function! Have a look at somewhere like the Keras documentation for a nice little list of examples.

We usually define the activation function as a non-linear function, as it is that property which gives a neural network its ability to approximate any function (given a few constraints). However, an activation function can also be linear, e.g. the identity function.
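As a small aside (my sketch, not part of the answer), here is why the non-linearity matters: with the identity activation, stacking layers collapses to a single linear map, so depth adds no expressive power.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Identity activation: two layers collapse into one linear map.
y_linear = W2 @ (W1 @ x)
y_single = (W2 @ W1) @ x
print(np.allclose(y_linear, y_single))     # True: no extra expressive power

# A non-linearity such as tanh breaks the collapse.
y_nonlinear = W2 @ np.tanh(W1 @ x)
print(np.allclose(y_nonlinear, y_single))  # False in general
```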

A squashing function

As far as I know, this can mean one of two things in the context of a neural network (the tag you added to the question), and the two uses are close, just differently applied.

The first and most commonplace usage is when people refer to the softmax function, which squashes the final layer's activations/logits into the range [0, 1]. This has the effect of allowing the final outputs to be directly interpreted as probabilities (i.e. they must sum to 1).
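For concreteness, a minimal softmax sketch (mine, not from the answer) showing the squashing into [0, 1] and the outputs summing to 1:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; mathematically unchanged.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, -3.0]))
print(probs)        # each entry lies in [0, 1]
print(probs.sum())  # 1.0, so they can be read as probabilities
```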

The second and newer usage of the term in the context of neural networks comes from the relatively recent papers (one and two) by Sara Sabour, Geoffrey Hinton, and Nicholas Frosst, which presented the idea of Capsule Networks. What these are and how they work is beyond the scope of this question; however, the term "squashing function" deserves special mention. Paper number one introduces it as follows:

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear "squashing" function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1.

That description makes it sound very similar indeed to the softmax!

This squashing function is defined as follows:

$$ v_j = \frac{||s_j||^2}{1 + ||s_j||^2} \cdot \frac{s_j}{||s_j||} $$

where $v_j$ is the vector output of capsule $j$ and $s_j$ is its total input.
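As a rough sketch (my own, assuming a single capsule vector rather than the batched per-capsule form used in the papers), the squashing function could be implemented like this, with a small epsilon guarding against division by zero:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Capsule 'squashing' non-linearity: shrinks short vectors towards zero
    length and long vectors towards (but never reaching) unit length,
    preserving direction."""
    squared_norm = np.sum(s ** 2)
    norm = np.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / (norm + eps)   # eps avoids division by zero

print(np.linalg.norm(squash(np.array([0.01, 0.0]))))   # ~0.0001, almost zero
print(np.linalg.norm(squash(np.array([10.0, 10.0]))))  # ~0.995, just below 1
```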

If this is all new to you and you'd like to learn more, I'd recommend having a read of those two papers, as well as perhaps a nice overview blog, like this one.

n1k31t4

So there is a formal definition of a squashing function in the paper by Hornik (1989); see Definition 2.3. The paper demonstrates that any neural net with a single hidden layer containing a sufficient number of nodes, where the activation function is a 'squashing' function, is a universal approximator. Given the context, I think this is what is meant by squashing function.

The definition given there is any function that is non-decreasing, with $\lim_{x\rightarrow \infty} f(x) = 1$ and $\lim_{x\rightarrow -\infty} f(x) = 0$. So ReLU is not a squashing function, because $\lim_{x\rightarrow \infty} \mathrm{ReLU}(x) = \infty \neq 1$.
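For example, the logistic sigmoid satisfies all three conditions:

$$ \sigma(x) = \frac{1}{1+e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) \ge 0, \qquad \lim_{x\rightarrow \infty} \sigma(x) = 1, \qquad \lim_{x\rightarrow -\infty} \sigma(x) = 0. $$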

NB: a net with ReLU activation functions is a universal approximator, but the proof in that paper doesn't apply to it.

Clumsy cat

The purpose of the activation function is to introduce nonlinearity into the network. On the other hand, squashing is the process of mapping the output of a neuron onto a limited range, typically between 0 and 1 or between -1 and 1, to prevent the output from becoming too large or too small. This can help to stabilize the network during training and prevent numerical issues. In this sense, the softmax can also be viewed as a squashing function.

So squashing and activation may or may not be the same thing. For example, the sigmoid function is often used as both an activation and a squashing function, while ReLU works only as an activation function.

darkstar