
I have created a neural network to classify the MNIST handwritten digits dataset. It uses softmax as the activation function for the output layer and various other functions for the hidden layer.

My implementation, written with the help of this question, seems to pass the gradient checks for all activation functions, but in an actual run on my training data (an exemplary run of 10 iterations) I get an accuracy of about 87% if I use sigmoid or tanh as the hidden-layer activation, whereas cosine gives an accuracy of only 9%. Training the network for more iterations (100, 200, 500) does not help either; in fact my minimization function never manages to push the cost below 2.18xxx, no matter how many epochs pass.

Is there some pre-processing step that I need to perform before using cosine, and if not, why does this activation function work so badly?
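To give an idea of what I mean by the gradient check, here is a simplified NumPy sketch (not my full network code) that compares the analytic derivative of cosine, $-\sin(x)$, against central finite differences:

```python
import numpy as np

def cosine(x):
    return np.cos(x)

def cosine_grad(x):
    # analytic derivative of cos(x)
    return -np.sin(x)

# central finite-difference check on a few random points
x = np.random.randn(5)
eps = 1e-5
numeric = (cosine(x + eps) - cosine(x - eps)) / (2 * eps)
assert np.allclose(numeric, cosine_grad(x), atol=1e-6)
```

This kind of check passes for cosine just as it does for sigmoid and tanh, so the gradient itself does not seem to be the problem.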

ealione

1 Answer


Cosine is not a commonly used activation function.

It is not listed on the Wikipedia page describing common activation functions.

And one of the desirable properties of activation functions described on that page is:

Approximates identity near the origin: When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.

$\cos(0) = 1$, so a basic cosine function does not have this property. Combined with its periodic nature, this makes it look like it could be particularly tricky to get the starting conditions and other hyper-parameters right in order for a network to learn whilst using it.
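To see why, expand cosine around the origin: there is no linear term, and the local gradient vanishes,

$$\cos(x) = 1 - \frac{x^2}{2} + O(x^4), \qquad \left.\frac{d}{dx}\cos(x)\right|_{x=0} = -\sin(0) = 0.$$

So with small initial weights every hidden unit outputs a value close to $1$ and passes back an almost-zero gradient, whereas $\tanh(x) \approx x$ near the origin.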

In addition, cosine is not monotonic, which means the error surface is likely to be more complex than for e.g. sigmoid.

I suggest trying a low learning rate and initialising all the bias values to $-\frac{\pi}{2}$. Maybe reduce the variance of the initial weights a little too, just to start off with things close to zero. Essentially this is starting with $\sin()$. Caveat: I have not tried this myself, it is just an educated guess, so I would be interested to know whether it helps at all with stability.
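Something along these lines, as a NumPy sketch (untested, with made-up layer sizes, just to illustrate the suggested initialisation):

```python
import numpy as np

def init_cosine_layer(n_in, n_hidden, rng):
    # smaller-than-usual variance so pre-activations start near zero
    W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
    # bias of -pi/2 shifts cos(Wx + b) towards sin(Wx), which does
    # approximate the identity near the origin
    b = np.full(n_hidden, -np.pi / 2)
    return W, b

def cosine_forward(X, W, b):
    Z = X @ W + b            # pre-activation
    return np.cos(Z), Z      # activation, plus cache for the backward pass

def cosine_backward(dA, Z):
    # d/dz cos(z) = -sin(z)
    return dA * -np.sin(Z)

rng = np.random.default_rng(0)
W, b = init_cosine_layer(784, 100, rng)   # e.g. MNIST input, 100 hidden units
```

Combine that with a learning rate an order of magnitude or so lower than whatever works for sigmoid/tanh.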

Neil Slater