
I'm running the Neural Network example written on BogoToBogo.

The program worked fine:

(array([0, 0]), array([  2.55772644e-08]))
(array([0, 1]), array([ 0.99649732]))
(array([1, 0]), array([ 0.99677086]))
(array([1, 1]), array([-0.00028738]))

The neural network learned XOR using tanh as the activation function (the default). However, I then changed the activation function to "sigmoid":

nn = NeuralNetwork([2,2,1], 'sigmoid')

Now the program outputs:

epochs: 0
...
epochs: 90000
(array([0, 0]), array([ 0.45784467]))
(array([0, 1]), array([ 0.48245772]))
(array([1, 0]), array([ 0.47365194]))
(array([1, 1]), array([ 0.48966856]))

The outputs for the 4 inputs are all near 0.5. This shows that the neural network (with the sigmoid function) didn't learn XOR.

I was expecting the program to output:

  • ~0 for (0, 0) and (1, 1)
  • ~1 for (0, 1) and (1, 0)

Can somebody explain why this example with sigmoid doesn't work with XOR?


1 Answer


I found the answer myself. The reason for the difference is that BogoToBogo's definition of the tanh derivative (tanh_prime) takes an argument to which the activation function has already been applied:

def tanh_prime(x):
    # x is expected to already be tanh(z), i.e. a post-activation value
    return 1.0 - x**2

while sigmoid_prime does not; it calls sigmoid on its argument:

def sigmoid_prime(x):
    # calls sigmoid itself, so it is only correct for a pre-activation value
    return sigmoid(x)*(1.0-sigmoid(x))

So the definition of sigmoid_prime looks mathematically more accurate than tanh_prime. Then why doesn't sigmoid work? Because the values the network passes to these derivative functions have already had the activation function applied to them.
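To see the mismatch concretely, here is a small check (a sketch using NumPy, with sigmoid and sigmoid_prime redefined locally rather than imported from the tutorial's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # tutorial-style definition: correct only if x is a pre-activation value
    return sigmoid(x) * (1.0 - sigmoid(x))

z = 2.0           # pre-activation value
a = sigmoid(z)    # post-activation value -- what the network actually passes in

print(sigmoid(z) * (1.0 - sigmoid(z)))  # ~0.105, the true derivative at z
print(sigmoid_prime(a))                 # ~0.207, wrong: sigmoid is applied twice
print(a * (1.0 - a))                    # ~0.105, correct when given the post-activation value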

Background

The derivatives of sigmoid ($\sigma$) and tanh share the same property: each can be expressed in terms of the original function itself.

$$ \frac{d\tanh(x)}{dx} = 1 - \tanh(x)^2 $$ $$ \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)) $$
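Writing $a = g(z)$ for the activated value, both derivatives can therefore be evaluated directly from $a$, without calling the activation function again:

$$ \tanh'(z) = 1 - \tanh(z)^2 = 1 - a^2, \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) = a(1 - a). $$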

When performing backpropagation to adjust the weights, a neural network applies the derivative ($g'$) to the pre-activation values, i.e. the values before the activation function is applied. In BogoToBogo's explanation, that is the variable $z^{(2)}$ in

$$ \delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \cdot g'(z^{(2)}). $$

In its source code, the variable dot_value holds these pre-activation values. The Python implementation, however, calls the derivative with the vector stored in the variable a, which holds the values after the activation function has been applied. Why?

I interpret this as an optimization that leverages the fact that the derivatives of sigmoid and tanh can be written entirely in terms of the activated values. Since the network already holds the post-activation values (as a), it can skip the redundant call to sigmoid or tanh when computing the derivatives. That's why the definition of tanh_prime in BogoToBogo does NOT call tanh inside it. The definition of sigmoid_prime, on the other hand, calls sigmoid again, which effectively applies the activation function twice and miscalculates the derivative.
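To illustrate the data flow, here is a minimal, self-contained sketch of a forward pass followed by the output-layer backpropagation step. It is not the tutorial's exact code: the layer sizes, random weights, and the restriction to the output layer are my own simplifications, but the variable names a and dot_value mirror its explanation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    # written in terms of the post-activation value a = sigmoid(z)
    return a * (1.0 - a)

# hypothetical input, target, and weights, just to make the sketch runnable
x = np.array([1.0, 0.0, 1.0])
target = np.array([1.0])
weights = [np.random.randn(3, 2), np.random.randn(2, 1)]

# forward pass: store the post-activation output of every layer in a
a = [x]
for W in weights:
    dot_value = np.dot(a[-1], W)     # pre-activation value z
    a.append(sigmoid(dot_value))     # post-activation value, kept for backprop

# backward pass (output layer only, for brevity): the derivative is called with
# the stored post-activation value, so it must expect an already-activated value
error = target - a[-1]
delta = error * sigmoid_prime(a[-1])
print(delta)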

Solution

Once I redefine sigmoid_prime so that it assumes its parameter has already had sigmoid applied to it, the example works fine.

def sigmoid_prime(x):
    # x is assumed to already be sigmoid(z), matching how tanh_prime is defined
    return x*(1.0-x)
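As a quick sanity check (my own, not part of the tutorial), feeding this definition the post-activation value agrees with a finite-difference estimate of the true derivative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(x):
    # x is assumed to already be sigmoid(z)
    return x * (1.0 - x)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference derivative
analytic = sigmoid_prime(sigmoid(z))                         # corrected definition applied to sigmoid(z)
print(numeric, analytic)  # both ~0.2217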

Then calling the implementation with

nn = NeuralNetwork([2,2,1], 'sigmoid', 500000)

successfully outputs:

(array([0, 0]), array([ 0.00597638]))
(array([0, 1]), array([ 0.99216467]))
(array([1, 0]), array([ 0.99332048]))
(array([1, 1]), array([ 0.00717885]))