
Suppose you have an input layer with $n$ neurons and the first hidden layer has $m$ neurons, with typically $m < n$. Then you compute the activation $a_j$ of the $j$-th neuron in the hidden layer by

$a_j = f\left(\sum\limits_{i=1..n} w_{i,j} x_i+b_j\right)$, where $f$ is an activation function like $\tanh$ or $\text{sigmoid}$.

To train the network, you compute the reconstruction of the input, denoted $z$, and minimize the error between $z$ and $x$. Now, the $i$-th element in $z$ is typically computed as:

$ z_i = f\left ( \sum\limits_{j=1..m} w_{j,i}' a_j+b'_i \right) $
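
For concreteness, here is a small numpy sketch of these two steps; the variable names and sizes are just illustrative:

```python
import numpy as np

# Minimal sketch of the encode/decode pass above; all names (W, b, W_prime,
# b_prime) and sizes are illustrative assumptions, not fixed by the question.
rng = np.random.default_rng(0)
n, m = 8, 4                              # input size n, hidden size m < n

x = rng.normal(size=n)                   # input vector
W = rng.normal(size=(n, m)) * 0.1        # encoder weights w_{i,j}
b = np.zeros(m)                          # hidden biases b_j
W_prime = rng.normal(size=(m, n)) * 0.1  # decoder weights w'_{j,i}
b_prime = np.zeros(n)                    # input-layer biases b'_i

f = np.tanh                              # activation function

a = f(x @ W + b)                         # hidden activations a_j
z = f(a @ W_prime + b_prime)             # reconstruction z_i
error = np.mean((z - x) ** 2)            # reconstruction error to minimise
```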

I am wondering why the reconstruction $z$ is usually computed with the same activation function instead of the inverse function, and why separate $w'$ and $b'$ are useful instead of tied weights and biases. It seems much more intuitive to me to compute the reconstruction with the inverse activation function $f^{-1}$, e.g., $\text{arctanh}$, as follows:

$$ z_i' = \sum\limits_{j=1..m} \frac{f^{-1}(a_j)-b_j}{w_{j,i}^T} $$

Note that here tied weights are used, i.e., $w' = w^T$, and the biases $b_j$ of the hidden layer are reused, instead of introducing an additional set of biases for the input layer.
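
A minimal numpy sketch of this proposed reconstruction, literally implementing the formula above with illustrative names and sizes (and making no claim that it is a proper inverse of the encoder), might look like:

```python
import numpy as np

# Sketch of the proposed tied-weight "inverse" reconstruction; W, b, x and
# the sizes are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 8, 4
x = rng.normal(size=n)
W = rng.normal(size=(n, m)) * 0.1   # encoder weights w_{i,j}
b = np.zeros(m)                     # hidden biases b_j

a = np.tanh(x @ W + b)              # hidden activations a_j
h = np.arctanh(a) - b               # f^{-1}(a_j) - b_j
z_prime = (h / W).sum(axis=1)       # divide by w_{i,j} = (w^T)_{j,i}, sum over j
```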

And a very related question: To visualize features, instead of computing the reconstruction, one would usually create an identity matrix with the dimension of the hidden layer. Then, one would use each column of the matrix as input to a reactivation function, which induces an output in the input neurons. For the reactivation function, would it be better to use the same activation function (resp. the $z_i$) or the inverse function (resp. the $z'_i$)?
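
For reference, a small sketch of this visualization step (with hypothetical decoder parameters `W_prime`, `b_prime` and an illustrative hidden size `m`) could be:

```python
import numpy as np

# Sketch of the feature-visualization idea: each unit vector of the hidden
# layer is used as an "activation" and pushed through the decoder.
# W_prime, b_prime and the sizes are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 8, 4
W_prime = rng.normal(size=(m, n)) * 0.1
b_prime = np.zeros(n)

hidden_basis = np.eye(m)                               # row k: one-hot hidden activation
features = np.tanh(hidden_basis @ W_prime + b_prime)   # row k: induced input pattern
```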

Manfred Eppe

1 Answer


I don't think your assumption $w' = w^T$ holds. Or rather, it is not necessary, and when it is done, it is not in order to somehow automatically reverse the calculation that created the hidden layer features. It is not possible in general to reverse the compression from $n$ down to a smaller $m$ directly in this way. If that were the goal, you would want a form of matrix inversion, not a simple transpose.

Instead, we just want $w_{ij}$ for the compressed higher-level feature representation, and we will discard $w'_{ij}$ once the autoencoder has been trained.

You can set $w' = w^T$ and tie the weights. This can act as a form of regularisation, helping the autoencoder generalise, but it is not necessary.
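
A sketch of what tying looks like in practice (illustrative numpy names; typically only a separate decoder bias is kept):

```python
import numpy as np

# Tied weights: the decoder reuses the encoder matrix transposed (W' = W^T).
# Names and sizes are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 8, 4
x = rng.normal(size=n)
W = rng.normal(size=(n, m)) * 0.1
b, b_prime = np.zeros(m), np.zeros(n)

a = np.tanh(x @ W + b)              # encoder
z = np.tanh(a @ W.T + b_prime)      # decoder with tied weights
```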

For the autoencoder to work, it doesn't actually matter which activation function you use after the layer you are pre-training, provided the last layer of the autoencoder can express the range of possible inputs. However, you may get results of varying quality depending on what you use, as is normal for a neural network.

It is quite reasonable to use the same activation function that you are building the pre-trained layer for, as it is the simplest choice.

Using an inverse function is possible too, but not advisable for sigmoid or tanh, because e.g. $\text{arctanh}$ is only defined on the open interval $(-1, 1)$, so it would likely not be numerically stable.
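
A quick illustration of the problem (numpy):

```python
import numpy as np

# arctanh is only defined on (-1, 1) and diverges at the boundary, so tanh
# outputs that saturate near +/-1 quickly produce inf (with a RuntimeWarning).
print(np.arctanh(np.array([0.9, 0.999999, 1.0])))
# approximately: [1.472  7.254  inf]
```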

Neil Slater