
Let $v \in \{0,1\}^M$ be the visible layer and $h \in \{0,1\}^N$ the hidden layer, where $M$ and $N$ are natural numbers. Given the biases $b \in \Re^M$, $c \in \Re^N$ and weights $W \in \Re^{M \times N}$, the energy and probability of an RBM are given by:

Energy, $E(v,h; b,c,W) = -b^T v - c^T h - h^T W^T v$

Probability, $p(v, h; b, c, W) = \frac{e^{-E(v,h; b,c,W)}}{Z}$

where $Z = \sum_{v,h} e^{-E(v,h; b,c,W)}$
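
To make these definitions concrete, here is a minimal NumPy sketch for a toy RBM; the sizes $M = 4$, $N = 3$ and all parameter values are made up for illustration and are not part of the question. The partition function is computed by brute-force enumeration, which is only feasible because the example is tiny.

```python
import numpy as np
from itertools import product

# Hypothetical toy RBM (sizes and values are illustrative only).
M, N = 4, 3
rng = np.random.default_rng(0)
b = rng.normal(size=M)        # visible biases
c = rng.normal(size=N)        # hidden biases
W = rng.normal(size=(M, N))   # weights

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - h^T W^T v, as defined above."""
    return -b @ v - c @ h - h @ W.T @ v

def partition_function():
    """Brute-force Z: sums over all 2^(M+N) joint configurations.
    This is exactly the exponential cost discussed below."""
    Z = 0.0
    for v in product([0, 1], repeat=M):
        for h in product([0, 1], repeat=N):
            Z += np.exp(-energy(np.array(v), np.array(h)))
    return Z

Z = partition_function()
v = np.array([1, 0, 1, 1])
h = np.array([0, 1, 0])
p_vh = np.exp(-energy(v, h)) / Z   # joint probability p(v, h)
```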

The negative log likelihood error for a Restricted Boltzmann Machine (RBM) is given by:

$\mathcal{L}(b,c,W) = \frac{1}{T} \sum_{t=1}^{T} \left( -\log \sum_{h} e^{-E(v^t, h; b,c,W)} \right) + \log Z$

where:

$T$ is the size of the training dataset; and

$v^t$ represents the $t^{\text{th}}$ data point in the training dataset.

It is clear that computing $Z$ (and hence $\log Z$) is intractable, because we have to sum over all $2^{M+N}$ configurations of $v$ and $h$, i.e. an exponential-time computation.

However, shouldn't computing $\sum_{h} e^{-E(v^t, h; b,c,W)}$ be intractable as well? Aren't we summing over all $2^N$ configurations of $h$ here? If, say, $N = 64$, we are already reaching exascale computations ($2^{64} \approx 1.84 \times 10^{19}$)!
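
For reference, a naive evaluation of the log-likelihood terms would indeed enumerate all $2^N$ hidden configurations (and all $2^{M+N}$ joint configurations for $Z$). The sketch below, which reuses the toy parameters and `energy` function from the earlier snippet, makes that cost explicit; it only runs because $M$ and $N$ are tiny.

```python
def free_term(v):
    """Naive sum_h exp(-E(v, h)): enumerates all 2^N hidden configurations."""
    return sum(np.exp(-energy(v, np.array(h)))
               for h in product([0, 1], repeat=N))

def negative_log_likelihood(V):
    """L(b, c, W) = (1/T) * sum_t [-log sum_h exp(-E(v^t, h))] + log Z.
    Both terms use brute-force enumeration, so this is only for toy sizes."""
    logZ = np.log(partition_function())
    return np.mean([-np.log(free_term(v)) for v in V]) + logZ

V_train = rng.integers(0, 2, size=(5, M))  # toy "training set"
print(negative_log_likelihood(V_train))
```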

1 Answer


Yes, calculating $\sum_{h} e^{-E(v, h)}$ does involve summing over the $2^N$ possible configurations of the hidden vector $h$. However, you don't actually need to calculate it:

If we have the negative log likelihood function
$$ \mathcal{L}(W, c, b) = - \frac 1T \sum_{t=1}^T \log\left(p(v^{(t)})\right), $$
where $p(v^{(t)}) = \frac 1Z \sum_h e^{-E(v^{(t)},h)}$, we don't need to evaluate $\mathcal{L}(W, c, b)$ directly; instead we want to minimize it using gradient descent. Writing $\theta$ for the full parameter vector instead of $W, c, b$ for simplicity, we get:
$$\begin{aligned} \nabla\mathcal{L}(\theta) &= \frac{\partial}{\partial\theta} \left(- \frac 1T \sum_{t=1}^T \log\left(p(v^{(t)})\right)\right) \\ &= \frac 1T \sum_{t=1}^T \left( \sum_h p(h \mid v^{(t)}) \frac{\partial E(v^{(t)},h)}{\partial\theta} - \sum_h\sum_v p(v,h)\frac{\partial E(v,h)}{\partial\theta} \right) \end{aligned}$$
The first term of the difference above is the expected value of $\frac{\partial E(v^{(t)},h)}{\partial\theta}$, given a specific visible vector (i.e. a training example) $v^{(t)}$. This can easily be calculated for each of the parameters $W, c, b$, because the hidden units are conditionally independent given $v^{(t)}$, so the expectation reduces to the $N$ conditional probabilities $p(h_j = 1 \mid v^{(t)})$.
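
As a rough illustration of this first (data-dependent) term, here is a hedged sketch using the toy parameters defined earlier; the gradients follow directly from $E(v,h) = -b^T v - c^T h - h^T W^T v$, and the helper name `positive_phase` is just a placeholder.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_phase(v):
    """Expected energy gradients under p(h | v) for one training example.
    The hidden units are conditionally independent given v, so the
    expectation over 2^N hidden states reduces to N probabilities."""
    p_h = sigmoid(c + W.T @ v)      # p(h_j = 1 | v)
    grad_W = -np.outer(v, p_h)      # E[ dE/dW ] = -v p(h=1|v)^T
    grad_b = -v                     # E[ dE/db ] = -v
    grad_c = -p_h                   # E[ dE/dc ] = -p(h=1|v)
    return grad_W, grad_b, grad_c
```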

The second term is the expected value of the energy gradient $\frac{\partial E(v,h)}{\partial\theta}$, taken over the full joint distribution of the RBM. Unfortunately, this is again intractable. Here the contrastive divergence algorithm by Hinton comes in useful: we run Gibbs sampling for a number of steps and estimate the second term using the resulting states $\tilde{v}$.
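
Below is a rough CD-$k$ sketch in the same toy setup, following the description above: start the Gibbs chain at a training example, alternate sampling $h \sim p(h \mid v)$ and $v \sim p(v \mid h)$ for $k$ steps, and plug the resulting state in place of the intractable model expectation. Function names and the learning rate are illustrative, not from the original answer.

```python
def sample_h_given_v(v):
    p = sigmoid(c + W.T @ v)
    return (rng.random(N) < p).astype(float), p

def sample_v_given_h(h):
    p = sigmoid(b + W @ h)
    return (rng.random(M) < p).astype(float), p

def contrastive_divergence_step(v0, k=1, lr=0.1):
    """CD-k sketch: run k Gibbs steps from v0 and use the final state
    v_tilde to approximate the model expectation in the gradient."""
    global W, b, c
    h_k, p_h0 = sample_h_given_v(v0)
    v_k = v0
    p_hk = p_h0
    for _ in range(k):
        v_k, _ = sample_v_given_h(h_k)
        h_k, p_hk = sample_h_given_v(v_k)
    # gradient of -log p(v0) ~= positive phase minus negative phase
    W -= lr * (-np.outer(v0, p_h0) + np.outer(v_k, p_hk))
    b -= lr * (-v0 + v_k)
    c -= lr * (-p_h0 + p_hk)
```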

tl;dr: yes, the exact computation is intractable, which is why we use Hinton's contrastive divergence algorithm; it is only an approximation, but it works well in practice.

hbaderts