RBM's are an interesting beast. To answer your question, and to jog my memory on them, I'll derive RBMs and talk through the derivation. You mentioned that you're confused on the likelihood, so my derivation will be from the perspective of trying to maximize the likelihood. So let's begin.
RBMs contain two different sets of neurons, visible and hidden, I'll denote them $v$ and $h$ respectively. Given a specific configuration of $v$ and $h$, we map it the probability space.
$$p(v,h) = \frac{e^{-E(v,h)}}{Z}$$
There are a couple things more to define. The surrogate function we use to map from a specific configuration to the probability space is called the energy function $E(v,h)$. The $Z$ constant is a normalization factor to ensure that we actually map to the probability space. Now let's get to what we're really looking for; the probability of a set of visible neurons, in other words, the probability of our data.
$$Z = \sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}$$
$$p(v)=\sum_{h \in H}p(v,h)=\frac{\sum_{h \in H}e^{-E(v,h)}}{\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}}$$
Although there are a lot of terms in this equation, it simply comes down to writing the correct probability equations. Hopefully, so far, this has helped you realize why we need energy function to calculate the probability, or what is done more usually the unnormalized probability $p(v)*Z$. The unnormalized probability is used because the partition function $Z$ is very expensive to compute.
Now let's get to the actual learning phase of RBMs. To maximize likelihood, for every data point, we have to take a gradient step to make $p(v)=1$. To get the gradient expressions it takes some mathematical acrobatics. The first thing we do is take the log of $p(v)$. We will be operating in the log probability space from now on in order to make the math feasible.
$$\log(p(v))=\log[\sum_{h \in H}e^{-E(v,h)}]-\log[\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}]$$
Let's take the gradient with respect to the paremeters in $p(v)$
\begin{align}
\frac{\partial \log(p(v))}{\partial \theta}=&
-\frac{1}{\sum_{h' \in H}e^{-E(v,h')}}\sum_{h' \in H}e^{-E(v,h')}\frac{\partial E(v,h')}{\partial \theta}\\ & + \frac{1}{\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}}\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}\frac{\partial E(v,h)}{\partial \theta}
\end{align}
Now I did this on paper and wrote the semi-final equation down as to not waste a lot of space on this site. I recommend you derive these equations yourself. Now I'll write some equations down that will help out in continuing our derivation. Note that: $Zp(v,h)=e^{-E(v,h')}$, $p(v)=\sum_{h \in H}p(v,h)$ and that $p(h|v) = \frac{p(v,h)}{p(h)}$
\begin{align}
\frac{\partial log(p(v))}{\partial \theta}&=
-\frac{1}{p(v)}\sum_{h' \in H}p(v,h')\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta}\\
\frac{\partial log(p(v))}{\partial \theta}&=
-\sum_{h' \in H}p(h'|v)\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta}
\end{align}
And there we go, we derived maximum likelihood estimation for RBM's, if you want you can write the last two terms via expectation of their respective terms (conditional, and joint probability).
Notes on energy function and stochasticity of neurons.
As you can see above in my derivation, I left the definition of the energy function rather vague. And the reason for doing that is that many different versions of RBM implement various energy functions. The one that Hinton describes in the lecture linked above, and shown by @Laurens-Meeus is:
$$E(v,h)=−a^Tv−b^Th−v^TWh.$$
It might be easier to reason about the gradient terms above via the expectation form.
$$\frac{\partial \log(p(v))}{\partial \theta}=
-\mathop{\mathbb{E}}_{p(h'|v)}\frac{\partial E(v,h')}{\partial \theta}+\mathop{\mathbb{E}}_{p(v',h')}\frac{\partial E(v',h')}{\partial \theta}$$
The expectation of the first term is actually really easy to calculate, and that was the genius behind RBMs. By restricting the connection the conditional expectation simply becomes a forward propagation of the RBM with the visible units clamped. This is the so called wake phase in Boltzmann machines. Now calculating the second term is much harder and usually Monte Carlo methods are utilized to do so. Writing the gradient via average of Monte Carlo runs:
$$\frac{\partial \log(p(v))}{\partial \theta}\approx
-\langle \frac{\partial E(v,h')}{\partial \theta}\rangle_{p(h'|v)}+\langle\frac{\partial E(v',h')}{\partial \theta}\rangle_{p(v',h')}$$
Calculating the first term is not hard, as stated above, therefore Monte-Carlo is done over the second term. Monte Carlo methods use random successive sampling of the distribution, to calculate the expectation (sum or integral). Now this random sampling in classical RBM's is defined as setting a unit to be either 0 or 1 based on its probability stochasticly, in other words, get a random uniform number, if it is less than the neurons probability set it to 1, if it is greater than set it to 0.