The paper also discusses linear characteristics for multiple rounds:
We present several approaches to produce linear characteristics for
SIMON32/64 and present the best known linear characteristic for
11-round SIMON 32/64 with the bias of $2^{-16}$. We then extend this
characteristic to 13 rounds of the cipher.
We present several
approaches to produce an LC for SIMON32/64, as a case study, and
present the best known 11-round LC for this cipher with the bias of
$2^{-16}$ (expendable to 13 rounds of the cipher).
The linear approximations you list seem to come from equation 3 of the paper. SIMON is a Feistel cipher. By combining the equations that describe one Feistel round, the authors of the paper get the following linear expression for an entire round:
$$ (P_R)_2 \oplus (K^1)_2 \oplus (X^1_L)_2 = (P_L)_0, $$
which holds with probability $3/4$. Here $X^1 = X^1_L || X^1_R$ is the output of the first round, and $P = P_L || P_R$ is the plaintext. The notation $(X)_i$ is used for the $i$th bit of $X$.
Note that the expression is written here for the first round, but it is equally applicable to any round $i$ when written in the following form:
$$ (X^{i-1}_R)_2 \oplus (K^i)_2 \oplus (X^i_L)_2 = (X^{i - 1}_L)_0, $$
where we replaced $P$ by $X^{i - 1}$ and $X^1$ by $X^i$.
This can of course also be written as
$$ (X^{i-1}_R)_2 \oplus (K^i)_2 \oplus (X^{i - 1}_L)_0 = (X^i_L)_2. $$
This is still true with probability $3/4$.
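The $3/4$ figure can be checked by brute force. The following is a minimal sketch, assuming the standard SIMON round function $F(x) = (x \lll 1 \,\&\, x \lll 8) \oplus (x \lll 2)$ on 16-bit words and assuming that $(X)_0$ denotes the least significant bit. The helper names (`rotl`, `f`, `bit`, `approximation_holds`, `count_holds`) and the fixed values for the right half and the round key are made up for illustration. Since $(F(x))_2 = (x_1 \,\&\, x_{10}) \oplus x_0$, the approximation fails exactly when bits 1 and 10 of the left half are both set.

```python
N = 16                      # word size of SIMON32/64
MASK = (1 << N) - 1

def rotl(x, r):
    """Rotate an N-bit word left by r positions."""
    return ((x << r) | (x >> (N - r))) & MASK

def f(x):
    """SIMON round function: (x <<< 1 & x <<< 8) ^ (x <<< 2)."""
    return (rotl(x, 1) & rotl(x, 8)) ^ rotl(x, 2)

def bit(x, i):
    """Bit i of x, with bit 0 the least significant bit (an assumption)."""
    return (x >> i) & 1

def approximation_holds(left, right, key):
    """Check (X_R)_2 ^ (K)_2 ^ (X_L)_0 == (X'_L)_2 for one Feistel round."""
    new_left = right ^ f(left) ^ key
    return (bit(right, 2) ^ bit(key, 2) ^ bit(left, 0)) == bit(new_left, 2)

def count_holds(right=0x1234, key=0xBEEF):
    """Count over all 2^16 left halves; right half and key are arbitrary
    here because their bits cancel on both sides of the expression."""
    return sum(approximation_holds(left, right, key) for left in range(1 << N))

print(count_holds() / (1 << N))  # 0.75
```

Changing `right` or `key` does not change the count, which is why the key bit only shows up as a constant in the expression.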
If we "substitute" (pile up) this last equation into a 3-round Feistel network (shown in Figure 3 of the paper), that is, we combine the approximations for rounds $i$ and $i + 2$ using the Feistel identities $X^{i+1}_R = X^i_L$ and $X^{i+1}_L = X^{i+2}_R$, we get:
$$(X^{i - 1}_R)_2 \oplus (K^i)_2 \oplus (X_L^{i -1})_0 = (X^{i + 2}_R)_0 \oplus (K^{i + 2})_2 \oplus (X_L^{i + 2})_2,$$
or:
$$\Sigma_K \oplus (X^{i - 1}_R)_2 \oplus (X_L^{i -1})_0 \oplus (X^{i + 2}_R)_0 \oplus (X_L^{i + 2})_2 = 0,$$
with $\Sigma_K = (K^i)_2 \oplus (K^{i + 2})_2$.
The paper goes on to use this expression for more rounds, but we can stop here. The above expression is a linear characteristic for 3 rounds of the cipher. Since we know with which probability the above expression holds (depending on the parity of $\Sigma_K$), we can start obtaining key bits.
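To make the "obtaining key bits" step concrete: by the piling-up lemma, two approximations with bias $1/4$ each combine to a bias of $2 \cdot (1/4)^2 = 1/8$, so the data-only part of a three-round relation of this kind (input bits from $X^{i-1}$, output bits from the state after round $i + 2$) should equal $\Sigma_K$ with probability about $5/8$. The sketch below estimates that frequency and takes a majority vote to guess $\Sigma_K$. It assumes independent random round keys rather than the real SIMON key schedule, and bit 0 as the least significant bit; all function names are hypothetical.

```python
import random

N = 16
MASK = (1 << N) - 1

def rotl(x, r):
    return ((x << r) | (x >> (N - r))) & MASK

def f(x):
    """SIMON round function on a 16-bit word."""
    return (rotl(x, 1) & rotl(x, 8)) ^ rotl(x, 2)

def bit(x, i):
    return (x >> i) & 1

def three_round_parity(left, right, keys):
    """XOR of the data bits in the three-round relation (Sigma_K left out)."""
    out_left, out_right = left, right
    for k in keys:                      # three Feistel rounds
        out_left, out_right = out_right ^ f(out_left) ^ k, out_left
    return bit(right, 2) ^ bit(left, 0) ^ bit(out_right, 0) ^ bit(out_left, 2)

def estimate_sigma_k(keys, samples, rng):
    """Majority vote over random inputs: the parity that occurs more often
    is taken as the guess for Sigma_K. Also returns the frequency of 0."""
    zeros = 0
    for _ in range(samples):
        left, right = rng.getrandbits(N), rng.getrandbits(N)
        if three_round_parity(left, right, keys) == 0:
            zeros += 1
    guess = 0 if zeros > samples / 2 else 1
    return guess, zeros / samples

rng = random.Random(2024)
keys = [rng.getrandbits(N) for _ in range(3)]   # illustrative round keys
sigma_k = bit(keys[0], 2) ^ bit(keys[2], 2)
guess, freq = estimate_sigma_k(keys, 20000, rng)
print(sigma_k, guess, freq)
```

The observed frequency of parity 0 should sit near $5/8$ when $\Sigma_K = 0$ and near $3/8$ when $\Sigma_K = 1$, which is what lets the vote recover one bit of key information.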
This document explains that part well:
The process followed involves partially decrypting the last round of
the cipher. Specifically, for all possible values of the target
partial subkey, the corresponding ciphertext bits are exclusive-ORed
with the bits of the target partial subkey and the result is run
backwards through the corresponding S-boxes. This is done for all
known plaintext/ciphertext samples and a count is kept for each value
of the target partial subkey. The count for a particular target
partial subkey value is incremented when the linear expression holds
true for the bits into the last round’s S-boxes (determined by the
partial decryption) and the known plaintext bits. The target partial
subkey value which has the count which differs the greatest from half
the number of plaintext/ciphertext samples is assumed to represent the
correct values of the target partial subkey bits. This works because
it is assumed that the correct partial subkey value will result in the
linear approximation holding with a probability significantly
different from 1/2. (Whether it is above or below 1/2 depends on
whether a linear or affine expression is the best approximation and
this depends on the unknown values of the subkey bits implicitly
involved in the linear expression.) An incorrect subkey is assumed to
result in a relatively random guess at the bits entering the S-boxes
of the last round and as a result, the linear expression will hold
with a probability close to 1/2.
The above applies to SPNs, but for a Feistel network the principle is the same.
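As a rough illustration of that counting procedure on a Feistel structure, here is a toy sketch that is not taken from the paper: five SIMON-like rounds with independent random round keys (an assumption, not the real key schedule), a three-round relation over rounds 1 to 3 of the form above with its output bits taken from $X^3$, and a guess for the two bits of the last round key that enter the partial decryption through the AND, namely bits 15 and 8 when bit 0 is the least significant bit. Key bits that enter only linearly flip the parity by a constant and need not be guessed. All names are hypothetical.

```python
import random

N = 16
MASK = (1 << N) - 1

def rotl(x, r):
    return ((x << r) | (x >> (N - r))) & MASK

def f(x):
    return (rotl(x, 1) & rotl(x, 8)) ^ rotl(x, 2)

def bit(x, i):
    return (x >> i) & 1

def encrypt(left, right, round_keys):
    for k in round_keys:
        left, right = right ^ f(left) ^ k, left
    return left, right

def make_pairs(round_keys, count, rng):
    """Known plaintext/ciphertext pairs under the given round keys."""
    pairs = []
    for _ in range(count):
        pt = (rng.getrandbits(N), rng.getrandbits(N))
        pairs.append((pt, encrypt(pt[0], pt[1], round_keys)))
    return pairs

def rank_subkey_guesses(pairs):
    """Keep one counter per guess of (K5 bit 15, K5 bit 8) and score each
    guess by how far its count deviates from half the number of samples."""
    half = len(pairs) / 2
    scores = {}
    for g15 in (0, 1):
        for g8 in (0, 1):
            count = 0
            for (pl, pr), (cl, cr) in pairs:
                y = cl ^ f(cr)                 # X^3_L up to the unknown K5
                y15 = bit(y, 15) ^ g15         # guessed key bits enter the AND
                y8 = bit(y, 8) ^ g8
                x3r_0 = bit(cr, 0) ^ (y15 & y8) ^ bit(y, 14)
                x3l_2 = bit(y, 2)
                if (bit(pr, 2) ^ bit(pl, 0) ^ x3r_0 ^ x3l_2) == 0:
                    count += 1
            scores[(g15, g8)] = abs(count - half)
    return max(scores, key=scores.get), scores

rng = random.Random(7)
round_keys = [rng.getrandbits(N) for _ in range(5)]
best, scores = rank_subkey_guesses(make_pairs(round_keys, 4000, rng))
print(best, (bit(round_keys[4], 15), bit(round_keys[4], 8)))
```

With a few thousand pairs the correct guess should stand out clearly, because wrong guesses XOR an essentially random bit into the parity and push the count back toward half the number of samples.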