I think this is what you are asking. You have to respect the permutation layer, otherwise canot apply the piling up lemma.
The approximations must be "joined together" between rounds by respecting the P-box mapping, so that they can be "added together with cancellation" to get multiround linear characteristics.
Say you apply an approximation $$a\cdot x\oplus b\cdot S(x)$$ to Sbox $S_{1,1}$ [round 1, 1st sbox] whose output constant $b$ is [1,0,0,1]. If you follow the wires at the output of this Sbox, you can see that you need to apply some linear approximation $$a'\cdot x \oplus b \cdot S(x)$$
where $a'=[1,0,0,0]$ to Sboxes $S_{2,1},$ and $S_{2,4}$. Then you use the fact that the key bits cancel and apply the piling up lemma to get a 2 round approximation involving Sboxes $S_{1,1},S_{2,1},S_{2,4}$.
Where is the optimization? You need to stick to relatively low hamming weight $b$ vectors to avoid turning on too many S-boxes since the piling up lemma is multiplicative and each new active S-box reduces the overall bias. You also want to find the highest individual bias characteristics.
In general, this needs to be checked exhaustively.
The following answer gives more general details about linear approximations:
Help with linear cryptanalysis