I'm trying to wrap my head around the connection between statistical regression and its probability-theoretic justification. Many books on statistics/machine learning introduce the idea of a loss function, typically followed by a phrase of the flavour 'a popular choice for this function is the mean squared loss'. As far as I understand, the justification for this choice stems from the theorem that
$$ \arg\min_{Z \in L^2(\mathcal{G})} \ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ X \mid \mathcal{G} \right] \tag{*} $$
where $X$ is the random variable to be estimated based on the information contained in the $\sigma$-algebra $\mathcal{G}$. As far as I understand, probability theory teaches us that the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the best such estimate in the mean-square sense. If that's the case, why should our loss function still be a choice? Clearly we should be statistically estimating $\mathbb{E}[X \mid \mathcal{G}]$, which by (*) amounts to minimizing the MSE.
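For concreteness, here is the unconditional version of the phenomenon I'm asking about: minimizing expected squared error over constants targets the mean, while minimizing expected absolute error targets the median, so the loss already pins down *which* functional of the distribution we estimate. A minimal numerical sketch (the sample size, grid, and variable names are my own illustrative choices):

```python
# Sketch: over constants c, E[(X - c)^2] is minimized at the mean of X,
# while E[|X - c|] is minimized at the median. We approximate both
# expectations by sample averages over a grid of candidate constants.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)  # skewed, so mean != median

cs = np.linspace(0.0, 3.0, 301)  # candidate constants c
mse = ((x[:, None] - cs[None, :]) ** 2).mean(axis=0)  # sample E[(X - c)^2]
mae = np.abs(x[:, None] - cs[None, :]).mean(axis=0)   # sample E[|X - c|]

c_mse = cs[mse.argmin()]
c_mae = cs[mae.argmin()]

print(f"argmin of squared loss ~ {c_mse:.3f}, sample mean   = {x.mean():.3f}")
print(f"argmin of absolute loss ~ {c_mae:.3f}, sample median = {np.median(x):.3f}")
```

The same split holds conditionally: the $L^2$ problem in (*) is solved by the conditional mean, while the analogous $L^1$ problem is solved by a conditional median.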
You could argue that such reasoning is circular because we define the conditional expectation to satisfy (*), but that doesn't seem true: conditional expectations exist for any random variable in $L^1$, and moreover there have been many eloquent posts on this website explaining how the $L^1$ definition can be intuitively interpreted in terms of measurability capturing the information contained in $\mathcal{G}$, etc. I would greatly appreciate it if someone could clear up my confusion.