
I'm trying to wrap my head around the connection between statistical regression and its probability-theoretic justification. In many books on statistics/machine learning, one is introduced to the idea of the loss function, which is then typically followed by a phrase of the flavour 'a popular choice for this function is mean squared loss'. As far as I understand, the justification for this choice stems from the theorem that

$$ \arg\min_{Z \in L^2(\mathcal{G})} \ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ X \mid \mathcal{G} \right] \tag{*} $$

where $X$ is the random variable to be estimated based on the information contained in $\mathcal{G}$. As far as I understand, probability theory teaches us that the conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the best such estimate. If that's the case, why should our loss function still be a choice? Clearly we should be statistically estimating $\mathbb{E}[X \mid \mathcal{G}]$, which by (*) implies minimizing the MSE.
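
If I write out the standard argument for (*) (for $Z \in L^2(\mathcal{G})$), it is the decomposition

$$ \mathbb{E} \left[ (X - Z)^2 \right] = \mathbb{E} \left[ \left( X - \mathbb{E}[X \mid \mathcal{G}] \right)^2 \right] + \mathbb{E} \left[ \left( \mathbb{E}[X \mid \mathcal{G}] - Z \right)^2 \right], $$

where the cross term vanishes by the tower property, because $\mathbb{E}[X \mid \mathcal{G}] - Z$ is $\mathcal{G}$-measurable:

$$ \mathbb{E} \big[ (X - \mathbb{E}[X \mid \mathcal{G}]) (\mathbb{E}[X \mid \mathcal{G}] - Z) \big] = \mathbb{E} \big[ \mathbb{E}[X - \mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{G}] \, (\mathbb{E}[X \mid \mathcal{G}] - Z) \big] = 0. $$

The first term does not depend on $Z$, so the minimum over $Z$ is attained exactly at $Z = \mathbb{E}[X \mid \mathcal{G}]$.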

You could argue that such reasoning is circular because we define the conditional expectation to satisfy (*), but that doesn't seem true, as we have conditional expectations for any random variable in $L^1$, and moreover there have been many eloquent posts on this website explaining how the $L^1$ definition can be intuitively interpreted in terms of measurability capturing the information contained in $\mathcal{G}$, etc. I would greatly appreciate it if someone could clear up my confusion.

  • If you have $X\in L^2$, then in particular $X\in L^1$ (since we're dealing with probability measures). Now, one can prove that the conditional expectation $\Bbb{E}(X \mid \mathcal{G})$, as defined using the $L^1$ definition, is indeed the orthogonal projection of $X$ from $L^2(\Bbb{P})$ onto $L^2(\Bbb{P}|_{\mathcal{G}})$, and thus is the unique minimizer. – peek-a-boo Dec 31 '21 at 19:24
  • Maybe a word of the inventor himself is useful: https://math.stackexchange.com/questions/3392170/linear-fit-why-do-we-minimize-the-variance-and-not-the-sum-of-all-deviations/3392200#3392200 – Michael Hoppe Dec 31 '21 at 19:32

1 Answer


When considering a loss function you also need to consider the finite-sample convergence properties, variance, and existence of the conditional mean.

For example:

If the data are heavy-tailed, then the sample mean can swing rapidly until $n$ is large, or the mean may not exist at all (e.g. Cauchy error terms). You may then opt for a more robust loss function (e.g., https://en.m.wikipedia.org/wiki/Huber_loss).
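
A minimal sketch of this effect (my own illustration, not part of the original answer; it assumes NumPy and SciPy, and the Huber cutoff $\delta = 1$ is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# Heavy-tailed sample: standard Cauchy draws (the population mean does not exist).
x = rng.standard_cauchy(500)

def huber_loss(c, delta=1.0):
    """Average Huber loss of the constant estimate c: quadratic near zero, linear in the tails."""
    r = np.abs(x - c)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

mean_est = x.mean()                        # minimizes squared loss
huber_est = minimize_scalar(huber_loss).x  # minimizes Huber loss

print(f"sample mean (squared loss): {mean_est:8.3f}")
print(f"Huber estimate:             {huber_est:8.3f}")
# Re-running with different seeds: the sample mean jumps around wildly,
# while the Huber estimate stays close to the true location 0.
```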

Another example is when the impact of errors is not symmetric: you may want the estimate to be a little biased based on economic considerations (asymmetric loss).
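
One common asymmetric choice is the pinball (quantile) loss; the sketch below is my own illustration (NumPy assumed, and the quantile level $\tau = 0.9$ is arbitrary), showing that the optimal constant prediction shifts away from the mean toward the $\tau$-quantile:

```python
import numpy as np

def pinball_loss(y, pred, tau=0.9):
    """Asymmetric loss: under-prediction is weighted by tau, over-prediction by 1 - tau."""
    err = y - pred
    return np.mean(np.maximum(tau * err, (tau - 1.0) * err))

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=10_000)  # skewed quantities to be predicted

# Minimize over a constant prediction by grid search.
grid = np.linspace(0.0, 10.0, 2001)
best = grid[np.argmin([pinball_loss(y, c) for c in grid])]

print(f"mean:                     {y.mean():.2f}")
print(f"pinball-optimal constant: {best:.2f}")
print(f"empirical 0.9 quantile:   {np.quantile(y, 0.9):.2f}")
# The pinball-optimal prediction matches the 0.9 quantile, a deliberately
# 'biased' estimate that hedges against costly under-prediction.
```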

In the end, you need to get the best decisions from your algorithm given the quantity and quality of the data you have; quadratic loss (MSE) may not be the best choice then.

  • So if we used the MAD criterion, for example, we would accept that we're losing out on 'theoretical optimality' in favour of faster convergence, etc.? And in that case, do we just give up on the notion that we're estimating a random variable analogous to the conditional expectation? Or is there an alternative theoretical framework? – Othman El Hammouchi Dec 31 '21 at 19:39
  • Either you are getting an estimate with lower MSE (by exploiting the bias–variance trade-off), or you are explicitly saying that the conditional expectation is not what you are after (e.g., the conditional median, or some more problem-specific estimand) –  Dec 31 '21 at 19:42
  • But in the first case you're still using MSE as your guiding criterion (and so presumably you're estimating the conditional expectation), even though you might not implement it using the OLS algorithm, right? As for the conditional median, I must confess my probability texts say nothing about it. However, by the above, it should be undeniable that it isn't the optimal estimator, right? So in that case you accept loss of theoretical optimality? Is there a way of quantifying why the median is valuable in approximating $X$ based on some $\mathcal{G}$? – Othman El Hammouchi Dec 31 '21 at 19:48
  • @othi correct: non-OLS methods try to get at the conditional mean by adding bias to reduce variance. As to the second point, again it depends on why you are approximating $X$: why is the expected value of $X$ something good to get right? Because you are indifferent to the direction of the error and you expect to do this many times (e.g., trading), so the mean has value. You may instead want to know where you are even-odds of gaining or losing, and then estimating the mean is not optimal. –  Dec 31 '21 at 19:49
  • But theoretically speaking, $\mathbb{E}[X \mid \mathcal{G}]$ is not just the estimator that gets the expected value 'right', it gets $X$ itself right, i.e. it's the best regressor. What indication is there that the conditional median would do the same job? According to the above, it's bound to be a worse regressor; what theoretical justification is there to think that it is a good regressor at all? – Othman El Hammouchi Dec 31 '21 at 20:59
  • @othi “right” is always in terms of a loss function. The conditional mean minimizes the MSE but not the MAE, for example; it all depends on the loss function you choose, which is often MSE because it is analytically nice and has nice large-sample theory for approximate results (e.g., MLEs are asymptotically normal in many cases, and the MSE-minimizing estimator often behaves asymptotically like an MLE). A small numeric sketch of the mean-vs-median point is included after these comments. –  Dec 31 '21 at 21:42
  • @othi basically, the issue here is language — an estimator is better than another only relative to how we measure “close” –  Dec 31 '21 at 21:44
  • @othi this discussion may help too: https://stats.stackexchange.com/questions/355538/why-does-minimizing-the-mae-lead-to-forecasting-the-median-and-not-the-mean –  Dec 31 '21 at 21:46
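
A small numeric sketch of the mean-vs-median point raised in the comments above (my own illustration, assuming NumPy; the lognormal sample is just an arbitrary skewed example):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)  # right-skewed: mean > median

# Evaluate both losses over a grid of constant predictions c.
grid = np.linspace(0.01, 5.0, 1000)
mse = [np.mean((x - c) ** 2) for c in grid]   # squared loss
mae = [np.mean(np.abs(x - c)) for c in grid]  # absolute loss

print(f"argmin of squared loss:  {grid[np.argmin(mse)]:.3f}  (sample mean   {x.mean():.3f})")
print(f"argmin of absolute loss: {grid[np.argmin(mae)]:.3f}  (sample median {np.median(x):.3f})")
# Neither minimizer is 'wrong': each is optimal with respect to its own loss.
```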