3

Let $\mu$ be a probability distribution over $\mathbb{R}^n$. All functions discussed henceforth are from $\mathbb{R}^n$ to $\mathbb{R}$. Let $l^\ast$ be a linear function and $f$ be a function such that $f=l^\ast$ on a set containing strictly more than 50% of the probability mass. Show that $l^\ast$ is a minimizer of the $L_1(\mu)$ error $\|f-l\|_{L_1(\mu)}$ over all linear functions $l$.

I tried to prove this for specific $f$'s to get intuition. But even for specific $f$'s, there are a lot of cases to handle. It didn't look nice at all, and I decided to give up.

I looked around the area of "robust linear regression" on the internet, but couldn't find an explanation for this problem.

I also tried to find inspiration in the various proofs that the median minimizes the $\ell_1$ error of a finite set of reals. In this direction, I tried to find the point at which the gradient of the objective function equals zero. Here too, I reached a point with a large number of cases to handle, and decided to give up. (A quick numerical illustration of the median fact is sketched below.)
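For reference, here is a throwaway Python sketch of my own (not from any textbook) illustrating the one-dimensional fact I was trying to mimic, namely that the median minimizes the $\ell_1$ error over constants:

```python
import numpy as np

# Numerical check that the median minimizes c -> sum_i |x_i - c|
# over a fine grid of candidate constants c.
rng = np.random.default_rng(0)
xs = rng.normal(size=101)
cs = np.linspace(xs.min(), xs.max(), 10_001)
errs = np.abs(xs[None, :] - cs[:, None]).sum(axis=1)
print(cs[np.argmin(errs)], np.median(xs))  # the two agree up to grid resolution
```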

3 Answers

4

Counter-example: In my notation, $\mathcal L_{[a,b]}$ is the Lebesgue measure on $[a,b]$. Consider $\mu=\frac{\mathcal L_{[0,2]}+\mathcal L_{[3,4]}}{3}$ and define $$f(x)=\begin{cases}x & \text{if } x\in[3,4], \\ 0 & \text{otherwise.} \end{cases}$$ In this situation we have $l^*\equiv0$ (because $f|_{[0,2]}=0$ and $\mu([0,2])=\frac{2}{3}$). If $l(x)=x$, then $$\|f-l^*\|=\frac{1}{3}\int_3^4 x\,dx=\frac{7}{6}>\frac{2}{3}=\frac{1}{3}\int_0^2 x\,dx=\|f-l\|.$$
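For anyone who wants to double-check the two integrals, here is a minimal numerical sketch (the quadrature helper `l1_error` is mine, not part of the construction):

```python
import numpy as np

# mu = (Lebesgue on [0,2] + Lebesgue on [3,4]) / 3;  f(x) = x on [3,4], else 0.
def f(x):
    return np.where((x >= 3) & (x <= 4), x, 0.0)

def l1_error(l, n=200_001):
    # Trapezoidal quadrature on each piece of the mixture.
    x1, x2 = np.linspace(0, 2, n), np.linspace(3, 4, n)
    return (np.trapz(np.abs(f(x1) - l(x1)), x1)
            + np.trapz(np.abs(f(x2) - l(x2)), x2)) / 3

print(l1_error(lambda x: 0.0 * x))  # l* = 0    -> 7/6 ≈ 1.1667
print(l1_error(lambda x: x))        # l(x) = x  -> 2/3 ≈ 0.6667
```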

edit: There is a much simpler counter-example. Let $\mu=\frac{\mathcal L_{[0,2]}}{2}$ (normalized so that $\mu$ is a probability measure) and $\varepsilon>0$, and let $f$ be the function $$f(x)=\begin{cases}x & \text{if } x\in[1+\varepsilon,2],\\ 0 & \text{otherwise,}\end{cases}$$ so that $f=l^*\equiv0$ on $[0,1+\varepsilon)$, a set of mass $\frac{1+\varepsilon}{2}>\frac{1}{2}$. With $l(x)=x$, $$\|f-l^*\|=\frac{1}{2}\int_{1+\varepsilon}^2 x\,dx=1-\frac{(1+\varepsilon)^2}{4}>\frac{(1+\varepsilon)^2}{4}=\frac{1}{2}\int_0^{1+\varepsilon}x\,dx=\|f-l\|,$$ which holds precisely when $(1+\varepsilon)^2<2$, i.e. for $0<\varepsilon<\sqrt{2}-1$.
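And a minimal numerical check of that threshold (again just a sketch of mine; the constants come from the two integrals above):

```python
import numpy as np

# mu = uniform on [0,2] (density 1/2); f = x on [1+eps, 2], else 0; l* = 0, l(x) = x.
def errs(eps):
    err_lstar = 1 - (1 + eps) ** 2 / 4   # (1/2) * int_{1+eps}^2 x dx
    err_l = (1 + eps) ** 2 / 4           # (1/2) * int_0^{1+eps} x dx
    return err_lstar, err_l

for eps in [0.1, 0.3, np.sqrt(2) - 1, 0.5]:
    e_star, e_l = errs(eps)
    print(f"eps={eps:.4f}  ||f-l*||={e_star:.4f}  ||f-l||={e_l:.4f}  l beats l*: {e_l < e_star}")
```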

Bongo
  • 1,320
  • Nice one. I did not try a distribution with disconnected support, and it seems to do the trick. But now I wonder if we can make it work somehow. Maybe I will have to write my own question. :) – nicomezi Aug 07 '23 at 18:30
  • Very nice! A concrete question would be: will the statement hold if $\mu$ is the uniform distribution over $[-1,1]^n$? – Mathews Boban Aug 08 '23 at 01:32
  • I edited my answer; if you want $\mu=\mathcal L_{[-2,2]}$, you just have to mirror $f$ into $[-2,0]$ (so that $f(-x)=-f(x)$). – Bongo Aug 08 '23 at 15:58
2

Disclaimer: this is NOT an answer but the track I was following until I also got stuck; I would be interested in a possible conclusion from here. I allowed myself to post it because I think I made significant progress in the right direction (hopefully), so someone could finish the job. Please let me know if this is inappropriate and I will delete it.

Let me rewrite the problem to set some notations. We want to show that

$$\arg \min_{l \text{ linear}} \|f - l\|_{L^1(\mu)} = l^*$$

where $l^*=f$ on a set $A \subset \mathbb{R}^n$ with $\mu(A) = \int_A d\mu > 1/2$. We have:

$$\|f - l\|_{L^1(\mu)} = \int |f-l|d \mu = \int_A |f-l|d \mu+\int_{\bar{A}} |f-l|d \mu.$$

Let $l$ be such that $\|f - l\|_{L^1(\mu)} \le \|f - l^*\|_{L^1(\mu)}$. Then, since $f=l^*$ on $A$ (so the integral of $|f-l^*|$ over $A$ vanishes):

$$\int_A |f-l|d \mu+\int_{\bar{A}} |f-l|d\mu \le \int_{\bar{A}} |f-l^*|d \mu$$

Rearranging and using the reverse triangle inequality yields:

$$\int_A |f-l|\,d\mu \le \int_{\bar{A}} \big(|f-l^*|-|f-l|\big)\,d\mu \le \int_{\bar{A}} \big||f-l^*|-|f-l|\big|\,d\mu \le \int_{\bar{A}} |l^*-l|\,d\mu.$$

But, once more, $f=l^*$ on $A$, so $|f-l|=|l^*-l|$ there, and we must have: $$\int_A |m|\,d\mu \le \int_{\bar{A}} |m|\,d\mu$$

where $m=l^*-l$ is itself a linear map. Now I think we should be able to use the linearity of $m$ and the fact that $\mu(A)>1/2$ to conclude, but I have failed to do so until now.
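As a sanity check (a sketch of mine, reusing the one-dimensional setting from Bongo's edit above), the final inequality can genuinely hold for a linear $m$ with $\mu(A)>1/2$, so linearity and the mass bound alone do not produce a contradiction:

```python
# Setting borrowed from Bongo's edit: mu = uniform on [0,2] (density 1/2),
# A = [0, 1+eps], m(x) = x linear. Check whether
#   int_A |m| dmu <= int_{A^c} |m| dmu
# can hold while mu(A) > 1/2 -- it can.
for eps in [0.05, 0.2, 0.4]:
    mass_A = (1 + eps) / 2                # mu(A), strictly above 1/2
    lhs = (1 + eps) ** 2 / 4              # (1/2) * int_0^{1+eps} x dx
    rhs = 1 - (1 + eps) ** 2 / 4          # (1/2) * int_{1+eps}^2 x dx
    print(f"eps={eps}: mu(A)={mass_A:.3f}, lhs={lhs:.4f}, rhs={rhs:.4f}, holds: {lhs <= rhs}")
```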

nicomezi
  • 8,399
2

I found a counter-example while trying to extend nicomezi's argument.

Take $n = 1$, $\mu$ to be the uniform distribution over $[-1,1]$, $l^\ast = 0$ and $l = 1 + x$. Define $f$ to be equal to $l^\ast$ on $[-1,\epsilon]$ and equal to $l$ on $[\epsilon, 1]$ for some small enough $\epsilon > 0$.

Now $\|f-l^\ast\|_{L_1(\mu)}$ is close to $3/4$ while $\|f-l\|_{L_1(\mu)}$ is close to $1/4$ (remember the uniform density on $[-1,1]$ is $1/2$).
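A minimal numerical check of these two limits (my own sketch, with the uniform density $1/2$ included):

```python
import numpy as np

# mu = uniform on [-1,1] (density 1/2); l* = 0, l(x) = 1 + x;
# f = l* on [-1, eps] and f = l on [eps, 1].
eps = 0.01
x = np.linspace(-1.0, 1.0, 400_001)
f = np.where(x >= eps, 1.0 + x, 0.0)
err_lstar = np.trapz(np.abs(f), x) / 2            # -> 3/4 as eps -> 0
err_l = np.trapz(np.abs(f - (1.0 + x)), x) / 2    # -> 1/4 as eps -> 0
print(err_lstar, err_l)
```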

  • $l(x)=1+x$ is not linear. If instead we let $l=ax$ and $\varepsilon>0$ in your example, we get $\|f-l\|_{L_1(\mu)} = \frac{a}{4}(1+\varepsilon^2)$ and $\|f-l^*\|_{L_1(\mu)} = \frac{a}{4}(1-\varepsilon^2)$. Then it seems true (in this case at least). – nicomezi Aug 07 '23 at 14:21
  • When I wrote the question, I meant linear functions to include affine functions. But, as you said, the question is still open when we optimize over purely linear functions. It would be interesting to find a proof or counter-example for this. – Mathews Boban Aug 07 '23 at 14:34
  • Maybe it is possible to construct an example similar to yours. I actually tried a similar one with purely linear maps, and the claim held there. So maybe the statement is true with the added constraint that $l(0)=0$. Did you find this problem somewhere, or did you come up with it on your own? – nicomezi Aug 07 '23 at 14:45
  • It's not a problem from a textbook, but I don't know if it's entirely new. I have seen in different places on the internet that $L_1$ regression is tolerant of outliers, but I couldn't find an explanation for our problem anywhere. – Mathews Boban Aug 07 '23 at 14:58