I am doing a homework sheet as practice for an upcoming course in multivariate statistics and have been stuck on the following problem:
Let $(Y,X_1,X_2,\ldots,X_p)^T\stackrel{d}{=}\mathcal{N}_{p+1}(0,\mathbf{C})$. Prove that $\mathbb{E}(Y|X_1,X_2,\ldots,X_p)$ maximizes $Corr(Y,f(X_1,X_2,\ldots,X_p))$ over the space of linear functions $f$.
My attempt so far:
Proof that the conditional expectation is linear, and calculation of its value:
We omit the linearly dependent cases, since they are trivial. As a starting point, we prove that the conditional expectation is a linear combination of the $X_i$s, and calculate its coefficients. Partition $\mathbf{C}$ and use the well-known inversion identity for block matrices, using $Var Y:=\sigma^2$, $Var X_i=\sigma_i^2=c_{ii}$, $Cov(X_i,Y)=c_i$ and $Cov(X_i,X_j)=c_{ij}$ (note that $\mathbf{c}\mathbf{c}^T$ below is the rank-one outer product, not the scalar $\mathbf{c}^T\mathbf{c}$): \begin{equation} \mathbf{C}=\begin{pmatrix} \sigma^2&\mathbf{c}^T\\ \mathbf{c}&\mathbf{D}\\ \end{pmatrix}\qquad \mathbf{C}^{-1}=\begin{pmatrix} \sigma^2&\mathbf{c}^T\\ \mathbf{c}&\mathbf{D}\\ \end{pmatrix}^{-1}= \begin{pmatrix} (\sigma^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})^{-1}& -\sigma^{-2}\mathbf{c}^T(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\\ -\sigma^{-2}(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c}& (\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\\ \end{pmatrix}=:\begin{pmatrix} \alpha^2&\beta^T\\ \beta&\delta\\ \end{pmatrix} \end{equation} under the extra assumption $\det\mathbf{C}\det\mathbf{D}\neq 0$. Also note that $Var(\mathbf{a}^T\mathbf{X})=\mathbf{a}^T\mathbf{D}\mathbf{a}$. Thus the joint density function is: \begin{equation} f(y,\mathbf{x})=\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}(y,\mathbf{x}^T)\,\mathbf{C}^{-1}(y,\mathbf{x}^T)^T\right)=\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{\alpha^2y^2+2\beta^T\mathbf{x}\,y+\mathbf{x}^T\delta\mathbf{x}}{2}\right)= \end{equation} \begin{equation} =\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} We know from the notes that $\mathbb{E}(Y-g(X_1,\ldots,X_p))^2$ is minimized by: \begin{equation} \mathbb{E}(Y|X_1,\ldots, X_p)=\frac{\displaystyle \int_{-\infty}^{\infty}yf(y,\mathbf{x})\,\mathrm{d}{y}}{\displaystyle \int_{-\infty}^{\infty}f(y,\mathbf{x})\,\mathrm{d}{y}}=-\frac{\mathbf{x}^T\beta}{\alpha^2}= \sigma^{-2}(\sigma^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})\,\mathbf{x}^T(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c}=\boxed{\mathbf{x}^T\mathbf{D}^{-1}\mathbf{c}} \end{equation} where the last equality follows from the Sherman–Morrison formula. The denominator: \begin{equation} \int_{-\infty}^{\infty}\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right)\,\mathrm{d}{y}=\frac{1}{\sqrt{(2\pi)^{p}\det\mathbf{C} }\,\alpha}\exp\left(-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} The numerator is, up to the same normalization, the expectation of a normal random variable with mean $\displaystyle-\frac{\mathbf{x}^T\beta}{\alpha^2}$: \begin{equation} \int_{-\infty}^{\infty}\frac{y}{\sqrt{(2\pi)^{p+1}\det\mathbf{C}}}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right)\,\mathrm{d}{y} =-\frac{\mathbf{x}^T\beta}{\sqrt{(2\pi)^{p}\det\mathbf{C} }\,\alpha^3}\exp\left(-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} which proves that the conditional expectation is linear. It remains to minimize $\displaystyle \mathbb{E}\left(Y-b-\sum_{i=1}^{p}a_iX_i\right)^2$.
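Not part of the proof, but as a quick numerical sanity check of the block-inverse bookkeeping above, the identity $-\beta/\alpha^2=\mathbf{D}^{-1}\mathbf{c}$ can be verified with numpy on a randomly generated positive-definite $\mathbf{C}$ (the variable names below are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# Random positive-definite covariance matrix C of size (p+1) x (p+1)
A = rng.standard_normal((p + 1, p + 1))
C = A @ A.T + (p + 1) * np.eye(p + 1)

c = C[1:, 0]              # Cov(X_i, Y)
D = C[1:, 1:]             # Var(X)

Cinv = np.linalg.inv(C)
alpha2 = Cinv[0, 0]       # (sigma^2 - c^T D^{-1} c)^{-1}
beta = Cinv[1:, 0]        # -sigma^{-2} (D - sigma^{-2} c c^T)^{-1} c

# The conditional-mean coefficients -beta/alpha^2 should equal D^{-1} c
print(np.allclose(-beta / alpha2, np.linalg.solve(D, c)))  # True
```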
The density belongs to the exponential family, thus we can switch the order of integration (expectation) and differentiation: \begin{equation} \mathbb{E}\left(\frac{\partial}{\partial a_i}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=-2c_i+2\sum_{j=1}^{p}a_jc_{ij}\qquad \mathbb{E}\left(\frac{\partial}{\partial b}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2b \end{equation} \begin{equation} \mathbb{E}\left(\frac{\partial^2}{\partial a_i^2}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2c_{ii}\qquad \mathbb{E}\left(\frac{\partial^2}{\partial b^2}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2 \end{equation} \begin{equation} \mathbb{E}\left(\frac{\partial^2}{\partial a_i\,\partial a_j}\left(Y-b-\sum_{k=1}^{p}a_kX_k\right)^2\right)=2c_{ij}\qquad \mathbb{E}\left(\frac{\partial^2}{\partial a_i\,\partial b}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=0 \end{equation} So the Hessian is $2\begin{pmatrix}\mathbf{D}&0\\0&1\end{pmatrix}$, which is positive definite everywhere, meaning that the global minimum is attained where the gradient vanishes: \begin{equation} \mathbf{D}\mathbf{a}=\mathbf{c}\Rightarrow \boxed{\mathbf{a}=\mathbf{D}^{-1}\mathbf{c}\qquad b=0} \end{equation} in agreement with the boxed formula above.
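Again only as a hedge against sign errors, here is a small numpy check (with my own ad hoc names) that $\mathbf{a}=\mathbf{D}^{-1}\mathbf{c}$, $b=0$ really minimizes the quadratic $\mathbb{E}(Y-b-\mathbf{a}^T\mathbf{X})^2=\sigma^2-2\mathbf{a}^T\mathbf{c}+\mathbf{a}^T\mathbf{D}\mathbf{a}+b^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
D = A @ A.T + p * np.eye(p)                      # Var(X), positive definite
c = rng.standard_normal(p)                       # Cov(X_i, Y)
sigma2 = float(c @ np.linalg.solve(D, c)) + 1.0  # Var(Y), keeps C positive definite

def mse(a, b):
    # E(Y - b - a^T X)^2 = sigma^2 - 2 a^T c + a^T D a + b^2 (all means are zero)
    return sigma2 - 2 * a @ c + a @ D @ a + b ** 2

a_star = np.linalg.solve(D, c)
best = mse(a_star, 0.0)

# Random perturbations of (a, b) should never beat the claimed minimizer
print(all(mse(a_star + 0.1 * rng.standard_normal(p), 0.1 * rng.standard_normal()) >= best
          for _ in range(1000)))  # True
```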
Proof that it attains the maximal correlation:
We want to show that $\mathbf{a}=\mathbf{D}^{-1}\mathbf{c}$ gives a maximum in every direction $\mathbf{v}$, so take $t(\mathbf{X},u)=(\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X}$; then, using $\mathbf{D}^T=\mathbf{D}$: \begin{equation} {Corr}(Y,t(\mathbf{X},u))=\frac{Cov(Y,(\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X})}{\sqrt{Var Y\, Var((\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X})}}= \frac{(\mathbf{D}^{-1}\mathbf{c}+ u\mathbf{v})^T\mathbf{c}} {\sigma\sqrt{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{c}+2u\mathbf{v}^T\mathbf{D}\mathbf{D}^{-1}\mathbf{c}+u^2\mathbf{v}^T\mathbf{D}\mathbf{v}}}= \end{equation} \begin{equation} =\frac{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+ u\mathbf{v}^T\mathbf{c}} {\sigma\sqrt{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+ 2u\mathbf{v}^T\mathbf{c}+ u^2\mathbf{v}^T\mathbf{D}\mathbf{v}}} \end{equation} \begin{equation} \frac{\partial}{\partial u}{Corr}(Y,t(\mathbf{X},u))= \frac{u((\mathbf{c}^T\mathbf{v})^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v})} {\sigma(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+2u\mathbf{v}^T\mathbf{c}+u^2\mathbf{v}^T\mathbf{D}\mathbf{v})^{3/2}} \end{equation} With some straightforward but lengthy calculations, we get: \begin{equation} \left.\frac{\partial^2}{\partial u^2}{Corr}(Y,t(\mathbf{X},u))\right|_{u=0}=-\frac{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}-(\mathbf{c}^T\mathbf{v})^2}{\sigma(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})^{3/2}} \end{equation} So the first derivative vanishes iff $u=0$ or $(\mathbf{v}^T\mathbf{c})^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}=0$; in the latter case the correlation is constant in the direction of $\mathbf{v}$. Let $\mathbf{v}=\mathbf{v}_\perp+\lambda\mathbf{c}$, where $\mathbf{v}_\perp$ is orthogonal to $\mathbf{c}$; thus: \begin{equation} \mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}-(\mathbf{v}^T\mathbf{c})^2=\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}_\perp^T\mathbf{D}\mathbf{v}_\perp+2\lambda\,\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{c}^T\mathbf{D}\mathbf{v}_\perp+\lambda^2\left(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{c}^T\mathbf{D}\mathbf{c}-(\mathbf{c}^T\mathbf{c})^2\right) \end{equation} If $Y$ is independent of the $X_i$s (i.e. $\mathbf{c}=0$), then the correlation is $0$ for every $f$ and there is nothing to prove. If that's not the case, with scaling we can achieve $\mathbf{c}^T\mathbf{c}=1$, and by the Cauchy–Schwarz inequality $\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{c}^T\mathbf{D}\mathbf{c}\geq(\mathbf{c}^T\mathbf{c})^2$.
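For what it's worth, the closed-form correlation derived above can also be checked numerically; the following numpy sketch (names are mine) is consistent with $u=0$ being the maximum over random directions $\mathbf{v}$:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))
D = A @ A.T + p * np.eye(p)              # Var(X)
c = rng.standard_normal(p)               # Cov(X_i, Y)
q = float(c @ np.linalg.solve(D, c))     # c^T D^{-1} c
sigma = np.sqrt(q + 1.0)                 # any sigma with sigma^2 > c^T D^{-1} c

def corr(u, v):
    # Closed form of Corr(Y, (D^{-1}c + u v)^T X) derived above
    return (q + u * (v @ c)) / (sigma * np.sqrt(q + 2 * u * (v @ c) + u ** 2 * (v @ D @ v)))

best = corr(0.0, rng.standard_normal(p))  # value at u = 0 (independent of v)
print(all(corr(u, rng.standard_normal(p)) <= best + 1e-12
          for u in np.linspace(-2.0, 2.0, 81) for _ in range(20)))  # True
```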
And there I'm just stuck. Any ideas on how to continue?