11

Let X and Y be Bernoulli random variables. We don't assume independence or identical distribution, but we do assume that all 4 of the following probabilities are nonzero.

Let a := P[X = 1, Y = 1], b := P[X = 1, Y = 0], c := P[X = 0, Y = 1], and d := P[X = 0, Y = 0].

How do I obtain a formula for the correlation between the random variables X and Y?

4 Answers

14

Stefan Hansen's hint is a good one. Here is the complete derivation:

$${\rm E}[X]=a+b=p$$
$${\rm E}[Y]=a+c=q$$

\begin{align}
\mathrm{Var}(X) &= {\rm E}[(X - {\rm E}[X])^2] \\
&= {\rm E}[(X - p)^2] \\
&= p(1-p)^2 + (1-p)(-p)^2 \\
&= p(1-2p+p^2) + p^2 - p^3 \\
&= p - 2p^2 + p^3 + p^2 - p^3 \\
&= p - p^2 \\
&= p(1-p)
\end{align}

$$\sigma_X = \sqrt{\mathrm{Var}(X)} = \sqrt{p(1-p)} = \sqrt{(a+b)(1-(a+b))}$$
$$\sigma_Y = \sqrt{\mathrm{Var}(Y)} = \sqrt{q(1-q)} = \sqrt{(a+c)(1-(a+c))}$$

\begin{align}
\mathrm{Cov}(X, Y) &= {\rm E}[XY] - {\rm E}[X]\,{\rm E}[Y] \\
&= a - pq \\
&= a - (a+b)(a+c)
\end{align}

Finally, by substitution into the equation for $\rho_{XY}$:

\begin{align}
\rho_{XY} &= \frac{\mathrm{Cov}(X,Y)}{\sigma_{X}\sigma_{Y}} \\
&= \frac{a - (a+b)(a+c)}{\sqrt{(a+b)(1-(a+b))}\sqrt{(a+c)(1-(a+c))}} \\
&= \frac{a - (a+b)(a+c)}{\sqrt{(a+b)(1-(a+b))(a+c)(1-(a+c))}}
\end{align}
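As a quick numerical check of the final formula, here is a minimal Python sketch. The function name `bernoulli_correlation` and the input validation are my own additions for illustration, not part of the derivation.

```python
import math


def bernoulli_correlation(a, b, c, d):
    """Pearson correlation of two Bernoulli variables X and Y.

    a = P[X=1, Y=1], b = P[X=1, Y=0], c = P[X=0, Y=1], d = P[X=0, Y=0].
    (Hypothetical helper name; the validation below is an assumption
    mirroring the question's requirement that all four be nonzero.)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four joint probabilities must be positive")
    if not math.isclose(a + b + c + d, 1.0):
        raise ValueError("joint probabilities must sum to 1")

    p = a + b          # E[X] = P[X=1]
    q = a + c          # E[Y] = P[Y=1]
    cov = a - p * q    # Cov(X, Y) = E[XY] - E[X]E[Y]
    return cov / math.sqrt(p * (1 - p) * q * (1 - q))


if __name__ == "__main__":
    # Independent fair coins: the correlation should vanish.
    print(bernoulli_correlation(0.25, 0.25, 0.25, 0.25))
    # Positively dependent case: p = q = 0.5, Cov = 0.4 - 0.25 = 0.15.
    print(bernoulli_correlation(0.4, 0.1, 0.1, 0.4))
```

With $a=d=0.4$ and $b=c=0.1$ the formula gives $0.15/0.25=0.6$, up to floating-point rounding.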

5

Hint: The correlation is defined in terms of $\mathrm{Cov}(X,Y)$, $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ which can be computed if we know the following quantities $$ {\rm E}[X],{\rm E}[X^2],{\rm E}[Y],{\rm E}[Y^2],{\rm E}[XY]. $$ These should be straightforward to find. For instance, $$ {\rm E}[X^2]={\rm E}[X]=P(X=1)=a+b. $$
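For readers who want the step the hint leaves open, the remaining moments follow the same pattern, using the question's notation. Since $Y$ takes only the values $0$ and $1$ we have $Y^2=Y$, and $XY=1$ only when both variables equal $1$, so $$ {\rm E}[Y^2]={\rm E}[Y]=P(Y=1)=a+c,\qquad {\rm E}[XY]=P(X=1,Y=1)=a. $$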

Stefan Hansen
4

Presumably you're talking about the Pearson correlation coefficient. It's defined in terms of the covariance and the standard deviations. These in turn are defined in terms of expected values, which are defined in terms of probabilities. For example, $E[X] = 1 \cdot P(X=1) + 0 \cdot P(X=0) = a + b$.

pjvandehaar
Robert Israel
1

The derivation above can be extended to the case where $\rho$ is a user-selected value, the expected values of $x$ and $y$ are equal and known, $E(x)=E(y)=\mu$, and the variances of $x$ and $y$ are equal and known, $V(x)=V(y)=\sigma^2$. In this case:

$E(x)=a+b=E(y)=a+c=\mu$, therefore $b=c$ and $(a+b)\cdot(a+c)=\mu^2$ (note $0<\mu<1$).

The values $a,b,c$, and $d$ are joint probabilities, so $a+b+c+d=1$ by definition and therefore $d=1-(a+b+c)$.

Using the variance of a Bernoulli random variable and setting $V(x)=V(y)=\sigma^2$, the equation below follows:

$V(x)=(a+b)\cdot[1-(a+b)]=V(y)=(a+c)\cdot[1-(a+c)]=\sigma^2=\mu\cdot(1-\mu)$

Then, using the equations for $\rho_{xy}$ and $\mathrm{Cov}(x,y)$ derived in the answer above, with $\sigma_x\sigma_y=\sigma^2=\mu(1-\mu)$, it is possible to express $\rho$ in terms of $a$ and $\mu$ alone:

$\rho= \frac{a - \mu^2}{\mu(1-\mu)}$

Solving for $a$, we see that all four values are determined by $\rho$ and $\mu$ (with $\sigma^2$ fixed by $\mu$):

\begin{align}
a &= \mu^2 + \rho\,\mu(1-\mu) \\
b &= \mu - a \\
c &= b \\
d &= 1-(a+b+c)
\end{align}
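The four equations translate directly into code. This is a minimal sketch: the name `joint_probabilities` is hypothetical, and the range check is my own addition, since not every $\rho$ is attainable for a given $\mu$ (each of $a,b,c,d$ must stay non-negative).

```python
def joint_probabilities(mu, rho):
    """Return (a, b, c, d) for two Bernoulli variables with common mean mu
    and correlation rho, following a = mu^2 + rho*mu*(1-mu), b = c = mu - a,
    d = 1 - (a + b + c).

    Hypothetical helper; the feasibility check is an assumption added here,
    not part of the original derivation.
    """
    if not 0 < mu < 1:
        raise ValueError("mu must lie strictly between 0 and 1")

    a = mu ** 2 + rho * mu * (1 - mu)
    b = mu - a
    c = b
    d = 1 - (a + b + c)

    # A small tolerance absorbs floating-point rounding at the boundary.
    if min(a, b, c, d) < -1e-12:
        raise ValueError("this rho is not attainable for the given mu")
    return a, b, c, d


if __name__ == "__main__":
    print(joint_probabilities(0.5, 0.6))
    print(joint_probabilities(0.3, 0.0))
```

For example, $\mu=0.5$ and $\rho=0.6$ give approximately $a=d=0.4$ and $b=c=0.1$, while $\mu=0.1$ with $\rho=-0.5$ is rejected because it would require $a<0$.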

Two correlated Bernoulli random variables can be simulated by first defining a discrete random variable $Z$ with four values: $3,2,1,0$ with the respective probabilities of $a,b,c,d$ corresponding to the four joint probability states. Then,

$x=1$ when $Z=3$ or $2$, and $x=0$ otherwise;

$y=1$ when $Z=3$ or $1$, and $y=0$ otherwise.

A Monte Carlo simulation with an Excel add-in then confirms that the $x$ and $y$ variables are correlated as specified by $\rho$ for the given $\mu$: the Excel CORREL function applied to $5000$ simulated samples verifies this. As a cross-check, setting $\rho=0$ gives $a=\mu^2$, $b=c=\mu(1-\mu)$, and $d=(1-\mu)^2$, which is exactly the case where $x$ and $y$ are independent random variables.
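The same experiment can be sketched in Python rather than Excel. This is an illustration under my own assumptions, not the original spreadsheet: the function names, the fixed seed, and the example probabilities ($a=d=0.4$, $b=c=0.1$, i.e. $\mu=0.5$ and $\rho=0.6$) are chosen only for the demonstration.

```python
import random


def simulate_pairs(a, b, c, d, n, seed=0):
    """Draw n pairs (x, y) by sampling Z in {3, 2, 1, 0} with
    probabilities a, b, c, d. (Hypothetical helper name.)"""
    rng = random.Random(seed)
    zs = rng.choices([3, 2, 1, 0], weights=[a, b, c, d], k=n)
    xs = [1 if z in (3, 2) else 0 for z in zs]  # x = 1 when Z = 3 or 2
    ys = [1 if z in (3, 1) else 0 for z in zs]  # y = 1 when Z = 3 or 1
    return xs, ys


def sample_correlation(xs, ys):
    """Sample Pearson correlation, playing the role of Excel's CORREL."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Assumed example: mu = 0.5 and target rho = 0.6.
    xs, ys = simulate_pairs(0.4, 0.1, 0.1, 0.4, n=5000, seed=1)
    print(sample_correlation(xs, ys))
```

With $5000$ samples the estimate should land near the target $\rho$, with sampling noise on the order of $0.01$.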

Thanks to all of the original contributors for the work that prompted the above derivation.

RayK
  • Could you please format this in mathjax? It would be taken a lot more seriously by a lot of users, if you did. – amWhy Mar 27 '20 at 21:58
  • Believe that the webmaster for Stack Exchange did that in the last "edit" submitted. I am not a MathJax user. One clarification to the original write-up: in the second sentence it is the expected values of x and y that are set equal, as are the variances of x and y. As originally worded that could be confused. – RayK Mar 28 '20 at 23:20