Finding a Correlation between Bernoulli Variables?

Question

Let X and Y be Bernoulli random variables. We don't assume independence or identical distribution, but we do assume that all 4 of the following probabilities are nonzero.

Let a := P[X = 1, Y = 1], b := P[X = 1, Y = 0], c := P[X = 0, Y = 1], and d := P[X = 0, Y = 0].

How do I obtain a formula for the correlation between the random variables X and Y?

score 14 · Answer 1 · answered Nov 12 '15 at 02:01

Stefan Hansen's hint is a good one. Here is the complete derivation:

$${\rm E}[X]=a+b=p$$

$${\rm E}[Y]=a+c=q$$

\begin{align} \mathrm{Var}(X) & ={\rm E}[(X - {\rm E}[X])^2] \\ & = {\rm E}[(X - p)^2] \\ &= p(1-p)^2 + (1-p)(-p)^2 \\ & = p (1-2p+p^2) + p^2 - p^3 \\ & = p - 2p^2 + p^3+p^2-p^3 \\ & = p - p^2 \\ & = p(1-p) \end{align}

$$\sigma_X =\sqrt{\mathrm{Var}(X)} = \sqrt{p(1-p)} = \sqrt{(a+b)(1-(a+b))}$$

$$\sigma_Y =\sqrt{\mathrm{Var}(Y)} = \sqrt{q(1-q)} = \sqrt{(a+c)(1-(a+c))}$$

\begin{align} \mathrm{Cov}(X, Y) &= {\rm E}[XY] - {\rm E}[X]{\rm E}[Y] \\ &=a - pq \\ &=a - (a+b)(a+c) \end{align}

Finally, by substitution into the equation for $\rho_{XY}$:

\begin{align} \rho_{XY}&=\frac{\mathrm{Cov}(X,Y)}{\sigma_{X}\sigma_{Y}} \\ &=\frac{a - (a+b)(a+c)}{\sqrt{(a+b)(1-(a+b))}\sqrt{(a+c)(1-(a+c))}} \\ &=\frac{a - (a+b)(a+c)}{\sqrt{(a+b)(1-(a+b))(a+c)(1-(a+c))}} \end{align}
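As a quick numerical check, the final formula can be sketched in Python. The function name `bernoulli_correlation` and the example probabilities are illustrative choices, not part of the original answer.

```python
import math

def bernoulli_correlation(a, b, c, d):
    """Pearson correlation of X and Y from their joint probabilities.

    a = P[X=1, Y=1], b = P[X=1, Y=0], c = P[X=0, Y=1], d = P[X=0, Y=0].
    """
    if not math.isclose(a + b + c + d, 1.0):
        raise ValueError("joint probabilities must sum to 1")
    p = a + b        # E[X] = P[X=1]
    q = a + c        # E[Y] = P[Y=1]
    cov = a - p * q  # E[XY] - E[X]E[Y]; XY = 1 only when X = Y = 1
    return cov / math.sqrt(p * (1 - p) * q * (1 - q))

# Illustrative values only: (a, b, c, d) = (0.4, 0.1, 0.2, 0.3)
rho = bernoulli_correlation(0.4, 0.1, 0.2, 0.3)
print(rho)
```

The square root is nonzero here because all four probabilities are assumed to be nonzero, so both $p$ and $q$ lie strictly between 0 and 1.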

Note that the numerator simplifies to the determinant ad-bc. — RDBury, Mar 05 '20 at 04:57
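
To spell out the simplification in this comment, substitute $d = 1-a-b-c$:

\begin{align} a-(a+b)(a+c) &= a - a^2 - ab - ac - bc \\ &= a(1-a-b-c) - bc \\ &= ad - bc \end{align}

Since $1-(a+b)=c+d$ and $1-(a+c)=b+d$, the correlation can also be written as

$$\rho_{XY}=\frac{ad-bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$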

score 5 · Answer 2 · answered Dec 17 '13 at 14:23

Hint: The correlation is defined in terms of $\mathrm{Cov}(X,Y)$, $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ which can be computed if we know the following quantities $$ {\rm E}[X],{\rm E}[X^2],{\rm E}[Y],{\rm E}[Y^2],{\rm E}[XY]. $$ These should be straightforward to find. For instance, $$ {\rm E}[X^2]={\rm E}[X]=P(X=1)=a+b. $$
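The five quantities in the hint can be computed directly from the joint probabilities. The following is a minimal sketch; the helper `expectation` and the numeric values are hypothetical, chosen only for illustration.

```python
def expectation(joint, f):
    """E[f(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())

# Illustrative joint probabilities: (a, b, c, d) = (0.4, 0.1, 0.2, 0.3)
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.3}

e_x = expectation(joint, lambda x, y: x)
e_x2 = expectation(joint, lambda x, y: x * x)  # equals E[X] because X is 0 or 1
e_y = expectation(joint, lambda x, y: y)
e_y2 = expectation(joint, lambda x, y: y * y)  # equals E[Y] for the same reason
e_xy = expectation(joint, lambda x, y: x * y)  # equals a = P[X=1, Y=1]

var_x = e_x2 - e_x ** 2
var_y = e_y2 - e_y ** 2
cov_xy = e_xy - e_x * e_y
rho = cov_xy / (var_x * var_y) ** 0.5
print(e_x, e_x2, e_y, e_y2, e_xy, rho)
```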

score 4 · Answer 3 · edited Sep 25 '15 at 04:21

Presumably you're talking about the Pearson correlation coefficient. It's defined in terms of the covariance and the standard deviations. These in turn are defined in terms of expected values, which are defined in terms of probabilities. For example, $E[X] = 1 \cdot P(X=1) + 0 \cdot P(X=0) = a + b$.

answered Dec 17 '13 at 14:23 by Robert Israel · edited Sep 25 '15 at 04:21 by pjvandehaar

score 1 · Answer 4 · edited Mar 27 '20 at 22:51

The answer above can be adapted to the case where $\rho$ is a user-selected value and $x$ and $y$ share a known mean, $E(x)=E(y)=\mu$, and a known variance, $V(x)=V(y)=\sigma^2$. In this case:

$E(x)=a+b=E(y)=a+c=\mu$, therefore $b=c$ and $(a+b)\cdot(a+c)=\mu^2$ (note $0<\mu<1$)

The values $a$, $b$, $c$, and $d$ are joint probabilities, so:

$a+b+c+d=1$ by definition, and hence $d= 1-(a+b+c)$

Using the variance of a Bernoulli random variable and setting $V(x)=V(y)=\sigma^2$, the equation below follows:

$V(x)=(a+b)\cdot[1-(a+b)]=V(y)=(a+c)\cdot[1-(a+c)]=\sigma^2=\mu(1-\mu)$

Then, using the equations for $\rho_{XY}$ and $\mathrm{Cov}(X,Y)$ derived in the answer above, with $\mathrm{Cov}(x,y)=a-\mu^2$ and $\sigma_x\sigma_y=\sigma^2=\mu(1-\mu)$, $\rho$ can be written in terms of $a$ and $\mu$ alone:

$\rho= \frac{a - \mu^2}{\mu(1-\mu)}$

Solving for $a$, we see that all four values are determined by $\rho$ and $\mu$ (since $\sigma^2$ is itself fixed by $\mu$):

\begin{align} a &= \mu^2 + \rho\,\mu(1-\mu) \\ b &= \mu - a \\ c &= b \\ d &= 1-(a+b+c) \end{align}
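The solution for $a$, $b$, $c$, and $d$ can be sketched in Python as below. The function name `joint_probabilities` and the range check are additions for illustration: not every pair $(\mu, \rho)$ yields valid probabilities, so the sketch rejects combinations that would make any of the four values negative.

```python
def joint_probabilities(mu, rho):
    """Joint probabilities (a, b, c, d) for two Bernoulli variables with
    common mean mu and correlation rho.

    Raises ValueError if the requested rho is not attainable for this mu.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    a = mu ** 2 + rho * mu * (1 - mu)
    b = mu - a
    c = b
    d = 1 - (a + b + c)
    if min(a, b, c, d) < 0:
        raise ValueError("this rho is not attainable for the given mu")
    return a, b, c, d

# Illustrative values only: mu = 0.5, rho = 0.4
a, b, c, d = joint_probabilities(0.5, 0.4)
print(a, b, c, d)
```

For example, a strongly negative $\rho$ combined with a small $\mu$ would require $a<0$, which the check above catches.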

Two correlated Bernoulli random variables can be simulated by first defining a discrete random variable $Z$ with four values: $3,2,1,0$ with the respective probabilities of $a,b,c,d$ corresponding to the four joint probability states. Then,

$x=1$ when $Z=3$ or $2$, $x=0$ otherwise

$y=1$ when $Z=3$ or $1$, $y=0$ otherwise
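The sampling scheme above can be sketched with Python's standard library. This is not the Excel setup used in the answer; the function names, the seed, the sample size, and the example probabilities are all assumptions made for illustration.

```python
import random

def simulate(a, b, c, d, n, seed=0):
    """Draw n pairs (x, y) by sampling Z in {3, 2, 1, 0} with probabilities a, b, c, d."""
    rng = random.Random(seed)
    zs = rng.choices([3, 2, 1, 0], weights=[a, b, c, d], k=n)
    xs = [1 if z in (3, 2) else 0 for z in zs]  # x = 1 when Z = 3 or 2
    ys = [1 if z in (3, 1) else 0 for z in zs]  # y = 1 when Z = 3 or 1
    return xs, ys

def sample_correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Illustrative values: (a, b, c, d) = (0.35, 0.15, 0.15, 0.35),
# which corresponds to mu = 0.5 and rho = 0.4.
xs, ys = simulate(0.35, 0.15, 0.15, 0.35, n=50_000)
r = sample_correlation(xs, ys)
print(r)
```

The printed value should land near the target correlation of 0.4, with the usual sampling noise.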

A Monte Carlo simulation with an Excel add-in then demonstrates that the $x$ and $y$ variables are correlated as specified by $\rho$, given the known values of $\mu$ and $\sigma$. The Excel CORREL function, applied to $5000$ samples from the simulation, verifies this. As a cross-check, when $\rho=0$ you can quickly show that the computed values of $a$, $b$, $c$, and $d$ are those of independent random variables $x$ and $y$.
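
The $\rho=0$ cross-check can be written out explicitly, using $a=\mu^2+\rho\,\mu(1-\mu)$ from the derivation above:

\begin{align} a &= \mu^2 = P[x=1]\,P[y=1] \\ b = c &= \mu - \mu^2 = \mu(1-\mu) \\ d &= 1 - \mu^2 - 2\mu(1-\mu) = (1-\mu)^2 = P[x=0]\,P[y=0] \end{align}

Each joint probability is the product of the corresponding marginals, which is exactly the independent case.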

Thanks to all of the original contributors for the work that prompted the above derivation.

Could you please format this in mathjax? It would be taken a lot more seriously by a lot of users, if you did. — amWhy, Mar 27 '20 at 21:58
Believe that the webmaster for Stack Exchange did that in the last "edit" submitted. I am not a MathJax user. One clarification to the original write-up: in the second sentence it is the expected values of x and y that are set equal, as are the variances of x and y. As originally worded that could be confused. — RayK, Mar 28 '20 at 23:20
