
Our goal is to predict the value $y$, given a vector of values $(x_1, \dots, x_n)$ for some $n\geq 1$.

The binary random variable $Y$ is distributed uniformly on the set $\{0, 1\}$, so $P(Y = 1) = P(Y = 0) = 1/2$.

Each random variable $X_i$ is also binary and takes values from the set $\{0, 1\}$. Its conditional distribution given $Y$ is determined by the parameter $p_i = P(X_i = y\mid Y = y)$, and the $X_i$ are assumed to be conditionally independent given $Y$.

We can assume $$p_1 \geq p_2 \geq \dots \geq p_n\geq 1/2\text{.}$$

Our examples in the data are generated as follows:

  • $y$ is drawn according to the distribution of $Y$
  • $x_i$ is drawn according to the distribution of $X_i$ conditional on $Y = y$

However, the value $y$ is hidden from us; we only know the values $x_i$. Given these values, we predict the value $\hat{y}(x_1, \dots, x_n)$.
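For later reference, here is a small Python sketch of this generative process (a sketch of my own; the names `sample` and `rng` are not from the question). It can be used to estimate the accuracy of any predictor by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(p, size=1):
    """Draw (y, x) pairs: y ~ Uniform{0, 1}, and x_i = y with probability p_i, else 1 - y."""
    p = np.asarray(p)
    y = rng.integers(0, 2, size=size)
    agree = rng.random((size, len(p))) < p        # does x_i agree with y?
    x = np.where(agree, y[:, None], 1 - y[:, None])
    return y, x

# Example: estimate the accuracy of the naive rule y_hat = x_1; it should be close to p_1 = 0.8.
y, x = sample([0.8, 0.7, 0.7], size=100_000)
print(np.mean(x[:, 0] == y))
```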

What is the maximal expected accuracy (over the possible inputs $(x_1, \dots , x_n)$), i.e., $$\max_{\hat{y}}\;E_{X_1, \dots,X_n}P_Y[\hat{y}(X_1,\dots ,X_n) =Y]?$$

I managed to solve this for $n = 1$ and $n = 2$. The key observation is the following: $$ P(Y = y\mid \vec{X} = \vec{x}) = \frac{P(Y = y)}{P(\vec{X} = \vec{x})}P(\vec{X} =\vec{x}\mid Y = y) = \frac{P(Y = y)}{P(\vec{X} = \vec{x})}\prod_i P(X_i = x_i \mid Y = y )\text{,} $$ where $\vec{X} = (X_1, \dots, X_n)$ and $\vec{x} = (x_1, \dots, x_n)$. Since $P(Y = y) = 1/2$ and $P(\vec{X} = \vec{x})$ does not depend on $y$, we will maximize the probability $P(Y = y\mid \vec{X} = \vec{x})$ if we maximize the value of $$ \prod_i P(X_i = x_i \mid Y = y) = \left(\prod_{i:\; x_i = y} p_i\right)\left(\prod_{i:\; x_i \neq y} (1 - p_i)\right) \quad (*) $$
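For concreteness, this MAP rule can be written as a short Python function (a sketch of my own; the name `map_predict` and the tie-break towards $x_1$ are my choices, not part of the question):

```python
import numpy as np

def map_predict(x, p):
    """Return the y in {0, 1} maximizing (*), i.e. the MAP prediction; ties go to y = x[0]."""
    x, p = np.asarray(x), np.asarray(p)
    # log P(X = x | Y = 1): factor p_i where x_i = 1 and (1 - p_i) where x_i = 0
    ll1 = np.sum(np.where(x == 1, np.log(p), np.log(1 - p)))
    # log P(X = x | Y = 0): the mirror image
    ll0 = np.sum(np.where(x == 0, np.log(p), np.log(1 - p)))
    if ll1 == ll0:
        return int(x[0])
    return int(ll1 > ll0)
```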

Solutions:

$n = 1$:

The expression $(*)$ has value $p_1$ for $y = x_1$ and $1-p_1$ for $y = 1 - x_1$. Since $p_1\geq 1 - p_1$ by our assumption, it is optimal to predict $\hat{y}(x_1) = x_1$ and the answer to the question is $p_1$.

$n = 2$:

Now we give the values of $(*)$ in the following table: $$\begin{array}{c|c|c|c} x_1 & x_2 & y = 0 & y = 1 \\ \hline 1 & 1 & (1-p_1)(1-p_2) & \color{red}{p_1 p_2} \\ 1 & 0 &(1-p_1)p_2 & \color{red}{p_1(1-p_2)}\\ 0 & 1 & \color{red}{p_1(1-p_2)}& (1 - p_1)p_2 \\ 0 & 0 & \color{red}{p_1 p_2} &(1-p_1)(1-p_2) \\ \end{array}$$ By taking into account $p_1\geq p_2 \geq 1/2$, we can show that the value of $y$ corresponding to the red-coloured $(*)$-value is the optimal one. Again, we see that $\hat{y}(x_1, x_2) = x_1$, and the answer to the question is again $p_1$.
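As a quick sanity check (my own sketch, not part of the original post): the exact accuracy for $n = 2$ is half the sum of the red (row-wise maximal) $(*)$-values, and it indeed equals $p_1$ for any $p_1 \geq p_2 \geq 1/2$:

```python
import numpy as np

def bayes_accuracy_n2(p1, p2):
    """Exact accuracy for n = 2: half the sum of the row-wise maxima of (*) over the four inputs."""
    rows = [
        max((1 - p1) * (1 - p2), p1 * p2),   # x = (1, 1)
        max((1 - p1) * p2, p1 * (1 - p2)),   # x = (1, 0)
        max(p1 * (1 - p2), (1 - p1) * p2),   # x = (0, 1)
        max(p1 * p2, (1 - p1) * (1 - p2)),   # x = (0, 0)
    ]
    return 0.5 * sum(rows)

# Check on random parameters with p1 >= p2 >= 1/2: the result always equals p1.
rng = np.random.default_rng(0)
for _ in range(5):
    p2, p1 = np.sort(rng.uniform(0.5, 1.0, size=2))
    assert np.isclose(bayes_accuracy_n2(p1, p2), p1)
```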

Problems:

E.g., $n = 3$:

Now, the table would have eight options for $\vec{x}$. Rows such as $\vec{x} = (1, 1, 1)$ are easy, since $(1 - p_1)(1 - p_2)(1 - p_3)\leq p_1 p_2 p_3$ because $1 - p_i \leq p_i$. However, for an input vector such as $\vec{x} = (1, 0, 0)$, where $x_1$ disagrees with the other coordinates, the optimal value of $y$ is not clear: whether $p_1(1-p_2)(1-p_3)$ or $(1-p_1)p_2 p_3$ is larger depends on the actual values of the $p_i$.

Thoughts:

Is it always optimal to predict $\hat{y}(x_1, \dots, x_n) = x_1$? By doing so, and proving that $P(X_1 = Y) = p_1$ (I can show that), we can assert that the answer to the question is at least $p_1$. Maybe there is a more elegant approach than computing the red values in the table.


EDIT:

Is it always optimal to predict $\hat{y}(x_1, \dots, x_n) = x_1$?

When $n\geq 3$: not necessarily. With $p_1 = 0.8$ and $p_2 = p_3 = 0.7$, we can predict the right value of $y$ with probability $\doteq 0.825$ (the result is based on simulation).
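A brute-force check of this (my own sketch): for small $n$ one can enumerate all $2^n$ inputs and compute the exact optimal accuracy $\frac{1}{2}\sum_{\vec{x}}\max_y P(\vec{X}=\vec{x}\mid Y=y)$. For $p = (0.8, 0.7, 0.7)$ this gives approximately $0.826$, consistent with the simulated value, and $\vec{x} = (1, 0, 0)$ is an input where the optimal prediction differs from $x_1$:

```python
from itertools import product
import numpy as np

def bayes_accuracy(p):
    """Exact optimal accuracy: (1/2) * sum over x of max_y P(X = x | Y = y)."""
    p = np.asarray(p, dtype=float)
    total = 0.0
    for x in product([0, 1], repeat=len(p)):
        x = np.asarray(x)
        like1 = np.prod(np.where(x == 1, p, 1 - p))  # P(X = x | Y = 1)
        like0 = np.prod(np.where(x == 0, p, 1 - p))  # P(X = x | Y = 0)
        total += max(like0, like1)
    return 0.5 * total

print(bayes_accuracy([0.8, 0.7, 0.7]))               # -> approximately 0.826
# For x = (1, 0, 0): P(x | Y = 1) = 0.8*0.3*0.3 = 0.072 < 0.098 = 0.2*0.7*0.7 = P(x | Y = 0),
# so the optimal prediction there is y = 0, not x_1 = 1.
```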


1 Answer


As you state correctly, you can choose $\hat{y}$ to be \begin{align} \hat{y} = \arg\max_y~\log p(y|X) \end{align} So if we define $S_0=\{i \mid x_i = 0\}$ and $S_1 = \{i \mid x_i = 1\}$, then, up to an additive constant that does not depend on $y$, \begin{align} \log p(y=1 \mid X) = \sum_{i \in S_1} \log p_i + \sum_{i\in S_0} \log(1-p_i)\\ \log p(y=0 \mid X) = \sum_{i \in S_0} \log p_i + \sum_{i\in S_1} \log(1-p_i) \end{align} So if we define $\gamma = \sum_{i\in S_1} \log\frac{p_i}{1-p_i} - \sum_{i\in S_0} \log\frac{p_i}{1-p_i}$, the decision rule is: if $\gamma>0$, then $\hat{y} = 1$; otherwise, $\hat{y}=0$. I do not see why this decision should always be equal to $x_1$ (this seems to be true only in the case of $n=1,2$, where the other observations cannot outweigh $x_1$).
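A minimal Python sketch of this decision rule (the function name is mine), together with an example where the decision differs from $x_1$:

```python
import numpy as np

def predict_gamma(x, p):
    """Decide y_hat from the sign of gamma, the difference of log-odds sums over S_1 and S_0."""
    x = np.asarray(x)
    w = np.log(np.asarray(p) / (1 - np.asarray(p)))  # log-odds weights w_i >= 0
    gamma = np.sum(np.where(x == 1, w, -w))
    return 1 if gamma > 0 else 0

# With p = (0.8, 0.7, 0.7) and x = (1, 0, 0): gamma = log 4 - 2 log(7/3) < 0, so y_hat = 0 != x_1.
print(predict_gamma([1, 0, 0], [0.8, 0.7, 0.7]))  # -> 0
```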

--

Now let us try to find $E[P(\hat{y}=Y)]$ for our optimal $\hat{y}$. As the problem is symmetric for $Y=0$ and $Y=1$, we can see that what we are looking for is actually equal to \begin{align} P(\gamma>0\mid Y=1). \end{align} Now define $w_i = \log\frac{p_i}{1-p_i}\geq 0$ and $W = \sum_{i}w_i$. With these definitions, given $Y=1$ we have $\gamma>0$ iff $\sum_i x_i w_i > \frac{W}{2}$. So the probability that we are looking for comes from the distribution of a weighted sum of Bernoulli random variables $x_i$. I am not sure if there is a closed-form expression for such a distribution, but you can look at this question and its answers: Distribution of weighted sum of Bernoulli RVs
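For small $n$, this probability can at least be computed by enumerating the $2^n$ outcomes of the weighted sum (again a sketch of my own, with ties counted at half weight, which matches an unbiased tie-break):

```python
from itertools import product
import numpy as np

def accuracy_from_weights(p):
    """P(sum_i x_i w_i > W/2 | Y = 1), enumerating the weighted sum of Bernoulli variables."""
    p = np.asarray(p, dtype=float)
    w = np.log(p / (1 - p))
    W = w.sum()
    prob = 0.0
    for x in product([0, 1], repeat=len(p)):
        x = np.asarray(x)
        px = np.prod(np.where(x == 1, p, 1 - p))  # P(X = x | Y = 1)
        s = np.dot(x, w)
        if s > W / 2:
            prob += px
        elif s == W / 2:
            prob += 0.5 * px                      # tie: either decision is right half the time
    return prob

print(accuracy_from_weights([0.8, 0.7, 0.7]))     # -> approximately 0.826
```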

  • You are right. Logarithms show everything more clearly: if $p_1 > p_2 = \cdots = p_n > 1/2$, then $\log\frac{p_i}{1 - p_i} > 0$ and, for $n$ big enough, the other $p_i$ will prevail. However, that is not actually the main question: how predictable is $Y$? Probably, the answer can be bigger than $p_1$. I will run some numeric experiments and see. Nevertheless, +1 for your effort. – Antoine Jan 24 '18 at 09:38
  • With $p_1 = 0.8$ and $p_2 = p_3 = 0.7$, we can predict the right value of $y$ with probability $\doteq 0.825$. – Antoine Jan 24 '18 at 13:03
  • It is trivial that you can get better results than $p_1$. Just consider the case where all the $p_i > \frac{1}{2}$ are equal. Then the optimal decision becomes a majority vote rather than relying on a single observation. You can easily see that the majority vote gives a better result than $p_1$ (as it improves with the number of observations, and it is optimal). – Maziar Sanjabi Jan 24 '18 at 19:30
  • Well, I would not say trivial, but easy to show. However, that still does not answer the question ... – Antoine Feb 02 '18 at 12:58
  • Then, could you please elaborate what the question is? – Maziar Sanjabi Feb 02 '18 at 20:46
  • It is already outlined. So ... What is the maximal expected accuracy (over the possible inputs $(x_1, \dots , x_n)$), i.e., $\max_{\hat{y}}\;E_{X_1, \dots,X_n}P_Y[\hat{y}(X_1,\dots ,X_n) =Y]?$ – Antoine Feb 06 '18 at 09:35