
A survey samples $1000$ people, of whom $500$ say they will vote for $A$, $400$ for $B$ and $100$ for $C$. Calculate a confidence interval for the proportion of people who will vote for $A$. What are the major weaknesses of the confidence interval you just calculated?

Let $A_n \sim \text{Binom}(n, p)$ be the number of respondents who vote for $A$, and let $\bar A_n = A_n / n$ be the sample proportion. By the CLT, $$\sqrt n (\bar A_n - p) \xrightarrow{d} \mathcal N(0, p(1 - p))$$ hence $$\frac{\sqrt n}{\sqrt{p(1 - p)}}(\bar A_n - p) \xrightarrow{d} \mathcal N(0, 1)$$ Since $p$ is unknown, we approximate it with $\bar A_n$ in the standard error. It follows that, at approximate confidence level $1 - \alpha$, $$p \in \bar A_n \pm z_{1 - \frac{\alpha}2} \sqrt{\frac{\bar A_n (1 - \bar A_n)}{n}}$$
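To put numbers on the formula above, here is a minimal sketch in Python. The question does not fix a confidence level, so $\alpha = 0.05$ is an assumption, and the function name `wald_interval` is just an illustrative label.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, alpha=0.05):
    """Wald interval for a binomial proportion (alpha = 0.05 is an assumed default)."""
    p_hat = successes / n                      # sample proportion
    z = NormalDist().inv_cdf(1 - alpha / 2)    # standard normal quantile z_{1 - alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Survey data from the question: 500 of 1000 respondents say they will vote for A.
lower, upper = wald_interval(500, 1000)
print(f"{lower:.4f}, {upper:.4f}")
```

With these inputs the interval is roughly $(0.469, 0.531)$.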

Since this confidence interval relies on the CLT, it gives poor results when $n$ is small or $p$ is very close to $0$ or $1$. But this is not the case here, as $n = 1000$ and $\bar A_n = 0.5$, so I fail to see what weaknesses the question is referring to.

rubik

1 Answer


On theoretical grounds there are two reasons for suspicion that a (say 95%) Wald CI may not actually have 95% coverage probability. First, it uses the normal approximation to the binomial. Second, it estimates the standard error $\sqrt{p(1-p)/n}$ by (in your notation) $\sqrt{\bar A_n(1-\bar A_n)/n}.$

An immediately obvious undesirable consequence is that the Wald CI degenerates to a point at $0$ or $1$ if the observed proportion of Successes is $0$ or $1,$ respectively.
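To see the degeneracy directly, substitute the boundary values into the estimated standard error (same notation as in the question): $$\bar A_n = 0 \ \text{ or } \ \bar A_n = 1 \quad\Longrightarrow\quad \sqrt{\frac{\bar A_n(1 - \bar A_n)}{n}} = 0 \quad\Longrightarrow\quad \bar A_n \pm z_{1-\frac{\alpha}{2}} \cdot 0 = \{\bar A_n\}$$ so the "interval" is the single point $0$ or $1$ for every $n$ and every confidence level, even though $p$ need not be exactly $0$ or $1$.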

Perhaps more seriously, the actual coverage probability of the Wald interval differs greatly from one value of $p$ to another, and that coverage probability is often below the intended $100(1-\alpha)$ percent. An easily accessible article discussing this point is Brown, Cai, and DasGupta (2001) in Statistical Science.
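The coverage probability for a given $n$ and $p$ can be computed exactly by summing binomial probabilities over the outcomes whose Wald interval contains $p$. The following is a minimal sketch of that calculation in Python; it is not the code from the references mentioned here, and the names `binom_pmf` and `wald_coverage` and the default $\alpha = 0.05$ are assumptions made for illustration.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p), 0 < p < 1, computed on the log scale."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1 - p))


def wald_coverage(n, p, alpha=0.05):
    """Exact probability that the Wald interval contains the true p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    coverage = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        # Add P(X = k) only when the interval built from k successes covers p.
        if p_hat - half_width <= p <= p_hat + half_width:
            coverage += binom_pmf(k, n, p)
    return coverage


# Coverage at a few true values of p, for a small and a large sample.
for n in (25, 1000):
    for p in (0.05, 0.3, 0.5):
        print(n, p, round(wald_coverage(n, p), 4))
```

Evaluating `wald_coverage` on a fine grid of $p$ values and plotting the result shows how the coverage jumps around as $p$ changes.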

Answers to two related questions on this site explore various points more deeply: one discusses degenerate intervals and theory, and the other shows graphs of coverage probabilities. Chapter 1 of Suess (2010) gives code, with explanations, for graphs of coverage probabilities.

The graphs shown in the second link are for $n = 25.$ You have used $n=1000$ for your example. The Wald interval is based on asymptotic theory and so works better for a thousand observations than for a few dozen, but in general the weaknesses remain.

BruceET