
I have used the bootstrap resampling method to obtain an empirical distribution of the sample mean (not showing the values here, since there are a lot of them), but I am not sure how I would obtain the confidence intervals from this distribution.

From Wikipedia:

...Then we compute the mean of this resample and obtain the first bootstrap mean: $\mu_1^*$. We repeat this process to obtain the second resample $X_2^*$ and compute the second bootstrap mean $\mu_2^*$. If we repeat this $100$ times, then we have $\mu_1^*, \mu_2^*, \dots, \mu_{100}^*$. This represents an empirical bootstrap distribution of the sample mean. From this empirical distribution, one can derive a bootstrap confidence interval for the purpose of hypothesis testing.

How would you go about deriving the $95 \%$ confidence interval from this distribution?

I have calculated a $95 \%$ confidence interval using the $1.96$ rule (just used the number $1.96$ in the formula $\overline{X} \pm 1.96\,\frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation, so that $s/\sqrt{n}$ is the estimated standard error of the sample mean), but I am not sure how I would proceed here in this case.

According to Wikipedia there are some ways to compute this confidence interval which involve quantiles, etc., but I have no idea what these quantiles are (apart from what I have just been reading and not much understanding). It would help if you could provide as much detail as possible and assume I know really just a little (we have had roughly one lecture on confidence intervals).

Concrete examples are most appreciated. Thanks!

Najib Idrissi
  • Hint: Take a look at a histogram of your set of bootstrapped means. Ultimately you'll want to compute 2 quantiles from your set. Those quantiles will serve as the upper and lower bounds of your C.I. Don't know a thing about your original sample so consider if your number of resamples is large enough. – SWilliams Dec 16 '15 at 00:21

1 Answer


There are many different kinds of bootstrap CIs, even for the basic case of finding a CI for the population mean $\mu$ based on a sample from the population. Here is a concrete example of one of them, along with a brief rationale for the bootstrap method illustrated.

Here is a sample of size $n=20$ that pretty clearly does not come from a normal population. It decisively fails the Shapiro-Wilk normality test, and it has three far outliers as shown in the boxplot.

 x = c( 29,  30,  53,  75,  89,  34,  21,  12,  58,  84,  
        92, 117, 115, 119, 109, 115, 134, 253, 289, 287)
 shapiro.test(x)
Shapiro-Wilk normality test

data:  x
W = 0.8347, p-value = 0.002983

[Boxplot of x, showing the three far outliers]

A bootstrap CI makes no distributional assumption about the population. All that is 'known' is that the population is capable of producing the $n$ observations in the sample at hand.

In an ideal situation, we would know something about the variability of the data around $\mu.$ Specifically, we might know the distribution of $V = \bar X - \mu,$ from which we could find a 95% CI for $\mu$ by traditional means. In that case, we could find $L$ and $U$ cutting 2.5% from the lower and upper tails, respectively, of the distribution of $V$, and we could write $$P(L \le V = \bar X - \mu \le U) = P(\bar X - U \le \mu \le \bar X - L) = .95,$$ so that a 95% CI for $\mu$ would be $(\bar X - U, \bar X - L).$ However, we do not know $U$ and $L$.

Entering the 'bootstrap world', we use resampling to find approximate values of $U$ and $L$ as follows:

(1) Take $X_1, \dots, X_n$ to be the 'bootstrap population'.

(2) The mean of this population is $\mu^* = \bar X.$

(3) Simulate many values $V^*$ of $V$ by resampling: Repeatedly select a sample of size $n$ with replacement from $X_1, \dots, X_n$. For each resample, find the mean $\bar X^*$ and then $V^* = \bar X^* - \mu^*.$ Below we use $B = 100,000$ resamples of size $n = 20.$

(4) Find cutoff points $L^*$ and $U^*$ of the resample distribution $V^*.$

Then, back in the 'real' world use these proxy cutoff values to make the bootstrap CI $(\bar X - U^*, \bar X - L^*)$ for $\mu.$ Notice that $\bar X$ plays two roles here: first, as the population mean $\mu^*$ of the 'population' from which we resample; second, as itself (the mean of the original sample).

In R, this procedure can be programmed as follows, where .re (for resample) represents the stars $*$ in the notation above.

 x = c( 29,  30,  53,  75,  89,  34,  21,  12,  58,  84,  
        92, 117, 115, 119, 109, 115, 134, 253, 289, 287)
 x.bar = mean(x);  n = length(x)          # observed sample mean and sample size
 B = 100000;  x.bar.re = numeric(B)       # storage for B resample means
 for (i in 1:B) {
    x.bar.re[i] = mean(sample(x, n, replace=TRUE))  }   # mean of each resample
 L.re = quantile(x.bar.re - x.bar, .025)  # lower 2.5% cutoff of V* = x.bar.re - x.bar
 U.re = quantile(x.bar.re - x.bar, .975)  # upper 2.5% cutoff of V*
 c(x.bar - U.re, x.bar - L.re)            # 95% bootstrap CI for mu
 ##  97.5%   2.5% 
 ##  68.25 138.30 

Thus a 95% CI for $\mu$ is $(68.25, 138.30).$ Because this is a random process, the result will be slightly different on each run of the program. Differences are slight for $B$ as large as $B = 100,000,$ as here. Two subsequent runs gave $(68.7, 138.3)$ and $(68.35, 138.45).$
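Not part of the original answer, but if exactly reproducible numbers are wanted, one option is to fix the random seed before resampling. Here is a brief sketch (the seed value 2020 is arbitrary, chosen only for illustration), using `replicate` in place of the explicit loop:

 set.seed(2020)    # arbitrary seed; repeated runs now give identical results
 x.bar.re = replicate(B, mean(sample(x, n, replace=TRUE)))
 L.re = quantile(x.bar.re - x.bar, .025)
 U.re = quantile(x.bar.re - x.bar, .975)
 c(x.bar - U.re, x.bar - L.re)            # same 95% bootstrap CI on every run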

A histogram of the resample distribution $V^*$ from one run of this program is shown below.

[Histogram of the resample distribution $V^*$]

Notes: (a) A naive, and often misleading, approach is sometimes used: simply find $\bar X^*$ on each resample and cut 2.5% from each tail of the resulting distribution of these $\bar X^*$s. This approach is sometimes used for symmetrical data, but it would not work well for the current skewed sample. (b) A 95% t interval for the population mean $\mu$ is $(67.17, 144.33),$ a result that is in some doubt because of the distinctly nonnormal data. (c) A 95% CI for the population median $\eta$ based on a Wilcoxon procedure is $(64.5, 141.5).$ But the population seems skewed, so we may have $\mu > \eta.$
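For comparison, the three intervals in these Notes could be obtained roughly as follows in R. This is only a sketch: the vector x.bar.re comes from the resampling code above (so the bootstrap line varies slightly from run to run), while t.test and wilcox.test are the standard R functions.

 quantile(x.bar.re, c(.025, .975))         # (a) naive percentile bootstrap CI
 t.test(x)$conf.int                        # (b) 95% t interval for the mean
 wilcox.test(x, conf.int=TRUE)$conf.int    # (c) 95% Wilcoxon CI (pseudomedian)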

BruceET
  • @The_Anomaly. Thanks for catching typos. – BruceET Jun 15 '20 at 14:26
  • Thanks for the excellent post! I found it very helpful. But I struggle to see why the "naive" approach described as (a) under Notes is misleading. Is it because the mean of the resampled distribution is not necessarily the mean of the original sample (i.e., for skewed data)? – The_Anomaly Jun 15 '20 at 14:31
  • Of course, there is no guarantee that the center of a bootstrap CI will be the best point estimate of the parameter, but if data are heavily skewed the point estimate may be near one end of the interval. Sometimes, to some people, this can be a misleading outcome. There is a convincing argument for the correctness of the CI I recommended, but the naive bootstrap has no such argument. // There are people who think the naive bootstrap is OK and they may disagree with these objections. – BruceET Jun 15 '20 at 14:43