15

Consider the "Bootstrap Method" (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in Probability and Statistics.

As I understand it, the Bootstrap Method is a useful procedure for estimating the "sampling distribution" of some "statistic" (e.g. the mean) computed from observed data. In the Bootstrap Method:

  • First, we take a random sample of the collected data (of the same size as the data, drawn with replacement) and calculate the "statistic" from this random resample.
  • Next, we repeat the above step many times. Each repetition gives a "version" of the "statistic" corresponding to that random resample.
  • Finally, rank all these "versions" from smallest to largest - by taking the "versions" at the 5th percentile and the 95th percentile of this ranked list, you can effectively place a (roughly 90%) "Confidence Interval" on this "statistic" (see the code sketch right after this list).
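
To check that I am describing these steps correctly, here is a minimal code sketch (Python/NumPy) of my understanding of the procedure above - the simulated exponential "data", the choice of the mean as the "statistic", and the 10,000 resamples are just made-up choices for illustration:

```python
# Minimal sketch of the percentile bootstrap for the sample mean (my understanding).
# The simulated "observed data" and the number of resamples are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # pretend this is the observed sample

B = 10_000                                    # number of bootstrap resamples
n = len(data)
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=n, replace=True)   # resample with replacement
    boot_means[b] = resample.mean()                     # one "version" of the statistic

# Rank the B "versions" and take the 5th and 95th percentiles -> rough 90% interval
lo, hi = np.percentile(boot_means, [5, 95])
print(f"sample mean = {data.mean():.3f}, 90% bootstrap interval = ({lo:.3f}, {hi:.3f})")
```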

The Bootstrap Method is said to be particularly advantageous because it allegedly "works" in many otherwise difficult circumstances where a closed-form distribution for the "statistic" of interest might not be readily known or available. Our professor demonstrated that the Bootstrap Method does in fact work, and showed us some examples with randomly simulated data where the closed-form distribution for the "statistic" of interest is known - in those cases it is easy to compare the solutions generated from the Bootstrap simulations with the analytical answer. But in the back of my mind, I always play "Devil's Advocate" and wonder - how do I know that the Bootstrap didn't "just happen" to work in this example, and that in the next example we might not be as lucky?

I tried asking one of my professors exactly why the Bootstrap Method works - the professor replied that it is because of the Law of Large Numbers (https://en.wikipedia.org/wiki/Law_of_large_numbers). While this is probably true, I was hoping to find a more detailed reason as to why the Bootstrap Method works. For example, (my understanding of) the Law of Large Numbers applies in situations where you have access to the entire population and can resample that population an unlimited number of times - whereas in situations where the Bootstrap is used, you only have a (possibly imperfect) sample from the original population, and can only resample this sample. This makes me a bit unsure whether extending the Law of Large Numbers to justify the correctness of the Bootstrap is legitimate.

I found what seems to be a very informative university lecture on this subject (https://www.stat.cmu.edu/~larry/=sml/Boot.pdf), in which proofs are even provided - but I don't think my knowledge of mathematics is currently adequate to understand these proofs on my own.

I was hoping that someone here might be able to walk me through a simplified version of this proof - or provide a similar, simplified argument that demonstrates why the Bootstrap Method "works".

Thanks!

stats_noob
  • 4,107
  • I think you may get better answers at https://stats.stackexchange.com/ – angryavian Dec 19 '22 at 01:51
  • @angryavian: Thank you for your reply! I was hoping to get insights from a more mathematical perspective (compared to a statistical perspective) and thus decided to post it here. Thank you so much! – stats_noob Dec 19 '22 at 01:54
  • 4
    If you have a decent sample of data, the empirical distribution of your data is already close to the true distribution (rigorously, this is the Glivenko-Cantelli theorem https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem). Because the empirical distribution is close to the true distribution, you can estimate a large class of probabilities for the true distribution by estimating them for the empirical distribution. Bootstrapping lets you do independent sampling inside that approximate distribution to compute statistics of interest. – Chris Janjigian Dec 19 '22 at 02:58
  • @Chris Janjigian: Thank you for your reply! I will look into this! – stats_noob Dec 19 '22 at 05:33

1 Answer

11

The bootstrap works because, by resampling from the sample, you are actually sampling from the empirical distribution of the data.

To take a step back, imagine that you have a situation where the distribution of the population is known. Each of the draws from the population in the random sample can be represented as a random variable, $Y_i$, whose distribution is given by a CDF $F_Y$. We can represent a random sample as a random vector $\mathbf{Y} = (Y_1,Y_2,\ldots, Y_n)$, and so the statistic you are interested in calculating is 1) a function of this random vector and 2) itself a random variable with its own distribution. Denote this statistic $T(\mathbf{Y})$ or $T$. Attempting to derive mathematically the distribution of $T$ is likely not feasible or at least not an easy task in general. It would be much easier if, instead, you could approximate the distribution of $T$ (or the mean or the standard deviation of $T$). To accomplish this, you would draw $M$ random samples of size $n$, $\mathbf{Y}^{(m)} = (Y_1^{(m)}, \ldots, Y_n^{(m)})$, from $F_Y$ and calculate the corresponding statistic $T^{(m)} = T(\mathbf{Y}^{(m)})$ for each random sample. This way, you have $M$ replications of $T$ which you can then use to approximate various properties of the distribution of $T$ (you could compute the mean of the $T^{(m)}$, for example).
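
As a small, concrete illustration of this "known distribution" procedure, here is a minimal sketch; the Exponential population and the choice of $T$ as the sample median are assumptions made only for this example:

```python
# Minimal sketch: Monte Carlo approximation of the distribution of a statistic T
# when the population distribution F_Y is known. The Exponential population and
# the choice T = sample median are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
n, M = 50, 10_000                              # sample size and number of replications

T_reps = np.empty(M)
for m in range(M):
    Y_m = rng.exponential(scale=2.0, size=n)   # one random sample Y^(m) drawn from F_Y
    T_reps[m] = np.median(Y_m)                 # T^(m) = T(Y^(m))

# The M replications approximate properties of the distribution of T
print("approximate mean of T:", T_reps.mean())
print("approximate std of T: ", T_reps.std(ddof=1))
```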

Now, in practice, we do not actually know the true distribution of the population we are studying. What's the next best thing? The empirical distribution. Given observed data $(y_1,\ldots,y_n)$, the empirical cumulative distribution function (ECDF) of the data is given, in terms of indicator variables, as: $$\hat{F}(y) = \frac{1}{n}\sum_{j=1}^n I(y_j \leq y)$$ The ECDF is a step function which jumps to the value $k/n$, where $k$ is the number of data points less than or equal to $y$. It turns out that, assuming the population has a true distribution $F$, the strong law of large numbers implies that $$\lim_{n \to \infty} \hat{F}(y) = P(Y \leq y)= F(y) \text{ with probability 1,} $$

for all $y \in \mathbb{R}$. This is what I think your professor was alluding to. This fact also makes us comfortable believing that the empirical distribution approximates the true distribution well (given an appropriately large sample size). So, we can use the procedure I described in the second paragraph, except with $F$ replaced by $\hat{F}$. Also, it turns out that taking $n$ random draws from $\hat{F}$ is the exact same thing as drawing a random sample with replacement of size $n$ from the data!
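
To illustrate both points numerically, here is a minimal sketch; the standard normal population, the sample sizes, and the evaluation point $y = 0.5$ are arbitrary choices made only for this illustration (SciPy is used just to evaluate the true normal CDF):

```python
# Minimal sketch of the two facts above (all distributional choices are illustrative):
#   1. F_hat(y) -> F(y) as n grows, checked at a single point y, and
#   2. n draws from F_hat are the same as a size-n resample with replacement.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = 0.5

# 1. Pointwise convergence of the ECDF to the true CDF at a fixed point y
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(size=n)
    F_hat_y = np.mean(sample <= y)             # (1/n) * sum of I(y_j <= y)
    print(f"n = {n:>9,}:  F_hat(y) = {F_hat_y:.4f}   vs   F(y) = {norm.cdf(y):.4f}")

# 2. Drawing n points from F_hat (via its inverse) is resampling with replacement
data = rng.normal(size=5)
n = len(data)
sorted_data = np.sort(data)
u = rng.uniform(size=n)                        # U(0,1) draws
idx = np.floor(u * n).astype(int)              # inverse ECDF: each order statistic has prob 1/n
draws_from_F_hat = sorted_data[idx]
bootstrap_resample = rng.choice(data, size=n, replace=True)
print("n draws from F_hat:", np.round(draws_from_F_hat, 3))
print("bootstrap resample:", np.round(bootstrap_resample, 3))
# Both lines are i.i.d. uniform draws from the observed data points, i.e. from F_hat.
```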

Isaac
  • 506
  • @Isaac: Thank you so much for your answer - I will spend some time reading this and will let you know if I have any questions! – stats_noob Dec 21 '22 at 05:56
  • I posted some other questions : – stats_noob Dec 21 '22 at 05:56
  • https://math.stackexchange.com/questions/4600837/calculating-the-spreads-for-different-outcomes-in-dice-rolls – stats_noob Dec 21 '22 at 05:56
  • https://math.stackexchange.com/questions/4600425/why-is-the-fisher-information-important – stats_noob Dec 21 '22 at 05:57
  • If you have time, could you please take a look at these? Do you have any ideas about these questions? Thank you so much for all your help! – stats_noob Dec 21 '22 at 05:57
  • 3
    No problem @stats_noob ! I'll take a look at those and see if I have any ideas. Btw, you can accept an answer you like :) – Isaac Dec 21 '22 at 19:05
  • 1
    OUTSTANDING question-answer pair. What an amazing reference this is. – Sarvesh Ravichandran Iyer Dec 23 '22 at 10:12
  • I think it is probably worth editing this answer to include Glivenko-Cantelli. The pointwise almost sure convergence of the ECDF to the (true) CDF is not very convincing for this argument, and one really does need the stronger result. – Andrew Apr 17 '23 at 21:25