15  Bootstrap

CLT provides a useful approach to building confidence intervals and performing hypothesis testing. However, it does not always apply. Here we provide a short introduction to an alternative approach to estimating the distribution of an estimate that does not rely on CLT.

15.1 Example: median income

Suppose the income distribution of your population is as follows:

set.seed(1995)
n <- 10^6
income <- 10^(rnorm(n, log10(45000), log10(3)))
hist(income/10^3, nclass = 1000)

The population median is:

median(income)
#> [1] 44939

Suppose we don’t have access to the entire population but want to estimate the population median, denoted by \(m\). We take a random sample of 100 observations and use the sample median, \(M\), as our estimate of \(m\):

N <- 100
x <- sample(income, N)
median(x)
#> [1] 38461

The question now becomes: how can we assess the uncertainty in this estimate? In other words, how do we compute a standard error and construct a confidence interval for \(m\) based on our sample?

In the following sections, we introduce the bootstrap, a powerful resampling method that allows us to estimate variability and construct confidence intervals without relying on strong distributional assumptions.

15.2 Confidence intervals for the median

Can we construct a confidence interval? What is the distribution of \(M\)?

Because we are simulating the data, we can use a Monte Carlo simulation to learn the distribution of \(M\).

B <- 10^4
m <- replicate(B, {
  x <- sample(income, N)
  median(x)
})
hist(m, nclass = 30)
qqnorm(scale(m)); abline(0,1)

If we know this distribution, we can construct a confidence interval. The problem here is that, as we have already described, in practice we do not have access to the distribution. In previous sections we have used CLT, but what we learned applies to averages and here we are interested in the median. We can see that the 95% confidence interval based on CLT

median(x) + 1.96*sd(x)/sqrt(N)*c(-1, 1)
#> [1] 21018 55905

is quite different from the confidence interval we would generate if we know the actual distribution of \(M\):

quantile(m, c(0.025, 0.975))
#>  2.5% 97.5% 
#> 34438 59050

The bootstrap allows us to approximate a Monte Carlo simulation even when we do not have access to the full population distribution. The idea is straightforward: we treat the observed sample as if it were the population. From this sample, we draw new datasets of the same size, with replacement, and compute the statistic of interest, in this case, the median, for each resampled dataset. These resampled datasets are called bootstrap samples.

In many practical situations, the distribution of the statistics computed from bootstrap samples provides a good approximation to the sampling distribution of the original statistic. This approximation allows us to estimate variability, compute standard errors, and construct confidence intervals without knowing the true underlying distribution.

The following code demonstrates how to generate bootstrap samples and approximate the sampling distribution of the median:

B <- 10^4
m_star <- replicate(B, {
  x_star <- sample(x, N, replace = TRUE)
  median(x_star)
})

Note a confidence interval constructed with the bootstrap is much closer to one constructed with the theoretical distribution:

quantile(m_star, c(0.025, 0.975))
#>  2.5% 97.5% 
#> 30253 56909

15.3 Exercises

1. Generate a random dataset like this:

y <- rnorm(100, 0, 1)

Estimate the 75th quantile, which we know is:

qnorm(0.75)

with the sample quantile:

quantile(y, 0.75)

Run a Monte Carlo simulation to learn the expected value and standard error of this random variable.

2. In practice, we can’t run a Monte Carlo simulation because we don’t know if rnorm is being used to simulate the data. Use the bootstrap to estimate the standard error using just the initial sample y. Use 10 bootstrap samples.

3. Redo exercise 2, but with 10,000 bootstrap samples.