R8. Topics in bootstrapping

This recitation was prepared by Cecelia Andrews with help from Justin Bois. Modifications were made by Sanjana Kulkarni, Rashi Jeeda, and Grace Liu.

[2]:

import pandas as pd
import numpy as np
import numba
import scipy.stats as st
import bebi103

import iqplot
import timeit
import colorcet

import bokeh.io
import bokeh.plotting
import bokeh.layouts
bokeh.io.output_notebook()

Loading BokehJS ...

In this recitation, we will cover

Review of bootstrapping concepts from Lessons 14 and 18.
BCa Confidence Intervals
Studentized Bootstrap

An excellent reference, if dated, for bootstrapping calculations is Efron and Tibshirani’s An Introduction to the Bootstrap, 1994 (Amazon).

1. Review

What do we need to understand before thinking about bootstrapping?

The plug-in principle

This was covered in Lesson 14.

The plug-in-principle is important theory for understanding why bootstrapping is helpful.

Simplified definition:

Say we have a sample of repeated measurements, and we know there is a generative distribution for these measurements. But, we don’t know what that generative distribution is. We can estimate a parameter, \(\theta = T(F)\), of the generative distribution by computing the statistical functional using the empirical CDF rather than the actual CDF.

Thorough definition:

Say we have a data set of repeated measurements of the lengths of C. elegans eggs. We know there is a generative distribution for these lengths, but we don’t know any parameters of the distribution. How can we estimate a parameter of the distribution?

Call the CDF of our generative distribution \(F(x)\). We want to compute a statistical functional, \(T(F)\), of the CDF. \(T(F)\) could be the mean, variance, median, etc. The parameter \(\theta\) is defined by the functional, so \(\theta = T(F)\).

However, we only have a sample from the generative distribution (our egg length measurements). We can compute the ECDF of our measurements, called \(\hat{F}\). The plug-in principle simply states we can estimate our parameter of interest, \(\theta\), by calculating:

\begin{align} \hat{\theta} &= T(\hat{F}) \end{align}

Essentially, the plug-in principle says we can estimate the parameter \(\theta\) by computing the statistical functional on the empirical CDF rather than the actual CDF.

Why and how do we bootstrap?

When we are trying to learn something from data, it becomes easier to estimate accurate statistics from our results as we gather more data that is representative of our population. As we increase our number of samples, we capture more and more samples from the generative distribution and our picture of it becomes more clear. However, it is often not reasonably possible to sample an entire population to learn about something we’re interested in. In these cases, we have to settle for having only a subset of all the possible data.

However, bootstrapping can help us simulate extra runs of an experiment and act as a way to artificially obtain more samples! The basic idea of bootstrapping is to simulate many repetitions of the experiment. It is a type of resampling, in which you draw samples from your sample to generate a bootstrap distribution. For example, if you have a sample of 10 data points, you can generate more samples by randomly choosing from the original set of 10 points, with replacement.

Bootstrapping allows us to get a better plug-in estimate for the statistical functional as if we did the experiment many times.

Let’s try bootstrapping!

To see the power of bootstrapping in action, let’s consider the example given in Efron and Tibshirani’s An Introduction to the Bootstrap.

The CSV file law_school_data.csv, available here, contains the population of measurements (LSAT, GPA) for N=82 American law schools for the entering classes of 1973. LSAT is the average score of the class on the LSAT (national law exam), and GPA is the average undergraduate GPA. In this example, we have a census, so we know the actual values of the summary statistics we may want to compute. Here we load in the data and view the first 10 observations.

[3]:

df_law = pd.read_csv(os.path.join(data_path, 'law_school_data.csv'))
df_law.head(10)

[3]:

	School	LSAT	GPA
0	1	622	3.23
1	2	542	2.83
2	3	579	3.24
3	4	653	3.12
4	5	606	3.09
5	6	576	3.39
6	7	620	3.10
7	8	615	3.40
8	9	553	2.97
9	10	607	2.91

Now, let’s pretend we are a researcher interested in estimating the mean LSAT and GPA for American law schools. It would be difficult to survey every law school in the USA and collect LSAT and GAP scores of each class, so in practice, researchers collect a sample, a subset of the population, and make inferences about the population. Say we collect a random sample of 15 law schools and want to estimate the population mean from our sample. We will simulate this by drawing a random sample of size n=15:

[4]:

# Set the random seed
rg = np.random.default_rng(seed=3252)

# Draw 15 random samples and make a new dataframe
indices = rg.choice(np.arange(82), replace=False, size=15)
df_sample = df_law.loc[indices]
df_sample

[4]:

	School	LSAT	GPA
21	22	614	3.19
51	52	580	3.07
11	12	596	3.24
17	18	646	3.47
2	3	579	3.24
10	11	558	3.11
33	34	591	3.02
3	4	653	3.12
13	14	581	3.22
36	37	615	3.37
34	35	578	3.03
52	53	594	2.96
16	17	599	3.23
70	71	570	3.01
49	50	555	3.00

How does this sample compare to the population? Let’s compare the population mean LSAT and GPA to the sample means:

[5]:

# Print means
print(f'Sample mean LSAT: {np.mean(df_sample.LSAT)}')
print(f'Sample mean GPA: {np.mean(df_sample.GPA)}')
print(f'Population mean LSAT: {np.mean(df_law.LSAT)}')
print(f'Population mean GPA: {np.mean(df_law.GPA)}')

Sample mean LSAT: 593.9333333333333
Sample mean GPA: 3.1519999999999997
Population mean LSAT: 597.5487804878048
Population mean GPA: 3.1347560975609756

[6]:

p = bebi103.viz.confints(
    [
        {
            "estimate": np.mean(df_sample.LSAT),
            "conf_int": [np.mean(df_sample.LSAT), np.mean(df_sample.LSAT)],
            "label": "Sample mean LSAT",
        },
        {
            "estimate": np.mean(df_law.LSAT),
            "conf_int": [np.mean(df_law.LSAT), np.mean(df_law.LSAT)],
            "label": "Population mean LSAT",
        },
    ],
)

# Set plot range to all possible LSAT scores from the time
p.x_range = bokeh.models.Range1d(200, 800)

q = bebi103.viz.confints(
    [
        {
            "estimate": np.mean(df_sample.GPA),
            "conf_int": [np.mean(df_sample.GPA), np.mean(df_sample.GPA)],
            "label": "Sample mean GPA",
        },
        {
            "estimate": np.mean(df_law.GPA),
            "conf_int": [np.mean(df_law.GPA), np.mean(df_law.GPA)],
            "label": "Population mean GPA",
        },
    ]
)

# Set plot range to all possible GPA values
q.x_range = bokeh.models.Range1d(0, 4)

bokeh.io.show(bokeh.layouts.column(p, bokeh.models.Spacer(height=20), q))

The accuracy of the mean LSAT and GPA of the small sample will vary based on the sample you draw, but they can be fairly far away from the mean of the population such that the standard error does not capture the actual mean.

Let’s compare the ECDFs as well. To overlay the sample and population ECDFs, we will use manually compute the ECDFs of the samples and populations and overlay them on the population plots. This is easier to do since the sample and population are of different sizes.

[7]:

def compute_ecdf(data):
    """Returns ECDF values of data"""
    return np.arange(1, len(data) + 1) / len(data)


# Initialize two bokeh plots using a list comprehension
labels = ["LSAT", "GPA"]
p_LSAT, p_GPA = [
    bokeh.plotting.figure(
        height=300, width=400, x_axis_label=labels[i], y_axis_label="ECDF"
    )
    for i in range(2)
]

# Compute and plot the ECDFs for LSAT
p_LSAT.circle(
    np.sort(df_law.LSAT), compute_ecdf(df_law.LSAT), legend_label="Population"
)

p_LSAT.circle(
    np.sort(df_sample.LSAT),
    compute_ecdf(df_sample.LSAT),
    legend_label="Sample",
    color="darkorange",
    size=6,
)

# Compute and plot the ECDFs for GPA
p_GPA.circle(
    np.sort(df_law.GPA), compute_ecdf(df_law.GPA), legend_label="Population",
)

p_GPA.circle(
    np.sort(df_sample.GPA),
    compute_ecdf(df_sample.GPA),
    legend_label="Sample",
    color="darkorange",
    size=6,
)

p_LSAT.legend.location = p_GPA.legend.location = "top_left"

bokeh.io.show(bokeh.layouts.column(p_LSAT, bokeh.models.Spacer(height=20), p_GPA))

The ECDFs match up fairly well, but if we zoom in on the 50th percentile we can see the medians may not overlap, and the means are probably off a bit, too. Let’s see if we can compute a confidence interval through bootstrapping. We’ll use the function draw_bs_sample() to draw bootstrap samples of the same size as our sample (n=15), and compute the means of each to create 1000 replicates. Remember, our bootstrap samples are always the same size as the original sample because we want to simulate repeating the same experiment.

[8]:

@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


# Number of bootstrap replicates we want to draw
N = 1000

sample_lsat_mean_reps = np.empty(N)
sample_gpa_mean_reps = np.empty(N)

# Turn the sample LSAT and GPA scores into arrays for use with numba
sample_lsat = np.array(df_sample.LSAT)
sample_gpa = np.array(df_sample.GPA)

for i in range(N):
    sample_lsat_mean_reps[i] = np.mean(draw_bs_sample(sample_lsat))
    sample_gpa_mean_reps[i] = np.mean(draw_bs_sample(sample_gpa))

p_LSAT_conf = bebi103.viz.confints(
    [
        {
            "estimate": np.mean(sample_lsat),
            "conf_int": np.percentile(sample_lsat_mean_reps, [2.5, 97.5]),
            "label": "Sample mean LSAT",
        },
        {
            "estimate": np.mean(df_law.LSAT),
            "conf_int": [np.mean(df_law.LSAT), np.mean(df_law.LSAT)],
            "label": "Population mean LSAT",
        },
    ],
)
p_LSAT_conf.x_range = bokeh.models.Range1d(200, 800)

p_GPA_conf = bebi103.viz.confints(
    [
        {
            "estimate": np.mean(sample_gpa),
            "conf_int": np.percentile(sample_gpa_mean_reps, [2.5, 97.5]),
            "label": "Sample mean GPA",
        },
        {
            "estimate": np.mean(df_law.GPA),
            "conf_int": [np.mean(df_law.GPA), np.mean(df_law.GPA)],
            "label": "Population mean GPA",
        },
    ]
)
p_GPA_conf.x_range = bokeh.models.Range1d(0, 4)

bokeh.io.show(
    bokeh.layouts.column(p_LSAT_conf, bokeh.models.Spacer(height=20), p_GPA_conf)
)

The 95% confidence interval captures the true mean for both the LSAT and GPA data. Let’s compare the ECDFs again. We will create 500 bootstrap sample ECDFs and overlay them over the population ECDF. Note: if 500 samples takes too long to render on your machine, you can change the variable n to something like 100. You will still see the same overall effect.

[9]:

# Initialize two bokeh plots using a list comprehension
labels = ["LSAT", "GPA"]
p_LSAT, p_GPA = [
    bokeh.plotting.figure(
        height=300, width=400, x_axis_label=labels[i], y_axis_label="ECDF"
    )
    for i in range(2)
]

n = 500
# Make n bootstrap samples and plot them
for i in range(n):
    lsat = draw_bs_sample(sample_lsat)
    gpa = draw_bs_sample(sample_gpa)
    # Add semitransparent ECDFs to the plot
    p_LSAT.circle(
        np.sort(lsat),
        compute_ecdf(lsat),
        color=colorcet.b_glasbey_category10[0],
        alpha=0.05,
    )
    p_GPA.circle(
        np.sort(gpa),
        compute_ecdf(gpa),
        color=colorcet.b_glasbey_category10[0],
        alpha=0.05,
    )

p_LSAT.circle(
    np.sort(df_law.LSAT),
    compute_ecdf(df_law.LSAT),
    color=colorcet.b_glasbey_category10[1],
)
p_GPA.circle(
    np.sort(df_law.GPA),
    compute_ecdf(df_law.GPA),
    color=colorcet.b_glasbey_category10[1],
)

bokeh.io.show(bokeh.layouts.row(p_LSAT, p_GPA))

What do you notice about the bootstrapped ECDFs? Consider the approximate shape and mean.

The boostrapped samples are a pretty good approximation of the overall population distribution.

Now that we’ve shown the power of bootstrapping, let’s explore some more bootstrapping methods! First we’ll review pairs bootstrapping, which we learned about in class, then go into some new methods.

How can we deal with paired data?

Paired bootstrap

Also covered in Lesson 16. See the lesson for a coding example.

We use pairs bootstrapping when we have paired data, and we will often want to compute the correlation. The important thing to remember with pairs bootstrapping is that we draw a random sample of pairs, not 2 random samples of each array. For example, if we have paired data in 2 arrays: \(\mathbf{x} = [x_1, x_2,\ldots, x_n]\) and \(\mathbf{y} = [y_1, y_2,\ldots, y_n]\), we must draw the same indices from both arrays so we end up with paired samples:

\begin{align} &\mathbf{\hat{x}} = [x_i, x_j, x_k,\ldots, x_n], \text{ and}\\[1em] &\mathbf{\hat{y}} = [y_i, y_j, y_k,\ldots, y_n]. \end{align}

Question: Why do we have to keep pairs together to estimate correlation? What result will you likely get for the correlation if your pairs get separated?

Recall that

\[\text{Corr(X, Y)} = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{n\sigma_x\sigma_y}\]

Let’s explore this question with the bee sperm data set from lesson 15. It has been noted that this data showed a positive correlation between the number of dead sperm and the number of alive sperm for both control and pesticide-treated bees. We will explore the effect of unpairing data for just the control bees, and see how this affects the estimated correlation values.

[10]:

# Load the dataset
df = pd.read_csv(os.path.join(data_path, "bee_sperm.csv"), comment='#')
df.head()

[10]:

	Specimen	Treatment	Environment	TreatmentNCSS	Sample ID	Colony	Cage	Sample	Sperm Volume per 500 ul	Quantity	ViabilityRaw (%)	Quality	Age (d)	AliveSperm	Quantity Millions	Alive Sperm Millions	Dead Sperm Millions
0	227	Control	Cage	1	C2-1-1	2	1	1	2150000	2150000	96.7263814616756	96.726381	14	2079617	2.1500	2.079617	0.070383
1	228	Control	Cage	1	C2-1-2	2	1	2	2287500	2287500	96.3498079760595	96.349808	14	2204001	2.2875	2.204001	0.083499
2	229	Control	Cage	1	C2-1-3	2	1	3	87500	87500	98.75	98.750000	14	86406	0.0875	0.086406	0.001094
3	230	Control	Cage	1	C2-1-4	2	1	4	1875000	1875000	93.2874208336941	93.287421	14	1749139	1.8750	1.749139	0.125861
4	231	Control	Cage	1	C2-1-5	2	1	5	1587500	1587500	97.7925061050061	97.792506	14	1552456	1.5875	1.552456	0.035044

[11]:

# Extract NumPy arrays (only use values greater than zero for logs)
inds = (df["Alive Sperm Millions"] > 0) & (df["Dead Sperm Millions"] > 0)

alive_ctrl = df.loc[
    (inds) & (df["Treatment"] == "Control"), "Alive Sperm Millions"
].values

dead_ctrl = df.loc[
    (inds) & (df["Treatment"] == "Control"), "Dead Sperm Millions"
].values

Let’s start by making an exploratory scatter plot of the dead sperm vs. alive sperm based on the control data, taking the log on both axes.

[12]:

p_data = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="Alive sperm log(millions)",
    y_axis_label="Dead sperm log(millions)",
)
log_alive_ctrl = np.log(alive_ctrl)
log_dead_ctrl = np.log(dead_ctrl)
p_data.circle(log_alive_ctrl, log_dead_ctrl, alpha=1)

bokeh.io.show(p_data)

Based on this plot, we can see that there seems to be a positive correlation between the two variables. Now, let’s move on to our bootstrapping. We will define the same functions we defined in that lesson for drawing bootstrap samples, and write a new function to calculate the bivariate covariance for each sample. To draw pairs bootstraps correctly, we should randomly select indices to be used in each bootstrapped sample, and get both the x and the y values at those indices to ensure that all samples have corresponding x and y values.

[13]:

@numba.njit
def bivariate_r(x, y):
    """
    Compute plug-in estimate for the bivariate correlation coefficient.
    """
    return (
        np.sum((x - np.mean(x)) * (y - np.mean(y)))
        / np.std(x)
        / np.std(y)
        / np.sqrt(len(x))
        / np.sqrt(len(y))
    )

@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def draw_bs_pairs(x, y):
    """Draw a pairs bootstrap sample."""
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)

    return x[bs_inds], y[bs_inds]

@numba.njit
def draw_bs_pairs_reps_bivariate(x, y, size=1):
    """
    Draw bootstrap pairs replicates.
    """
    out = np.empty(size)

    for i in range(size):
        out[i] = bivariate_r(*draw_bs_pairs(x, y))

    return out

Now, let’s also add functions that will draw bootstrap replicates incorrectly. That is, these functions will draw a random sample from each array describing a dimension of the paired data, and the resulting samples of x and y will be non-corresponding.

[14]:

@numba.njit
def draw_bs_pairs_wrong(x, y):
    """Draw a pairs bootstrap sample."""

    bs_inds_x = draw_bs_sample(x)
    bs_inds_y = draw_bs_sample(y)

    return bs_inds_x, bs_inds_y



@numba.njit
def draw_bs_pairs_reps_bivariate_wrong(x, y, size=1):
    """
    Draw bootstrap pairs replicates.
    """
    out = np.empty(size)

    for i in range(size):
        out[i] = bivariate_r(*draw_bs_pairs_wrong(x, y))

    return out

Let’s draw replicates using both methods.

[15]:

# Get reps
bs_reps_ctrl = draw_bs_pairs_reps_bivariate(
    np.log(alive_ctrl), np.log(dead_ctrl), size=10000
)

bs_reps_ctrl_wrong = draw_bs_pairs_reps_bivariate_wrong(
    np.log(alive_ctrl), np.log(dead_ctrl), size=10000
)

Now, let’s plot these replicates.

[16]:

p_corr = bokeh.plotting.figure(
    width=600,
    height=400,
    x_axis_label="bivariate correlation coeff.",
    y_axis_label="ECDF",
)
p_corr.circle(
    np.sort(bs_reps_ctrl), compute_ecdf(bs_reps_ctrl), legend_label="Indices preserved"
)
p_corr.circle(
    np.sort(bs_reps_ctrl_wrong),
    compute_ecdf(bs_reps_ctrl),
    color="tomato",
    legend_label="Indices not preserved",
)

p_corr.add_layout(p_corr.legend[0], "right")
bokeh.io.show(p_corr)

When we did not preserve indices, we got a distribution of correlation coefficients centered at 0. This tells us we did indeed lose all correlation in the data. Let’s see what it actually looks like when we plot our bootstrap replicate data points.

[17]:

# Get reps
bs_reps_ctrl_wrong = np.array(
    [draw_bs_pairs_wrong(np.log(alive_ctrl), np.log(dead_ctrl)) for i in range(10)]
)

ctrl_x_wrong = bs_reps_ctrl_wrong[:, 0]
ctrl_y_wrong = bs_reps_ctrl_wrong[:, 1]

bs_reps_ctrl = np.array(
    [draw_bs_pairs(np.log(alive_ctrl), np.log(dead_ctrl)) for i in range(10)]
)

ctrl_x = bs_reps_ctrl[:, 0]
ctrl_y = bs_reps_ctrl[:, 1]

p_data = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="Alive sperm log(millions)",
    y_axis_label="Dead sperm log(millions)",
    toolbar_location='above',
)
for i in range(len(ctrl_x)):
    p_data.circle(ctrl_x[i], ctrl_y[i], alpha=0.1, legend_label="Indices preserved")
    p_data.circle(
        ctrl_x_wrong[i],
        ctrl_y_wrong[i],
        alpha=0.1,
        color="tomato",
        legend_label="Indices not preserved",
    )

p_data.add_layout(p_data.legend[0], "right")
bokeh.io.show(p_data)

With even just a tiny number of bootstrap samples, we can see that we preserve the correlation in the original data when we bootstrap by index. However, we completely lose the correlation when indices are not preserved, and the resulting data makes no sense when compared to our original exploratory plot.

2. BCa Confidence Intervals

Use BCa when… your dataset is large and you expect that there are high-order effects causing errors in your estimates.

The BCa confidence interval is the bias-corrected accelerated bootstrap confidence interval. This concept was introduced by Bradley Efron. The BCa interval is designed to automatically create confidence intervals that adjust for underlying higher order effects. In simple terms, it corrects for bias and skew in the distribution of bootstrap estimates.

In this recitation, we will focus on the nonparametric case and show a coding example. If you want to dive into this concept even more, which I highly recommend, take a look at Efron, “Better Bootstrap Confidence Intervals.” Journal of the American Statistical Association 82, no. 397 (1987): 171-85.

Let’s consider the C. elegans egg data example again. We have a dataset of repeated measurements of the areas C. elegans eggs. We know there is a generative distribution for this data, but we don’t know what it is. We want to compute a confidence interval of the mean. For the general case, we’ll call the statistic we want to estimate \(\theta\). As we know, the standard confidence interval is given by:

\begin{align} \theta \in [\hat{\theta} + \hat{\sigma}z^{(\alpha)}, \hat{\theta} + \hat{\sigma}z^{(1 - \alpha)}] \tag{2.1} \end{align}

where \(\hat{\theta}\) is our sample estimate of \(\theta\), \(\hat{\sigma}\) is our estimate of the standard deviation of the distribution of \(\hat{\theta}\), and \(z^{(\alpha)}\) is the \(100 \times \alpha\) percentile point of a standard normal variate.

This standard confidence interval (Equation 2.1) is based on taking the asymptotic normal approximation

\begin{align} (\hat{\theta} - \theta)/\hat{\sigma} \sim N(0, 1) \tag{2.2} \end{align}

In our case, and in many cases, we assume there is a generative distribution, \(g\), for our data. Thus, we can achieve normality and constant standard error by considering the monotone transformation \(\hat{\phi} = g(\hat{\theta})\) and \(\phi = g(\theta)\) instead of \(\hat{\theta}\) and \(\theta\). We assume this transformation is both normalizing and variance stabilizing (defined momentarily). Through this method, we can now compute the confidence interval for \(\phi\) based on:

\begin{align} (\hat{\phi} - \phi)/\tau \sim N(-z_0, 1) \tag{2.3} \end{align}

where \(\tau\) is the constant standard error of \(\hat{\phi}\) and \(z_0\) is the bias constant. To get the confidence interval for \(\theta\), we take the inverse transformation \(\theta = g^{-1}(\phi)\). The great part: all of this is achieved automatically simply through bootstrapping! We do not need to know \(g\) or \(z_0\).

How can we improve on this? That’s where BCa comes in. Like we described previously, through bootstrapping alone we are assuming the transformation \(g\) is both normalizing and variance stabilizing. A variance stabilizing transformation is a transformation such that the variance is no longer related to the mean. For some distributions (ex. Binomial), the variance is dependent on the mean. A variance stabilizing transformation removes this dependence. But, since we do not know the transformation, we do not know for sure that it is variance stabilizing. In the BCa confidence interval, we only need to know that \(g\) is normalizing. This makes it more accurate in many cases!

The BCa incorporates some bias constant, \(z_0\), and some acceleration constant, \(a\), such that the BCa confidence interval is based on:

\begin{align} &(\hat{\phi} - \phi)/\tau \sim N(-z_0\sigma_\phi, \sigma_\phi^2),\\[1em] &\sigma_\phi = 1 + a\phi. \tag{2.4} \end{align}

We can determine \(a\) and \(z_0\) by resampling our data. Here, we are considering the nonparametric case. For the nonparametric case, the acceleration constant for the mean is given by

\begin{align} a = \frac{\sum_i (x_i - \bar{x})^3}{6\left(\sum_i (x_i - \bar{x})^2\right)^{3/2}}. \tag{2.5} \end{align}

We can compute \(z_0\) from our bootstrap replicates. The bias-correction constant, \(z_0\), is:

\begin{align} z_0 = \phi^{-1}(\hat{G}(\hat{\theta})) \tag{2.6} \end{align}

From the plug-in principle, we know that for the nonparametric case we can approximate the CDF \(\hat{G}(\hat{\theta})\) as the ECDF of our bootstrap replicates. So, \(\phi^{-1}(\hat{G}(\hat{\theta}))\) is given by the inverse ECDF of our bootstrap replicates.

This may all be a little confusing now, so let’s practice with the worm egg example! I will review how to compute each parameter as we go.

Let’s load in our worm egg area dataset, digitized from Harvey and Orbidans’s 2011 PLoS One paper.

[18]:

df_worms = pd.read_csv(os.path.join(data_path, 'c_elegans_egg_xa.csv'), comment='#')
df_worms.head()

[18]:

	food	area (sq. um)
0	high	1683
1	high	2061
2	high	1792
3	high	1852
4	high	2091

And we show an ECDF to get a feel for the data.

[19]:

p = iqplot.ecdf(data=df_worms, q="area (sq. um)", cats="food")

bokeh.io.show(p)

It looks like worms that received more food have smaller eggs. Any hypotheses for why that may be the case?

We’ll convert the data we’re interested in into arrays, separated by how much food the worms received:

[20]:

fed_worms = df_worms.loc[df_worms['food'] == 'high']['area (sq. um)'].values
hungry_worms = df_worms.loc[df_worms['food'] == 'low']['area (sq. um)'].values

We can first calculate the confidence intervals using the standard bootstrapping method.

[21]:

def bootstrap_replicates(data, n_reps=1000, statfunction=np.mean):
    """Computes n_reps number of bootstrap replicates for
    given data and statistical function"""
    boot_reps = np.empty(n_reps)
    for i in range(n_reps):
        boot_reps[i] = statfunction(draw_bs_sample(data))
    return boot_reps

[22]:

fed_bs_reps = bootstrap_replicates(fed_worms)
hungry_bs_reps = bootstrap_replicates(hungry_worms)

fed_ci = np.percentile(fed_bs_reps, [2.5, 97.5])
hungry_ci = np.percentile(hungry_bs_reps, [2.5, 97.5])

print(f"Fed worms mean egg area 95% C.I.: {fed_ci}")
print(f"Hungry worms mean egg area 95% C.I.: {hungry_ci}")

Fed worms mean egg area 95% C.I.: [1785.29034091 1857.72954545]
Hungry worms mean egg area 95% C.I.: [2054.77807018 2131.94780702]

Here we write some helper functions to calculate the parameters for the BCa interval. We will combine them in our main function, compute_bca_ci().

To compute a, we defined a function to compute jackknife replicates, which takes in samples of the original data and calculates the test statistic of interest. Specifically, it loops over a dataset and calculates the leave one out (LOO) test statistic when each observation is removed. Here, we are interested in finding the mean, so we set statfunction as np.mean. We also have a function for calculating z by computing the inverse normal CDF.

[23]:

# Define "statfunction" as the function we want to apply to the replicates.


def compute_jackknife_reps(data, statfunction=np.mean):
    """Returns jackknife resampled replicates for
    the given data and statistical function"""
    # Set up empty array to store jackknife replicates
    jack_reps = np.empty(len(data))

    # For each observation in the dataset, compute the statistical function on the sample
    # with that observation removed
    for i in range(len(data)):
        jack_sample = np.delete(data, i)
        jack_reps[i] = statfunction(jack_sample)
    return jack_reps


def compute_a(jack_reps):
    """Returns the acceleration constant a"""
    mean = np.mean(jack_reps)
    return (1 / 6) * np.divide(
        np.sum(mean - jack_reps) ** 3, (np.sum(mean - jack_reps) ** 2) ** (3 / 2)
    )


def compute_z0(data, boot_reps, statfunction=np.mean):
    """Computes z0 for given data and statistical function"""
    s = statfunction(data)
    return st.norm.ppf(np.sum(boot_reps < s) / len(boot_reps))


def compute_bca_ci(data, alpha_level, n_reps=1000, statfunction=np.mean):
    """Returns BCa confidence interval for given data at given alpha level"""
    # Compute bootstrap and jackknife replicates
    boot_reps = bootstrap_replicates(data, n_reps, statfunction)
    jack_reps = compute_jackknife_reps(data, statfunction)

    # Compute a and z0
    a = compute_a(jack_reps)
    z0 = compute_z0(data, boot_reps, statfunction)

    # Compute confidence interval indices
    alphas = np.array([alpha_level / 2.0, 1 - alpha_level / 2.0])
    zs = z0 + st.norm.ppf(alphas).reshape(alphas.shape + (1,) * z0.ndim)
    avals = st.norm.cdf(z0 + zs / (1 - a * zs))
    ints = np.round((len(boot_reps) - 1) * avals)
    ints = np.nan_to_num(ints).astype("int")

    # Compute confidence interval
    boot_reps = np.sort(boot_reps)
    ci_low = boot_reps[ints[0]]
    ci_high = boot_reps[ints[1]]
    return (ci_low, ci_high)

We can now compute the BCa confidence intervals.

[24]:

print(f"Fed worms mean egg area 95% C.I.: {compute_bca_ci(fed_worms, 0.05)}")
print(f"Hungry worms mean egg area 95% C.I.: {compute_bca_ci(hungry_worms, 0.05)}")

Fed worms mean egg area 95% C.I.: (1791.1818181818182, 1871.9318181818182)
Hungry worms mean egg area 95% C.I.: (2065.4210526315787, 2150.3508771929824)

Comparing the BCa and the standard bootstrap confidence intervals, we can see that the BCa confidence intervals are wider, especially on the right side.

3. Studentized Bootstrap

Use the Studentized Bootstrap when… your dataset is small and you know that there is a large amount of measurements you are missing.

The Studentized bootstrap error aims to reduce coverage error of our resampling. It is a good idea to use Studentized bootstrapping to approximate your statistic of interest when you suspect the distribution of your sample may not be close to the actual population distribution. The Studentized bootstrapping method works because even when this is the case, the distribution of the sample statistic divided by estimated standard error could be close to the actual population statistic divided by the population standard error. This the same principle employed in the Student-t test.

The Studentized bootstrap involves taking inner and outer bootstrap samples. The algorithm is as follows:

Compute \(\hat{\theta}\), the sample test statistic
A large number of times (number of reps ≈1000), draw a bootstrap sample. Call this the outer sample:
1. Compute the statistical functional on the sample
2. Resample your bootstrap sample ≈50 to 100 times. Call this the inner sample. Compute the statistical functional from the inner sample.
3. For your inner sample replicates, compute the standard deviation
Compute your statistic of interest from your replicates: t = (outer sample replicates) / (standard deviation of the inner sample replicates)
Compute s = standard deviation of your outer sample replicates
Multiply t by s
Compute the confidence interval from the \(\alpha/2\) and \((1 - \alpha)/2\) percentiles of t * s.

Let’s try this out on a dataset of the repair times of landline phones in New York City. The New York Public Utilities Commission monitors the response time for repairing landline phones in the state, and repair times are dependent on the time of year, type of repair, and geographical location. The data here are for one class of repairs at one time period for a specific Incumbent Local Exchange Carrier, which is a telephone company which once held the regional monopoly on landline service.

This times come from a particular time and region of New York City, so we expect them to not be representative of the population of all phone repair times ever recorded in New York City. First, we read in the data (which I obtained from the online textbook for UC Berkeley’s Data 100 class found here).

[25]:

df_times = pd.read_csv(os.path.join(data_path, "ilec.csv"))

p = iqplot.ecdf(data=df_times, q="17.5", x_axis_label="Phone Repair Times")

bokeh.io.show(p)

We see that the distribution is heavily positively skewed, making this a good dataset from which to draw studentized boostrap replicates. For computing the confidence intervals of the means, we’ll want to convert the data of interest into arrays, and take a look at the sample mean:

[26]:

times = np.array(df_times["17.5"])
print(f"Mean average repair time: {np.mean(times)} hours")

Mean average repair time: 8.406145520144317 hours

Now, we’ll employ the Studentized bootstrap to compute the confidence interval of the average landline repair time.

[27]:

def studentized_bootstrap_ci(
    data,
    statfunction=np.mean,
    n_bs_samples=1000,
    n_inner_samples=100,
    ptiles=(2.5, 97.5),
):

    thetas = np.empty(n_bs_samples)
    ts = np.empty(n_bs_samples)

    # Outer sample (regular bootstrap sample)
    for i in range(n_bs_samples):
        sample = draw_bs_sample(data)
        theta = statfunction(sample)
        thetas[i] = theta

        # Inner sample
        resample_stats = np.empty(n_inner_samples)
        for j in range(n_inner_samples):
            resample = draw_bs_sample(sample)
            resample_stats[j] = statfunction(resample)

        # Compute statistic of interest
        ts[i] = np.sum(resample_stats - theta) / np.std(resample_stats)

    ts = np.sort(ts)
    s = np.std(thetas)

    p1, p2 = np.percentile(s * ts, ptiles)
    ci_high = np.mean(data) - (s * p1)
    ci_low = np.mean(data) - (s * p2)

    # Return the confidence interval
    return (ci_low, ci_high)

To time small processes, I use the timeit module, which has a method called default_timer() that initializes a timer in seconds when called. The time is constantly counting up, and you can store the time at any given instant in a variable.

[28]:

# Get the start time before the computation
start = timeit.default_timer()
studentized_ci = studentized_bootstrap_ci(times)

# Get the end time after
end = timeit.default_timer()

print(f"Average Repair Time 95% C.I.: {studentized_ci}")

Average Repair Time 95% C.I.: (5.7803879780637235, 11.073870482620665)

We check how long the computation took for 1000 outer bootstrap samples and 100 inner samples. We will also compare this to the time required to get 1000 bootstrap replicates using the percentile method. The studentized method is longer both because of the inner samples and the computations themselves.

[29]:

print(f"Studentized confidence interval took {end-start} seconds to run.")

Studentized confidence interval took 2.3298009399999984 seconds to run.

For comparison, we can compute the non-Studentized confidence intervals.

[30]:

@numba.njit
def draw_bs_reps_mean(data, size=1):
    """Draw boostrap replicates of the mean from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

# Get the start time before the computation
start = timeit.default_timer()
times_bs = draw_bs_reps_mean(times, size=N)
times_ci = np.percentile(times_bs, [2.5, 97.5])

# Get the end time after
end = timeit.default_timer()

print(f"Average Repair Time 95% C.I.: {times_ci}")

Average Repair Time 95% C.I.: [7.72853668 9.18253743]

[31]:

print(f"Percentile confidence interval took {end-start} seconds to run.")

Percentile confidence interval took 0.14489769899999771 seconds to run.

The non-Studentized confidence interval is smaller, and we can more easily see the difference in the following plot. We also note that the studentized computation takes longer than the percentile bootstrap even with just 1000 bootstrap samples.

[32]:

percentile_dict = {
    "estimate": np.mean(times),
    "conf_int": times_ci,
    "label": "percentile",
}
studentized_dict = {
    "estimate": np.mean(times),
    "conf_int": studentized_ci,
    "label": "studentized",
}

kwargs = {
    "title": "95% Confidence Intervals",
    "x_axis_label": "Average Phone Repair Time (hours)",
}

p = bebi103.viz.confints(
    [percentile_dict, studentized_dict],
    palette=colorcet.b_glasbey_category10[0],
    **kwargs
)
bokeh.io.show(p)

Studentized confidence intervals are most useful for very small datasets (so less likely to be representative of the population) or datasets that are highly skewed. The extra (inner) sampling attempts to correct for the lack of information with a more rigorous treatment.

The main drawback is that it takes much longer than the percentile bootstrapping method. For datasets that are not skewed or very small, the two methods should give approximately equal confidence intervals. If you like, you can repeat this analysis on the bees and pesticides data in lesson 16. This sample distribution is closer to the population distribution, so the two methods give similar confidence intervals.

Computing environment

[33]:

%load_ext watermark
%watermark -v -p numpy,scipy,numba,bokeh,iqplot,bebi103,colorcet,jupyterlab

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 7.31.1

numpy     : 1.21.5
scipy     : 1.7.3
numba     : 0.55.1
bokeh     : 2.4.3
iqplot    : 0.3.2
bebi103   : 0.1.13
colorcet  : 3.0.0
jupyterlab: 3.4.4