28 Pairs bootstrap and correlation

Code

# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import polars as pl

import numba

import bebi103

import bokeh
bokeh.io.output_notebook()

Loading BokehJS ...

We continue our analysis of the drone sperm quality data set. Let’s load it and remind ourselves of the content.

df = pl.read_csv(
    os.path.join(data_path, "bee_sperm.csv"),
    null_values=["NO SPERM", "No Sperm"],
    comment_prefix="#",
)
df.head()

shape: (5, 18)

Specimen	Treatment	Environment	TreatmentNCSS	Sample ID	Colony	Cage	Sample	Sperm Volume per 500 ul	Quantity	ViabilityRaw (%)	Quality	Age (d)	Infertil	AliveSperm	Quantity Millions	Alive Sperm Millions	Dead Sperm Millions
i64	str	str	i64	str	i64	i64	i64	i64	i64	f64	f64	i64	i64	i64	f64	f64	f64
227	"Control"	"Cage"	1	"C2-1-1"	2	1	1	2150000	2150000	96.726381	96.726381	14	0	2079617	2.15	2.079617	0.070383
228	"Control"	"Cage"	1	"C2-1-2"	2	1	2	2287500	2287500	96.349808	96.349808	14	0	2204001	2.2875	2.204001	0.083499
229	"Control"	"Cage"	1	"C2-1-3"	2	1	3	87500	87500	98.75	98.75	14	0	86406	0.0875	0.086406	0.001094
230	"Control"	"Cage"	1	"C2-1-4"	2	1	4	1875000	1875000	93.287421	93.287421	14	0	1749139	1.875	1.749139	0.125861
231	"Control"	"Cage"	1	"C2-1-5"	2	1	5	1587500	1587500	97.792506	97.792506	14	0	1552456	1.5875	1.552456	0.035044

28.1 Correlation

We might wish to investigate how two measured quantities are correlated. For example, if the number of dead sperm and the number of alive sperm are closely correlated, this would mean that a given drone produces some quantity of sperm and some fraction tend to be dead. Let’s take a look at this.

# Set up plot on log scale
p = bokeh.plotting.figure(
    frame_height=300,
    frame_width=300,
    x_axis_label="alive sperm (millions)",
    y_axis_label="dead sperm (millions)",
    x_axis_type="log",
    y_axis_type="log",
)

# Only use values greater than zero for log scale
inds = (pl.col("Alive Sperm Millions") > 0) & (pl.col("Dead Sperm Millions") > 0)

# Populate glyphs
for color, ((treatment,), sub_df) in zip(
    bokeh.palettes.Category10_3, df.filter(inds).group_by("Treatment")
):
    p.scatter(
        source=sub_df.to_dict(),
        x="Alive Sperm Millions",
        y="Dead Sperm Millions",
        color=color,
        legend_label=treatment,
    )

p.legend.location = "bottom_right"

bokeh.io.show(p)

There seems to be some correlation (on a log scale), but it is difficult to tell. We can compute the correlation with the bivariate correlation coefficient, also known as the Pearson correlation. It is the plug-in estimate of the correlation between variables (in this case alive and dead sperm). The correlation is the covariance divided by the geometric mean of the individual variances

The bivariate correlation coefficient is implemented with np.corrcoef(), but we will code our own and JIT it for speed.

@numba.njit
def bivariate_r(x, y):
    """
    Compute plug-in estimate for the bivariate correlation coefficient.
    """
    return (
        np.sum((x - np.mean(x)) * (y - np.mean(y)))
        / np.std(x)
        / np.std(y)
        / np.sqrt(len(x))
        / np.sqrt(len(y))
    )

We can use it to compute the bivariate correlation coefficient for the logarithm of alive and dead sperm.

bivariate_r(
    np.log(df.filter(inds).get_column("Alive Sperm Millions").to_numpy()),
    np.log(df.filter(inds).get_column("Dead Sperm Millions").to_numpy()),
)

0.5219944217488051

28.2 Pairs bootstrap confidence intervals

How can we get a confidence interval on a correlation coefficient? We can again apply the bootstrap, but this time, the replicate is a pair of data, in this case a dead sperm count/alive sperm count pair. The process of drawing pairs of data points from an experiment and then computing bootstrap replicates from them is called pairs bootstrap. Let’s code it up for this example with the bivariate correlation.

Our strategy in coding up the pairs bootstrap is to draw bootstrap samples of the indices of measurement and use those indices to select the pairs.

@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def draw_bs_pairs(x, y):
    """Draw a pairs bootstrap sample."""
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)
    
    return x[bs_inds], y[bs_inds]

With our pairs sampling function in place, we can write a function to compute replicates.

@numba.njit(parallel=True)
def draw_bs_pairs_reps_bivariate(x, y, size=1):
    """
    Draw bootstrap pairs replicates.
    """
    out = np.empty(size)

    for i in numba.prange(size):
        out[i] = bivariate_r(*draw_bs_pairs(x, y))

    return out

Finally, we can put it all together to compute confidence intervals on the correlation. To start, we extract all of the relevant measurements as Numpy arrays to allow for faster resampling (and that’s what our Numba’d functions require).

# Extract NumPy arrays (only use values greater than zero for logs)
inds = (pl.col("Alive Sperm Millions") > 0) & (pl.col("Dead Sperm Millions") > 0)
dfs = df.filter(inds).partition_by('Treatment', as_dict=True)

alive_ctrl = dfs[('Control',)].get_column("Alive Sperm Millions").to_numpy()
alive_pest = dfs[('Pesticide',)].get_column("Alive Sperm Millions").to_numpy()
dead_ctrl = dfs[('Control',)].get_column("Dead Sperm Millions").to_numpy()
dead_pest = dfs[('Pesticide',)].get_column("Dead Sperm Millions").to_numpy()

Now we can compute the bootstrap replicates using our draw_bs_pairs_reps_bivariate() function.

# Get reps
bs_reps_ctrl = draw_bs_pairs_reps_bivariate(
    np.log(alive_ctrl), np.log(dead_ctrl), size=10_000
)

bs_reps_pest = draw_bs_pairs_reps_bivariate(
    np.log(alive_pest), np.log(dead_pest), size=10_000
)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

And from the replicates, we can compute and print the 95% confidence interval.

# Get the confidence intervals
conf_int_ctrl = np.percentile(bs_reps_ctrl, [2.5, 97.5])
conf_int_pest = np.percentile(bs_reps_pest, [2.5, 97.5])

# Plot confidence intervals
summaries = [
    dict(
        label="control",
        estimate=bivariate_r(np.log(alive_ctrl), np.log(dead_ctrl)),
        conf_int=conf_int_ctrl,
    ),
    dict(
        label="treated",
        estimate=bivariate_r(np.log(alive_pest), np.log(dead_pest)),
        conf_int=conf_int_pest,
    ),
]

bokeh.io.show(
    bebi103.viz.confints(summaries, x_axis_label="bivariate correlation of logs")
)

We see a clear correlation in both samples, with a wide, but positive, confidence interval. Note that we did this analysis on a log scale, since the data span several orders of magnitude.

28.3 Computing environment

%load_ext watermark
%watermark -v -p numpy,polars,numba,bokeh,bebi103,jupyterlab

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
polars    : 1.31.0
numba     : 0.61.2
bokeh     : 3.7.3
bebi103   : 0.1.28
jupyterlab: 4.4.5