17. Hacker’s approach to NHST

Dataset download

[2]:

import numpy as np
import polars as pl

import numba

import iqplot

import bokeh.io
bokeh.io.output_notebook()

Loading BokehJS ...

In this lesson, we will perform null hypothesis significance tests (NHST) using the drone sperm quality data set. Of course, we need to load in the data set first.

[3]:

df = pl.read_csv(
    os.path.join(data_path, "bee_sperm.csv"),
    null_values=["NO SPERM", "No Sperm"],
    comment_prefix="#",
)
df.head()

[3]:

shape: (5, 18)

Specimen	Treatment	Environment	TreatmentNCSS	Sample ID	Colony	Cage	Sample	Sperm Volume per 500 ul	Quantity	ViabilityRaw (%)	Quality	Age (d)	Infertil	AliveSperm	Quantity Millions	Alive Sperm Millions	Dead Sperm Millions
i64	str	str	i64	str	i64	i64	i64	i64	i64	f64	f64	i64	i64	i64	f64	f64	f64
227	"Control"	"Cage"	1	"C2-1-1"	2	1	1	2150000	2150000	96.726381	96.726381	14	0	2079617	2.15	2.079617	0.070383
228	"Control"	"Cage"	1	"C2-1-2"	2	1	2	2287500	2287500	96.349808	96.349808	14	0	2204001	2.2875	2.204001	0.083499
229	"Control"	"Cage"	1	"C2-1-3"	2	1	3	87500	87500	98.75	98.75	14	0	86406	0.0875	0.086406	0.001094
230	"Control"	"Cage"	1	"C2-1-4"	2	1	4	1875000	1875000	93.287421	93.287421	14	0	1749139	1.875	1.749139	0.125861
231	"Control"	"Cage"	1	"C2-1-5"	2	1	5	1587500	1587500	97.792506	97.792506	14	0	1552456	1.5875	1.552456	0.035044

Because we will be using Numba’d functions to rapidly do our sampling, we will go ahead and slice out the subsets of the data frame we will use for inference as Numpy arrays. We will use the specimens that have nonzero sperm count of alive and dead sperm.

[4]:

# Extract NumPy arrays (only use values greater than zero for logs)
inds = (pl.col("Alive Sperm Millions") > 0) & (pl.col("Dead Sperm Millions") > 0)
dfs = df.filter(inds).partition_by('Treatment', as_dict=True)

alive_ctrl = dfs[('Control',)].get_column("Alive Sperm Millions").to_numpy()
alive_pest = dfs[('Pesticide',)].get_column("Alive Sperm Millions").to_numpy()
dead_ctrl = dfs[('Control',)].get_column("Dead Sperm Millions").to_numpy()
dead_pest = dfs[('Pesticide',)].get_column("Dead Sperm Millions").to_numpy()

Finally, so we have them available, we will define some of the utility functions we defined earlier when working with hacker stats.

[5]:

@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def bivariate_r(x, y):
    """
    Compute plug-in estimate for the bivariate correlation coefficient.
    """
    return (
        np.sum((x - np.mean(x)) * (y - np.mean(y)))
        / np.std(x)
        / np.std(y)
        / np.sqrt(len(x))
        / np.sqrt(len(y))
    )

Permutation hypothesis tests

Recall the steps for performing a NHST.

Clearly state the null hypothesis.
Define a test statistic, a scalar value that you can compute from data. Compute it directly from your measured data.
Simulate data acquisition for the scenario where the null hypothesis is true. Do this many times, computing and storing the value of the test statistic each time.
The fraction of simulations for which the test statistic is at least as extreme as the test statistic computed from the measured data is called the p-value, which is what you report.

For one special type of hypothesis, there is a very straight-forward way of simulating it. Here is our hypothesis: the control and pesticide-treated samples have exactly the same distribution. To simulate this, we take the following steps for two data sets, a control with \(n\) measurements and a test the other with \(m\).

Concatenate the two data sets into one.
Randomly scramble the order of the combined data set.
Designate the first \(n\) entries in this scrambled array to be “control” and the remaining to be “test.”

This simulation is exact; it is as if the label of the data set has no meaning; hence the distributions of the two data sets are entirely equal. We test such a null hypothesis with a permutation test. A permutation sample is akin to a bootstrap sample; it is a new pair of data sets generated after scrambling the concatenated data set. A permutation replicate is a value of the test statistic computed from a permutation sample. For this test, we will use the difference of means as the test statistic.

Let’s code this up. We can write a function to draw a permutation sample, and then a function to draw permutation replicates for an arbitrary user-supplied summary statistic.

[6]:

@numba.njit
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)

    return concat_data[:len(x)], concat_data[len(x):]


def draw_perm_reps(x, y, stat_fun, size=1):
    """Generate array of permuation replicates."""
    return np.array([stat_fun(*draw_perm_sample(x, y)) for _ in range(size)])

Since we will look at the difference of means specifically, we will also code up a custom function to draw replicates of the difference of means.

[7]:

@numba.njit(parallel=True)
def draw_perm_reps_diff_mean(x, y, size=1):
    """Generate array of permuation replicates."""
    out = np.empty(size)
    for i in numba.prange(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)

    return out

To perform the hypothesis test, then, with the difference of means as our test statistic, we have only to draw many replicates and then tally up how many of them are more extreme than the observed difference of means.

[8]:

# Compute test statistic for original data set
diff_mean = np.mean(alive_ctrl) - np.mean(alive_pest)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(alive_ctrl, alive_pest, size=10_000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)

p-value = 0.0001

Whoa! Wait a minute. The p-value is zero? This just means that in all of the 10,000 replicates we took, not one had a test statistic as extreme as was observed. We cannot resolve p-values much less than 0.001 with 10,000 permutation replicates. Let’s try taking ten million replicates as see how we do. This will take a few seconds to run (which again highlights the utility of using Numpy arrays with Numba).

[9]:

# Compute test statistic for original data set
diff_mean = np.mean(alive_ctrl) - np.mean(alive_pest)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(alive_ctrl, alive_pest, size=10_000_000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)

p-value = 3.39e-05

So, our p-value is quite small, less than \(10^{-4}\). This means that the probability of getting a difference of means as extreme as was observed under the null hypothesis that the control and test samples were drawn from identical distribution is quite small.

Permutation hypothesis test for correlation

As we continue to explore the data set, we might want to consider a hypothesis: dead and alive sperm counts are uncorrelated. It is possible that under this hypothesis, we could see correlation at the same level observed in the real data. We can test this hypothesis by permutation. In this case, we scramble the labels “dead” and “alive” and see what correlation we get. We will do 10 million permutation replicates. This will take a while to compute.

[10]:

# Compute bivariate r for actual data
r_ctrl = bivariate_r(np.log(alive_ctrl), np.log(dead_ctrl))
r_pest = bivariate_r(np.log(alive_pest), np.log(dead_pest))

# Get permutation replicates
perm_reps_ctrl = draw_perm_reps(
    np.log(alive_ctrl), np.log(dead_ctrl), bivariate_r, size=10_000_000
)
perm_reps_pest = draw_perm_reps(
    np.log(alive_pest), np.log(dead_pest), bivariate_r, size=10_000_000
)

# Compute p-value
p_ctrl = np.sum(np.abs(perm_reps_ctrl) > np.abs(r_ctrl)) / len(perm_reps_ctrl)
p_pest = np.sum(np.abs(perm_reps_pest) > np.abs(r_pest)) / len(perm_reps_pest)

print(
    """
p-value control:   {0:.3e}
p-value pesticide: {1:.3e}
""".format(
        p_ctrl, p_pest
    )
)


p-value control:   0.000e+00
p-value pesticide: 1.200e-06

Note that here we used the absolute value of the permutation replicates. We did this because we are interested to see if we get correlation at all, including anticorrelation. This is another essential piece of a hypothesis test: you need to define what it means to be more extreme.

The p-value for both is tiny; not one in 10 million permutations for the control sample gave a correlation as high as was observed, and only 10 or so of the pesticide treated group. So, we would report something like \(p < 10^{-6}\) for control and \(p < 10^{-5}\) for pesticide.

Bootstrap hypothesis tests

Permutation tests are great: they exactly simulate the null hypothesis that the two samples come from the same generative distribution. But they are limited to this specific null hypothesis. What if we had a different hypothesis, say only that the means of the two distributions we are comparing are equal, but other properties of the generative distributions need not be identical? In this case, we cannot use the permutations to simulate the null hypothesis.

Instead, we can simulate the null hypothesis by shifting the means of the control and test distributions so that they are equal. We then take a bootstrap sample out of each of the shifted data sets. We compute our test statistic from these two bootstrap samples to get a bootstrap replicate. Then, the number of bootstrap replicates that are at least as extreme as the test statistic computed from the original data is used to compute the p-value. Let’s see this in action. First, we’ll see how to shift the data sets and what the resulting ECDFs look like.

[11]:

# Shift data sets
total_mean = np.mean(np.concatenate((alive_ctrl, alive_pest)))
alive_ctrl_shift = alive_ctrl - np.mean(alive_ctrl) + total_mean
alive_pest_shift = alive_pest - np.mean(alive_pest) + total_mean

# Plot the ECDFs
df_shift = pl.DataFrame(
    data={
        "treatment": ["control"] * len(alive_ctrl_shift)
        + ["pesticide"] * len(alive_pest_shift),
        "alive sperm (millions)": np.concatenate((alive_ctrl_shift, alive_pest_shift)),
    }
)
p = iqplot.ecdf(df_shift, q="alive sperm (millions)", cats="treatment")

bokeh.io.show(p)

The distributions now have the same mean, but nothing else about them has changed. They still have the same shape as before. Now, let’s draw bootstrap samples out of these shifted distributions and see how they compare to the original ECDFs.

[12]:

p = iqplot.ecdf(
    df,
    q="Alive Sperm Millions",
    cats="Treatment",
    x_axis_label="alive sperm (millions)",
)

for _ in range(100):
    ctrl_rep = draw_bs_sample(alive_ctrl_shift)
    pest_rep = draw_bs_sample(alive_pest_shift)
    df_rep = pl.DataFrame(
        data={
            "Treatment": ["Control"] * len(ctrl_rep) + ["Pesticide"] * len(pest_rep),
            "Alive Sperm Millions": np.concatenate((ctrl_rep, pest_rep)),
        }
    )

    p = iqplot.ecdf(
        df_rep,
        q="Alive Sperm Millions",
        cats="Treatment",
        p=p,
        line_kwargs=dict(alpha=0.02),
        show_legend=False,
    )

bokeh.io.show(p)

The blue-orange haze in between the ECDFs of the original data are the bootstrap samples under the null hypothesis. Only on rare occurrence does that haze reach the original ECDFs, so we might suspect that the observed data might not be all that probable under the null hypothesis. Let’s perform the hypothesis test.

[13]:

@numba.njit(parallel=True)
def draw_bs_reps_diff_mean(x, y, size=1):
    """
    Generate bootstrap replicates with difference of means
    as the test statistic.
    """
    out = np.empty(size)
    for i in numba.prange(size):
        out[i] = np.mean(draw_bs_sample(x)) - np.mean(draw_bs_sample(y))

    return out


# Generate samples (10 million again)
bs_reps = draw_bs_reps_diff_mean(alive_ctrl_shift, alive_pest_shift, size=10_000_000)

# Compute p-value
p_val = np.sum(bs_reps >= diff_mean) / len(bs_reps)

print("p-value =", p_val)

p-value = 7.9e-06

This p-value is of similar magnitude as what we got from the permutation test.

Simulating a hypothesis: challenging and rewarding

As you can see from the above examples with permutation and bootstrap hypothesis tests, the part of the procedure that requires the most thought and care is simulating the null hypothesis. For a test where we are comparing means, we needed to shift the distributions so that all other aspects of the distribution remain the same, and then resample from the shifted distribution.

The strength of doing these kinds of hypothesis tests is that they are entirely nonparametric. There is no need to specify a probability distribution for the null hypothesis; you can just make statements about how the two distributions you are comparing are related.

Computing environment

[14]:

%load_ext watermark
%watermark -v -p numpy,polars,numba,bokeh,iqplot,jupyterlab

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
polars    : 1.1.0
numba     : 0.59.1
bokeh     : 3.4.1
iqplot    : 0.3.7
jupyterlab: 4.0.13