import numpy as npimport polars as plimport numbaimport bebi103import bokehbokeh.io.output_notebook()
Loading BokehJS ...
We continue our analysis of the drone sperm quality data set. Let’s load it and remind ourselves of the content.
df = pl.read_csv( os.path.join(data_path, "bee_sperm.csv"), null_values=["NO SPERM", "No Sperm"], comment_prefix="#",)df.head()
shape: (5, 18)
Specimen
Treatment
Environment
TreatmentNCSS
Sample ID
Colony
Cage
Sample
Sperm Volume per 500 ul
Quantity
ViabilityRaw (%)
Quality
Age (d)
Infertil
AliveSperm
Quantity Millions
Alive Sperm Millions
Dead Sperm Millions
i64
str
str
i64
str
i64
i64
i64
i64
i64
f64
f64
i64
i64
i64
f64
f64
f64
227
"Control"
"Cage"
1
"C2-1-1"
2
1
1
2150000
2150000
96.726381
96.726381
14
0
2079617
2.15
2.079617
0.070383
228
"Control"
"Cage"
1
"C2-1-2"
2
1
2
2287500
2287500
96.349808
96.349808
14
0
2204001
2.2875
2.204001
0.083499
229
"Control"
"Cage"
1
"C2-1-3"
2
1
3
87500
87500
98.75
98.75
14
0
86406
0.0875
0.086406
0.001094
230
"Control"
"Cage"
1
"C2-1-4"
2
1
4
1875000
1875000
93.287421
93.287421
14
0
1749139
1.875
1.749139
0.125861
231
"Control"
"Cage"
1
"C2-1-5"
2
1
5
1587500
1587500
97.792506
97.792506
14
0
1552456
1.5875
1.552456
0.035044
28.1 Correlation
We might wish to investigate how two measured quantities are correlated. For example, if the number of dead sperm and the number of alive sperm are closely correlated, this would mean that a given drone produces some quantity of sperm and some fraction tend to be dead. Let’s take a look at this.
# Set up plot on log scalep = bokeh.plotting.figure( frame_height=300, frame_width=300, x_axis_label="alive sperm (millions)", y_axis_label="dead sperm (millions)", x_axis_type="log", y_axis_type="log",)# Only use values greater than zero for log scaleinds = (pl.col("Alive Sperm Millions") >0) & (pl.col("Dead Sperm Millions") >0)# Populate glyphsfor color, ((treatment,), sub_df) inzip( bokeh.palettes.Category10_3, df.filter(inds).group_by("Treatment")): p.scatter( source=sub_df.to_dict(), x="Alive Sperm Millions", y="Dead Sperm Millions", color=color, legend_label=treatment, )p.legend.location ="bottom_right"bokeh.io.show(p)
There seems to be some correlation (on a log scale), but it is difficult to tell. We can compute the correlation with the bivariate correlation coefficient, also known as the Pearson correlation. It is the plug-in estimate of the correlation between variables (in this case alive and dead sperm). The correlation is the covariance divided by the geometric mean of the individual variances
The bivariate correlation coefficient is implemented with np.corrcoef(), but we will code our own and JIT it for speed.
How can we get a confidence interval on a correlation coefficient? We can again apply the bootstrap, but this time, the replicate is a pair of data, in this case a dead sperm count/alive sperm count pair. The process of drawing pairs of data points from an experiment and then computing bootstrap replicates from them is called pairs bootstrap. Let’s code it up for this example with the bivariate correlation.
Our strategy in coding up the pairs bootstrap is to draw bootstrap samples of the indices of measurement and use those indices to select the pairs.
@numba.njitdef draw_bs_sample(data):"""Draw a bootstrap sample from a 1D data set."""return np.random.choice(data, size=len(data))@numba.njitdef draw_bs_pairs(x, y):"""Draw a pairs bootstrap sample.""" inds = np.arange(len(x)) bs_inds = draw_bs_sample(inds)return x[bs_inds], y[bs_inds]
With our pairs sampling function in place, we can write a function to compute replicates.
@numba.njit(parallel=True)def draw_bs_pairs_reps_bivariate(x, y, size=1):""" Draw bootstrap pairs replicates. """ out = np.empty(size)for i in numba.prange(size): out[i] = bivariate_r(*draw_bs_pairs(x, y))return out
Finally, we can put it all together to compute confidence intervals on the correlation. To start, we extract all of the relevant measurements as Numpy arrays to allow for faster resampling (and that’s what our Numba’d functions require).
We see a clear correlation in both samples, with a wide, but positive, confidence interval. Note that we did this analysis on a log scale, since the data span several orders of magnitude.