22. Implementation of MLE for variate-covariate models

Dataset download

[2]:

import warnings

import tqdm

import numpy as np
import polars as pl
import scipy.optimize
import scipy.stats as st

import iqplot
import bebi103

import bokeh.io
bokeh.io.output_notebook()

Loading BokehJS ...

We now proceed to do maximum likelihood estimation for the parameters of the second model for spindle length.

\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i, \end{align}

The parameters for this model are \(\gamma\), \(\phi\), and \(\sigma\).

Finally, before we being doing MLE, as a reminder, we will load in and plot the data.

[3]:

df = pl.read_csv(os.path.join(data_path, 'good_invitro_droplet_data.csv'), comment_prefix='#')

p = bokeh.plotting.figure(
    frame_height=300,
    frame_width=400,
    x_axis_label="droplet diameter (µm)",
    y_axis_label="spindle length (µm)",
    x_range=[0, 250],
    y_range=[0, 50],
)

p.scatter(
    source=df.to_dict(),
    x="Droplet Diameter (um)",
    y="Spindle Length (um)",
    alpha=0.3,
)

bokeh.io.show(p)

It is also useful to extract the data sets as Numpy arrays for speed in the MLE and bootstrap calculations.

[4]:

d = df['Droplet Diameter (um)'].to_numpy()
ell = df['Spindle Length (um)'].to_numpy()

Performing MLE by direct numerical optimization

We can take the same approach as we did with the mRNA counts from the smFISH experiments. We write a function for the log likelihood and then use numerical optimization to find the parameters.

The joint probability density function for the data set is the product of the probability densities of Normals. Therefore, the log likelihood is the sum of log probability densities of Normals. We can code that up using the convenient functions in the scipy.stats module.

[5]:

def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)

    return np.sum(st.norm.logpdf(ell, mu, sigma))

In looking at the code for the log likelihood, the convenience of the scipy.stats module is clear. We more or less code up the model as we would say it!

We can now code up the MLE calculation. For our initial guesses, we can look at the data an make estimates. The initial slope of the \(l\) vs. \(d\) curve is about one-half, so we will guess an initial value of \(\gamma\) of 0.5. The plateau seems to be around 40 µm, so we will guess \(\phi = 40\) µm. Eyeballing the standard deviation in the data points, I would guess it is about 5 µm.

[6]:

def spindle_mle(d, ell):
    """Compute MLE for parameters in spindle length model."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, d, ell: -log_likelihood(params, d, ell),
            x0=np.array([0.5, 35, 5]),
            args=(d, ell),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Let’s run the calculation to get MLEs for the parameters.

[7]:

mle_params = spindle_mle(d, ell)

mle_params

[7]:

array([ 0.86047455, 38.23124995,  3.75342168])

We can take a quick look as to how the theoretical curve might look with respect to the data.

[8]:

# Compute theoretical curve using MLE parameters
d_theor = np.linspace(0, 250, 200)
ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)

# Add theoretical curve to the plot
p.line(d_theor, ell_theor, line_width=2, line_color="orange")

bokeh.io.show(p)

Of course, we are not using the full generative model here, since we are not using the parameter \(\sigma\). We will discuss in much more detail in the lesson on model assessment how to visualize the entire generative model as it relates to the measured data.

Least squares

We worked out in our lesson on MLE of variate-covariate models that when we have a homoscedastic Normal model for residuals, we may compute the MLE by first minimizing the residual sum of squares and then computing the MLE for the variance of the Normally distributed residuals as \(\sigma^* = \sqrt{\text{RSS}^* / n}\), where \(n\) is the total number of data points.

Conveniently, this is the form of our generative model, so we can split the optimization problem in two parts. First, find the values of parameters \(\gamma\) and \(\phi\) that minimize the residual sum of squares. Then, using these values, compute the MLE for \(\sigma\) using the analytical result we have just derived.

The Levenberg-Marquardt algorithm is a fast, robust algorithm to perform minimization of the RSS. A variant of this is implemented in scipy.optimization.least_squares() (docs). To use that algorithm you need to define a function for the residual; that is a function that returns the quantity \(l_i - \mu_i\). The function has call signature f(params, *args), where params is a Numpy array containing the parameters you will use in the optimization (in our case, \(\gamma\) and \(\phi\)), and args is a tuple of any other arguments passed to the function, which almost always include the data (in our case (d, ell)). We can code up that function.

[9]:

def resid(params, d, ell):
    """Residual for spindle length model."""
    return ell - theor_spindle_length(*params, d)

We can call scipy.optimize.least_squares() to perform the MLE. We can also specify bounds as a tuple containing a list of the lower bounds of the respective parameters and a list of the upper bounds of the respective parameters. We bound \(\gamma\) between zero and one, and \(\phi\) just has to be positive.

[10]:

res = scipy.optimize.least_squares(
    resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
)

# Compute residual sum of squares from optimal params
rss_mle = np.sum(resid(res.x, d, ell)**2)

# Compute MLE for sigma
sigma_mle = np.sqrt(rss_mle / len(d))

# Take a look
res.x, sigma_mle

[10]:

(array([ 0.8601068, 38.2467691]), 3.754999871543999)

Indeed, we got the same results as before. Let’s code this procedure up into a function. Looking ahead to using bebi103.bootstrap.draw_bs_reps_mle(), we need the function to take the data set as a single argument.

[11]:

def spindle_mle_lstq(data):
    """Compute MLE for parameters in spindle length model."""
    # Unpack data
    d = data[:, 0]
    ell = data[:, 1]

    res = scipy.optimize.least_squares(
        resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
    )

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, d, ell)**2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(d))

    return tuple([x for x in res.x] + [sigma_mle])

The big advantages of doing this are the robustness and speed of the Levenberg-Marquardt algorithm. Let’s do a speed test.

[12]:

print("Timing for MLE by Powell's method:")
%timeit spindle_mle(d, ell)

print("\nTiming for MLE by least_squares:")
data = df[['Droplet Diameter (um)', 'Spindle Length (um)']].to_numpy()
%timeit spindle_mle_lstq(data)

Timing for MLE by Powell's method:
6.39 ms ± 11.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing for MLE by least_squares:
1.37 ms ± 16.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a substantial speed boost of nearly 4×.

Confidence intervals

We can now proceed to compute confidence intervals for our parameters. We will do a parametric bootstrap here. We use the generative model parametrized by the MLE to regenerate data sets. For each value of \(d\) (which are fixed), we draw values of \(l\) using the generative distribution). Then we perform MLE of those to get confidence intervals. To use bebi103.draw_bs_reps_mle(), then, we just are left to supply the bootstrap sample generating function.

[13]:

def gen_spindle_data(params, d, size, rng):
    """Generate a new spindle data set."""
    mu = theor_spindle_length(*params[:-1], d)
    sigma = params[-1]

    return np.vstack((d, rng.normal(mu, sigma))).transpose()

Note that we were careful to make sure the data set has the appropriate dimensions; the row indexes the trial. We can now compute our replicates and confidence intervals.

[14]:

bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    spindle_mle_lstq,
    gen_spindle_data,
    data=df[["Droplet Diameter (um)", "Spindle Length (um)"]].to_numpy(),
    mle_args=(),
    gen_args=(d,),
    size=10_000,
    n_jobs=9,
    progress_bar=False,
)

# Compute confidence intervals
conf_ints = np.percentile(bs_reps, [2.5, 97.5], axis=0)

# Print the results
print(
    """95% Confidence intervals
γ: [ {0:.2f},  {1:.2f}]
φ: [{2:.2f}, {3:.2f}]
σ: [ {4:.2f},  {5:.2f}]
""".format(
        *conf_ints.transpose().ravel()
    )
)

95% Confidence intervals
γ: [ 0.83,  0.89]
φ: [37.53, 39.01]
σ: [ 3.55,  3.95]

The confidence intervals are given in each column for \(\gamma\), \(\phi\), and \(\sigma\), respectively, and are quite tight. This is because we have lots of data points. This does not mean that the data are tightly distributed about the theoretical curve, only that the MLE estimates for the parameters will not vary much if we repeated the experiment and analysis.

Other confidence intervals

We have computed parametric confidence intervals by regenerating spindle lengths for each droplet diameter we measured using the generative model parametrized by the MLE. There are many ways to compute confidence intervals for variate-covariate models, including pairs, residual, and wild bootstrap. Briefly, here are the respective techniques.

Pairs bootstrap: As we saw when doing nonparametric bootstrap confidence intervals for correlation coefficients, we can apply a similar technique here. We resample the data set as, in this case, pairs of droplet diameters and spindle lengths \((d_i, \ell_i)\), and compute MLE estimates for the parameters for each bootstrap sample.

Residual bootstrap: In residual bootstrap, the residuals are resampled. The procedure goes as follows.

Compute the MLE for the parameters of the model, \(\theta^*\). In the spindle length example, \(\theta^* = (\gamma^*, \phi^*)\).
Define \(\hat{y}_i = y(x_i; \theta^*)\) to be the value of \(y\) computed as point \(x_i\) using the theoretical curve parametrized by the MLE estimate for \(\theta\).
Define residual \(i\) as \(\epsilon_i = y_i - \hat{y}_i\).
Draw a bootstrap sample of the residuals, randomly assigning a new residual to each datum. Let \(\epsilon_j\) be the \(j\)th resampled residual (that is the \(j\)th residual in the bootstrap sample of residuals).
Generate a bootstrap sample of your data set where the \(j\)th data point \((x_j, \hat{y}_j + \epsilon_j)\).
Recompute the MLE for \(\theta\) from this bootstrap sample.

Wild bootstrap: In wild bootstrap, the residuals are resampled parametrically. If residual \(i\) is \(\epsilon_i = y_i - \hat{y}_i\), then a new residual applied to data point \(i\) is \(\epsilon_i\,v_i\), where \(v_i\) is drawn from a Normal distribution with mean zero and variance one. Thus, data point \(i\) in the bootstrap sample is \((x_i, \hat{y}_i + \epsilon_i\,v_i)\). The MLE for \(\theta\) is then computed from this bootstrap sample.

Why I do not advocate for using these other methods

Pairs bootstrap is a nonparametric way of getting bootstrap samples, with the bootstrap replicate computed using a parametric inference technique. This can work well if we have a large number of data points and no x-values are severely undersampled in the experiment. This is not always the case, however, though it is in the case of the spindle length data.

Residual bootstrap implicitly models the residuals of homoscedastic, since any given residual may be applied to any given data point. Again, this is often not the case.

Wild bootstrap is a twist on residual bootstrap that attempts to model heteroscedasticity. In using wild bootstrap, we have implicitly baked into model that the residual for each data point is normally distributed about the empirical residual with a scale parameter that scales with the size of the empirical residual. This is a way of dealing with heteroscedasticity without seriously considering the variation in measurement in a generative way. But, we really should think generatively. If we are going to model, we should model everything. A full generative model, including how we might expect the variability to change as a function of the measured and/or explanatory variables, is preferable.

For variate-covariate models, I usually prefer to do the parametric bootstrap method we have covered in this lesson. If I have enough well-sampled data points, then pairs bootstrap is a reasonable way to nonparametrically get bootstrap samples to compute a confidence interval. If I am sure of homoscedasticity, residual bootstrap is a good method. I generally shy away from wild bootstrap completely, since I prefer to explicitly model heteroscedasticity.

Computing environment

[15]:

%load_ext watermark
%watermark -v -p numpy,scipy,polars,bokeh,bebi103,jupyterlab

Python implementation: CPython
Python version       : 3.12.3
IPython version      : 8.25.0

numpy     : 1.26.4
scipy     : 1.13.1
polars    : 1.1.0
bokeh     : 3.4.1
bebi103   : 0.1.21
jupyterlab: 4.0.13