Numerical maximum likelihood estimation

Dataset download

[2]:

import itertools
import warnings

import numpy as np
import polars as pl
import scipy.optimize
import scipy.stats as st

import iqplot

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

Loading BokehJS ...

As discussed in the previous lesson, maximum likelihood estimates for parameters may sometimes be computed analytically, but often cannot. In those cases, we need to resort to numerical methods. In this lesson, we demonstrate some numerical methods to perform maximum likelihood estimates of parameters form a Negative Binomial distribution.

The data set

The data come from the Elowitz lab, published in Singer et al., Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells, Molec. Cell, 55, 319-331, 2014, available here.

In this paper, the authors investigated cell populations of embryonic stem cells using RNA single molecule fluorescence in situ hybridization (smFISH), a technique that enables them to count the number of mRNA transcripts in a cell for a given gene. They were able to measure four different genes in the same cells. So, for one experiment, they get the counts of four different genes in a collection of cells.

The authors focused on genes that code for pluripotency-associated regulators to study cell differentiation. Indeed, differing gene expression levels are a hallmark of differentiated cells. The authors do not just look at counts in a given cell at a given time. The temporal nature of gene expression is also important. While the authors do not directly look at temporal data using smFISH (since the technique requires fixing the cells), they did look at time lapse fluorescence movies of other regulators. We will not focus on these experiments here, but will discuss how the distribution of mRNA counts acquired via smFISH can serve to provide some insight about the dynamics of gene expression.

The data set we are analyzing now comes from an experiment where smFISH was performed in 279 cells for the genes rex1, rest, nanog, and prdm14. The data set may be downloaded at https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv.

Exploratory data analysis

We first load in the data set and generate ECDFs for the mRNA counts for each of the four genes.

[3]:

# Load DataFrame
df = pl.read_csv(os.path.join(data_path, 'singer_transcript_counts.csv'), comment_prefix='#')

# Take a look
df.head()

[3]:

shape: (5, 4)

Rex1	Rest	Nanog	Prdm14
i64	i64	i64	i64
11	34	39	0
172	91	33	5
261	70	68	0
178	54	88	1
129	54	41	0

Each of the 279 rows has the mRNA counts for each of the four genes. There may be multiple cell types present, and we do now know how many there are. Our aim here is not to find how many cell types there are, but to demonstrate how MLE works. Nonetheless, we should have some idea of the properties of the data set we are exploring. We can start by plotting ECDFs for each of the four genes. It is useful to have the gene names around for iteration here, and throughout the lesson.

[4]:

genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

plots = [
    iqplot.ecdf(
        data=df.get_column(gene),
        q=gene,
        x_axis_label="mRNA count",
        title=gene,
        frame_height=150,
        frame_width=200,
    )
    for gene in genes
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Note the difference in the x-axis scales. Clearly, prdm14 has far fewer mRNA copies than the other genes. The presence of two inflection points in the Rex1 EDCF implies bimodality, leading us to suspect that there may be two cell types, or at least more than one cell type. We can plot all pairwise combinations of gene counts to explore further.

For visualizing the data, it will be useful to have labels for the cells so we can compare expression levels in all four cells at one. We will label them according to the Rex1 levels.

[5]:

# Add cell label, ranked lowest to highest in Rex1 expression
df = df.with_columns(pl.col('Rex1').rank(method='dense').alias('cell'))

# Add colors for plotting, using quantitative to color conversion in bebi103 package
df = df.with_columns(pl.Series(name='color', values=bebi103.viz.q_to_color(df["cell"], bokeh.palettes.Viridis256)))

We will now look at all of the pair-wise plots of mRNA expression. We color the dots according to cell ID, with cell 1 having the lowest count of Rex1 mRNA and the last cell having the highest.

[6]:

def pairwise_plot(df, gene1, gene2):
    p = bokeh.plotting.figure(
        frame_height=150,
        frame_width=150,
        x_axis_label=gene1,
        y_axis_label=gene2,
    )

    p.scatter(source=df.to_dict(), x=gene1, y=gene2, color="color", size=2)

    return p


plots = [
    pairwise_plot(df, gene1, gene2) for gene1, gene2 in itertools.combinations(genes, 2)
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

It appears as though there is a correlation between Rest, Nanog, and Rex1. They tend to be high or low together. Prdm14, on the other hand, shows less correlation.

Model for mRNA levels

In this part of the lesson, we will model gene expression of each of the four genes separately, though they are connected by which cell is being measured. We will discuss that later. For now, we develop a model for the mRNA counts for a given gene.

If gene expression is a purely Poisson process, we might expect a Poisson distribution. Or, if the copy number is itself somehow tightly regulated, we might expect a Normal distribution.

Study of gene expression dynamics, largely through fluorescence imaging, has lead to a different story. Expression of many important genes can be bursty, which means that the promoter is on for a period of time in which transcripts are made, and then it is off for a while. The “on” periods are called “bursts” and are themselves well-modeled as a Poisson process. That is to say that the amount of time that a promoter is on is Exponentially distributed. Thus, we can think of a burst as a series of Bernoulli trials. A “failure” is production of an mRNA molecule, and a “success” is a switch to an off state. The number of “successes” we get is equal to the number of bursts we get per decay time of the mRNA. We can define the number of bursts before degradation of the mRNA as \(\alpha\). This is the so-called burst frequency. So, we have a series of Bernoulli trials and we wait for \(\alpha\) successes. Then, \(n\), the total number of failures (which is the number of mRNA transcripts), is Negative Binomially distributed, since this matches the Negative Binomial story. Referring to the parametrization used in the distribution explorer,

\begin{align} n \sim \text{NBinom}(\alpha, \beta), \end{align}

where \(\beta\) is related to the probability \(\theta\) of a success of a Bernoulli trial by \(\theta = \beta/(1+\beta)\).

The meaning of the parameter \(\beta\), and the related quantity \(\theta\), can be a little mystical here. We would like to relate it to the typical burst size, i.e., the typical number of transcripts made per burst. The size of a single given burst (that is, the number of transcripts made in a burst) is geometrically distributed (since it matches that story), so

\begin{align} f(n_\mathrm{burst} ; \theta) = (1-\theta)^{n_\mathrm{burst}}\,\theta. \end{align}

The mean number of transcripts \(b\) in a burst is

\begin{align} b \equiv \left\langle n_\mathrm{burst}\right\rangle &= \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}}\theta\\[1em] &= \theta \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\sum_{n_\mathrm{burst}=0}^\infty(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\,\frac{1}{\theta}\\[1em] &= -\theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}\theta}\,\frac{1}{\theta} \\[1em] &= \frac{1-\theta}{\theta} \\[1em] &= \frac{1}{\beta}. \end{align}

So we now see that \(1/\beta\) is the typical burst size. Using the Negative Binomial property of mRNA copy numbers of bursty gene expression, we can characterize the expression levels of a given cell type by the two parameters of the Negative Binomial, the burst frequency \(\alpha\) and the burst size \(b = 1/\beta\). These are the two parameters we would like to infer from transcript count data. The conclusion of all this is that we have have our likelihood.

\begin{align} &n \sim \text{NBinom}(\alpha, \beta),\\[1em] &b = 1/\beta. \end{align}

Maximum likelihood estimation by numerical optimization

To compute the MLE for the two parameters, the burst frequency \(\alpha\) and burst size \(\beta\), we need to define the likelihood function. We make the assumption that the number of transcripts in each cell is i.i.d., giving a statistical model of

\begin{align} n_i \sim \text{NBinom}(\alpha,\beta)\;\forall i. \end{align}

Referring to the PMF of the Negative Binomial distribution and making the change of variables \(b=1/\beta\), the likelihood function is

\begin{align} L(\alpha, b;\mathbf{n}) = \prod_i\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)n!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}, \end{align}

and the log-likelihood is

\begin{align} \ell(\alpha, b;\mathbf{n}) = \ln L(\alpha, b;\mathbf{n}) = \sum_i \ln \left(\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)n!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}\right). \end{align}

To find the MLE, we need to find the values of \(\alpha\) and \(b\) that satisfy

\begin{align} \frac{\partial \ell}{\partial \alpha} = \frac{\partial \ell}{\partial b} = 0. \end{align}

Unfortunately, no closed form solution exists for this. We therefore need to resort to numerical optimization to find the MLE \(\alpha^*\) and \(b^*\).

Numerical optimization

Numerical optimization is typically implemented to find a minimizers of a function rather than maximizers. The function being minimized is called an objective function. This is not a problem for maximum likelihood estimation; we simply define a negative log-likelihood as our objective function.

Sometimes, we have constraints on the allowed values for the parameters. In our case, both \(\alpha\) and \(\beta\) must be non-negative. So, the statement of the optimization problem to find the MLE is

\begin{align} \text{minimize } (-\ell(\alpha, \beta;\mathbf{n})) \text{ s.t. } \alpha, \beta > 0, \end{align}

where “s.t.” is read “subject to.” If we explicitly consider the constraints, we are performing a constrained optimization problem. Constrained optimization is often considerably more challenging than unconstrained optimization. There are ways around simple positivity constraints such as the ones here. We can instead define new variables \(\xi_\alpha = \ln \alpha\) and \(\xi_b = \ln b\), and write the log-likelihood in terms of these variables instead. We then find minimizing \(\xi_\alpha\) and \(\xi_b\) and convert them to \(\alpha\) and \(\beta\) by exponentiation after performing the minimization calculation.

Numerical optimization is implemented in the scipy.optimize submodule (docs). Most of the functionality you need is in the scipy.optimize.minimize() function. To use the function to find minimizers of an objective function, the standard call signature is

scipy.optimize.minimize(fun, x0, args=(), method='BFGS')

The fun argument is a function with call signature fun(x, *args), where x is the variables used in the optimization. In the case of MLE, the function is the negative log-likelihood function, x is always an array of the parameter values we are trying to estimate, and the remaining arguments are additional arguments passed into the likelihood function, which always include the measured data. Importantly, we have to provide a guess as to which values of the parameters are optimal. This is passed as an array x0. The kwarg args specifies which additional arguments are to be passed to fun(). Note that args must be a tuple. Finally, the method keyword argument specifies which numerical optimization method to use, the default being the Broyden–Fletcher–Goldfarb–Shanno algorithm. This is a good algorithm but does compute derivatives, so it is only useful if the parameter values can take on any real value.

I have omitted the bounds keyword argument here because we will not usually use them, as we will either do the logarithm trick above, or use Powell’s method, which does not required calculation of derivatives (so we may therefore have discontinuities in the objective function and set the value of the objective function to be infinity for disallowed parameter values).

Guess at optimal parameters

We will use the method of moments to guess the optimal parameters. Referring to the Distribution Explorer for the first moment and variance, we can equate them with the plug-in estimates.

\begin{align} &\langle n \rangle = \frac{\alpha}{\beta} = \alpha b = \bar{n},\\[1em] &\sigma^2 = \frac{\alpha(1+\beta)}{\beta^2} = \alpha b(1+b) = \hat{\sigma}^2. \end{align}

Solving gives our method-of-moments estimates,

\begin{align} &\alpha_\mathrm{mom} = \frac{\bar{n}}{b} = \frac{\bar{x}^2}{\hat{\sigma}^2 - \bar{n}},\\[1em] &b_\mathrm{mom} = \frac{\hat{\sigma}^2}{\bar{n}} - 1. \end{align}

We can compute the estimates from the plug-in values.

[7]:

# Extract the values for Nanog
n = df.get_column('Nanog').to_numpy()

# Compute
b_mom = np.var(n) / np.mean(n) - 1
alpha_mom = np.mean(n) / b_mom

# Take a look
print(f"α_mom = {alpha_mom}\nb_mom = {b_mom}")

α_mom = 1.7700110004078886
b_mom = 49.569381904553985

Note that the method-of-moments is not the only way to come up with guesses. As an example, if the data are not actually generated from a Negative Binomial distribution (or even if they are and there are not too many data points), then we may have \(\hat{\sigma} < \bar{n}\), which would result in a negative estimate for both \(\alpha\) and \(b\), which are not even allowed values of the parameters. It turns out that in this case, the optimization works just fine if we arbitrarily pick something like \(\alpha \approx \beta \approx 3\).

Solving with the BFGS algorithm

We will now solve the minimization problem using the BFGS algorithm, specifying the parameters using logarithms to make sure that the problem is completely unconstrained. First, we have to write a function for the log-likelihood matching the required function signature of the input fun to scipy.optimize.minimize(). Note that we do not have to hand-code the log-likelihood. The scipy.stats module has functions to compute the log-PDF/log-PMF for many distributions. We just need to check the Distribution Explorer to ensure we use the parametrization that the scipy.stats module requires. In this case, it expects parameters alpha and 1/1+b.

[8]:

def log_like_iid_nbinom_log_params(log_params, n):
    """Log likelihood for i.i.d. NBinom measurements with
    input being logarithm of parameters.

    Parameters
    ----------
    log_params : array
        Logarithm of the parameters alpha and b.
    n : array
        Array of counts.

    Returns
    -------
    output : float
        Log-likelihood.
    """
    log_alpha, log_b = log_params

    # Convert from log parameters
    alpha = np.exp(log_alpha)
    b = np.exp(log_b)

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))

With the log likelihood specified, we simply use -log_like_iid_nbinom_params() as our objective function, which we can succinctly code up as an anonymous (lambda) function. Let’s perform the optimization for the nanog gene and look at the result.

[9]:

# Solve, making sure to use log parameters
res = scipy.optimize.minimize(
    fun=lambda log_params, n: -log_like_iid_nbinom_log_params(log_params, n),
    x0=np.array([np.log(alpha_mom), np.log(b_mom)]),
    args=(n,),
    method='BFGS'
)

res

[9]:

  message: Desired error not necessarily achieved due to precision loss.
  success: False
   status: 2
      fun: 1524.9284357729396
        x: [ 2.338e-01  4.241e+00]
      nit: 11
      jac: [ 3.204e-04  1.068e-04]
 hess_inv: [[ 8.545e-09  2.968e-07]
            [ 2.968e-07  1.043e-05]]
     nfev: 308
     njev: 96

The result returned by scipy.optimize.minimize() is an OptimizeResult object that has several attributes about how the optimization calculation went, including if it was successful. Importantly, the optimal log-parameter values are in the array x. We can extract them and exponentiate them to get the MLE.

[10]:

alpha_mle, b_mle = np.exp(res.x)

print("α: ", alpha_mle)
print("b: ", b_mle)

α:  1.263433985420747
b:  69.44437003467432

So, the MLE for the burst frequency is about 1.25 inverse degradation times. The MLE for the burst size is about 70 transcripts per burst.

Solving using Powell’s method

As an alternative to the BFGS method, we can use Powell’s method. This has the advantage that we do not have to use derivatives in the optimization, so we do not have to use logarithms of the parameters. We do, however, need to specify that the log-likelihood is minus infinity for disallowed parameter values.

[11]:

def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))

We take a similar approach to solving using Powell’s method. This time, we will catch warnings because the solver will stumble into regions where the log-likelihood is minus infinity. We know this to be the case, as we designed it that way, so we will suppress the warnings to keep our notebook clean. We will tighten the tolerance on the solver using the tol keyword argument to insist on very small gradients when finding the minimum. This results in a more accurate MLE at the cost of more computational effort.

[12]:

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    res = scipy.optimize.minimize(
        fun=lambda params, n: -log_like_iid_nbinom(params, n),
        x0=np.array([alpha_mom, b_mom]),
        args=(n,),
        method='Powell',
        tol=1e-6,
    )

if res.success:
    alpha_mle, b_mle = res.x
else:
    raise RuntimeError('Convergence failed with message', res.message)

print("α: ", alpha_mle)
print("b: ", b_mle)

α:  1.2634329776733069
b:  69.44440337333538

This differs from the result we got with BFGS in the third or fourth decimal place, due to inaccuracies in introducing the logarithms, but the difference is not big and also is small compared to the confidence interval we will for the MLEs of α and b in a subsequent lesson.

The likelihood function

To help give a picture of what the likelihood function looks like, and what the optimizer is doing, we can plot it. In this case, we have two parameters, so we can make a contour plot. We first compute the log-likelihood for various values of \(\alpha\) and \(b\).

[13]:

# alpha and b values for plotting
alpha = np.linspace(1, 1.5, 100)
b = np.linspace(50, 90, 100)

# Compute log-likelihood for each value
log_like = np.empty((100, 100))
for j, alpha_val in enumerate(alpha):
    for i, b_val in enumerate(b):
        log_like[i, j] = log_like_iid_nbinom((alpha_val, b_val), n)

Remember that the likelihood function is not a probability distribution, so it is not normalized. When we exponentiate the log-likelihood, we may get values close to zero, or very large. It is therefore a good idea to first subtract the maximal value of all computed log-likelihoods. This has the effect of multiplying the likelihood function by a constant.

[14]:

like = np.exp(log_like - log_like.max())

Now, we can make a contour plot using the bebi103.viz.contour() function.

[15]:

p = bebi103.viz.contour(alpha, b, like, overlaid=True, x_axis_label="α*", y_axis_label="b*")

bokeh.io.show(p)

Graphically, we can see that we appropriately found the maximum.

A quick visualization

We can do a quick visualization of our MLE to see if the model holds up. We will talk more about graphical tests of model predictive accuracy in coming lessons, but for now, we will simply overlay the theoretical CDF parametrized by the MLE. We can conveniently use the scipy.stats module to generate the CDF. It is probably overkill, since we have such a wide range of mRNA counts, but we will take care to make sure we plot the theoretical CDF as a staircase, since it is for a discrete distribution.

[16]:

p = iqplot.ecdf(data=df['Nanog'], q='Nanog', conf_int=True)
n_theor = np.arange(0, df['Nanog'].max()+1)
cdf_theor = st.nbinom.cdf(n_theor, alpha_mle, 1/(1+b_mle))

# Weave together to make staircase for discrete distribution
n_plot = np.empty(2 * len(n_theor))
cdf_plot = np.empty(2 * len(n_theor))
cdf_plot[0] = 0
cdf_plot[1::2] = cdf_theor
cdf_plot[2::2] = cdf_theor[:-1]
n_plot[::2] = n_theor
n_plot[1::2] = n_theor

p.line(n_plot, cdf_plot, line_color='orange', line_width=2)
bokeh.io.show(p)

The MLE curve deviates from the nonparametric ECDF 95% confidence interval. This suggests we may be missing something in our model. We will cover this in more depth in future lessons.

Computing environment

[17]:

%load_ext watermark
%watermark -v -p numpy,scipy,polars,bokeh,iqplot,bebi103,jupyterlab

Python implementation: CPython
Python version       : 3.12.3
IPython version      : 8.25.0

numpy     : 1.26.4
scipy     : 1.13.1
polars    : 1.1.0
bokeh     : 3.4.1
iqplot    : 0.3.7
bebi103   : 0.1.21
jupyterlab: 4.0.13