Numerical maximum likelihood estimation
[2]:
import itertools
import warnings
import numpy as np
import polars as pl
import scipy.optimize
import scipy.stats as st
import iqplot
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
As discussed in the previous lesson, maximum likelihood estimates for parameters may sometimes be computed analytically, but often cannot. In those cases, we need to resort to numerical methods. In this lesson, we demonstrate some numerical methods to perform maximum likelihood estimates of parameters form a Negative Binomial distribution.
The data set
The data come from the Elowitz lab, published in Singer et al., Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells, Molec. Cell, 55, 319-331, 2014, available here.
In this paper, the authors investigated cell populations of embryonic stem cells using RNA single molecule fluorescence in situ hybridization (smFISH), a technique that enables them to count the number of mRNA transcripts in a cell for a given gene. They were able to measure four different genes in the same cells. So, for one experiment, they get the counts of four different genes in a collection of cells.
The authors focused on genes that code for pluripotency-associated regulators to study cell differentiation. Indeed, differing gene expression levels are a hallmark of differentiated cells. The authors do not just look at counts in a given cell at a given time. The temporal nature of gene expression is also important. While the authors do not directly look at temporal data using smFISH (since the technique requires fixing the cells), they did look at time lapse fluorescence movies of other regulators. We will not focus on these experiments here, but will discuss how the distribution of mRNA counts acquired via smFISH can serve to provide some insight about the dynamics of gene expression.
The data set we are analyzing now comes from an experiment where smFISH was performed in 279 cells for the genes rex1, rest, nanog, and prdm14. The data set may be downloaded at https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv.
Exploratory data analysis
We first load in the data set and generate ECDFs for the mRNA counts for each of the four genes.
[3]:
# Load DataFrame
df = pl.read_csv(os.path.join(data_path, 'singer_transcript_counts.csv'), comment_prefix='#')
# Take a look
df.head()
[3]:
Rex1 | Rest | Nanog | Prdm14 |
---|---|---|---|
i64 | i64 | i64 | i64 |
11 | 34 | 39 | 0 |
172 | 91 | 33 | 5 |
261 | 70 | 68 | 0 |
178 | 54 | 88 | 1 |
129 | 54 | 41 | 0 |
Each of the 279 rows has the mRNA counts for each of the four genes. There may be multiple cell types present, and we do now know how many there are. Our aim here is not to find how many cell types there are, but to demonstrate how MLE works. Nonetheless, we should have some idea of the properties of the data set we are exploring. We can start by plotting ECDFs for each of the four genes. It is useful to have the gene names around for iteration here, and throughout the lesson.
[4]:
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]
plots = [
iqplot.ecdf(
data=df.get_column(gene),
q=gene,
x_axis_label="mRNA count",
title=gene,
frame_height=150,
frame_width=200,
)
for gene in genes
]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
Note the difference in the x-axis scales. Clearly, prdm14 has far fewer mRNA copies than the other genes. The presence of two inflection points in the Rex1 EDCF implies bimodality, leading us to suspect that there may be two cell types, or at least more than one cell type. We can plot all pairwise combinations of gene counts to explore further.
For visualizing the data, it will be useful to have labels for the cells so we can compare expression levels in all four cells at one. We will label them according to the Rex1 levels.
[5]:
# Add cell label, ranked lowest to highest in Rex1 expression
df = df.with_columns(pl.col('Rex1').rank(method='dense').alias('cell'))
# Add colors for plotting, using quantitative to color conversion in bebi103 package
df = df.with_columns(pl.Series(name='color', values=bebi103.viz.q_to_color(df["cell"], bokeh.palettes.Viridis256)))
We will now look at all of the pair-wise plots of mRNA expression. We color the dots according to cell ID, with cell 1 having the lowest count of Rex1 mRNA and the last cell having the highest.
[6]:
def pairwise_plot(df, gene1, gene2):
p = bokeh.plotting.figure(
frame_height=150,
frame_width=150,
x_axis_label=gene1,
y_axis_label=gene2,
)
p.scatter(source=df.to_dict(), x=gene1, y=gene2, color="color", size=2)
return p
plots = [
pairwise_plot(df, gene1, gene2) for gene1, gene2 in itertools.combinations(genes, 2)
]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))
It appears as though there is a correlation between Rest, Nanog, and Rex1. They tend to be high or low together. Prdm14, on the other hand, shows less correlation.
Model for mRNA levels
In this part of the lesson, we will model gene expression of each of the four genes separately, though they are connected by which cell is being measured. We will discuss that later. For now, we develop a model for the mRNA counts for a given gene.
If gene expression is a purely Poisson process, we might expect a Poisson distribution. Or, if the copy number is itself somehow tightly regulated, we might expect a Normal distribution.
Study of gene expression dynamics, largely through fluorescence imaging, has lead to a different story. Expression of many important genes can be bursty, which means that the promoter is on for a period of time in which transcripts are made, and then it is off for a while. The “on” periods are called “bursts” and are themselves well-modeled as a Poisson process. That is to say that the amount of time that a promoter is on is Exponentially distributed. Thus, we can think of a burst as a series of Bernoulli trials. A “failure” is production of an mRNA molecule, and a “success” is a switch to an off state. The number of “successes” we get is equal to the number of bursts we get per decay time of the mRNA. We can define the number of bursts before degradation of the mRNA as \(\alpha\). This is the so-called burst frequency. So, we have a series of Bernoulli trials and we wait for \(\alpha\) successes. Then, \(n\), the total number of failures (which is the number of mRNA transcripts), is Negative Binomially distributed, since this matches the Negative Binomial story. Referring to the parametrization used in the distribution explorer,
\begin{align} n \sim \text{NBinom}(\alpha, \beta), \end{align}
where \(\beta\) is related to the probability \(\theta\) of a success of a Bernoulli trial by \(\theta = \beta/(1+\beta)\).
The meaning of the parameter \(\beta\), and the related quantity \(\theta\), can be a little mystical here. We would like to relate it to the typical burst size, i.e., the typical number of transcripts made per burst. The size of a single given burst (that is, the number of transcripts made in a burst) is geometrically distributed (since it matches that story), so
\begin{align} f(n_\mathrm{burst} ; \theta) = (1-\theta)^{n_\mathrm{burst}}\,\theta. \end{align}
The mean number of transcripts \(b\) in a burst is
\begin{align} b \equiv \left\langle n_\mathrm{burst}\right\rangle &= \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}}\theta\\[1em] &= \theta \sum_{n_\mathrm{burst}=0}^\infty n_\mathrm{burst}(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\sum_{n_\mathrm{burst}=0}^\infty(1-\theta)^{n_\mathrm{burst}} \\[1em] &= \theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}(1-\theta)}\,\frac{1}{\theta}\\[1em] &= -\theta(1-\theta)\, \frac{\mathrm{d}}{\mathrm{d}\theta}\,\frac{1}{\theta} \\[1em] &= \frac{1-\theta}{\theta} \\[1em] &= \frac{1}{\beta}. \end{align}
So we now see that \(1/\beta\) is the typical burst size. Using the Negative Binomial property of mRNA copy numbers of bursty gene expression, we can characterize the expression levels of a given cell type by the two parameters of the Negative Binomial, the burst frequency \(\alpha\) and the burst size \(b = 1/\beta\). These are the two parameters we would like to infer from transcript count data. The conclusion of all this is that we have have our likelihood.
\begin{align} &n \sim \text{NBinom}(\alpha, \beta),\\[1em] &b = 1/\beta. \end{align}
Maximum likelihood estimation by numerical optimization
To compute the MLE for the two parameters, the burst frequency \(\alpha\) and burst size \(\beta\), we need to define the likelihood function. We make the assumption that the number of transcripts in each cell is i.i.d., giving a statistical model of
\begin{align} n_i \sim \text{NBinom}(\alpha,\beta)\;\forall i. \end{align}
Referring to the PMF of the Negative Binomial distribution and making the change of variables \(b=1/\beta\), the likelihood function is
\begin{align} L(\alpha, b;\mathbf{n}) = \prod_i\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)n!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}, \end{align}
and the log-likelihood is
\begin{align} \ell(\alpha, b;\mathbf{n}) = \ln L(\alpha, b;\mathbf{n}) = \sum_i \ln \left(\frac{\Gamma(n_i+\alpha)}{\Gamma(\alpha)n!}\left(\frac{1}{1+b}\right)^\alpha\left(\frac{b}{1+b}\right)^{n_i}\right). \end{align}
To find the MLE, we need to find the values of \(\alpha\) and \(b\) that satisfy
\begin{align} \frac{\partial \ell}{\partial \alpha} = \frac{\partial \ell}{\partial b} = 0. \end{align}
Unfortunately, no closed form solution exists for this. We therefore need to resort to numerical optimization to find the MLE \(\alpha^*\) and \(b^*\).
Numerical optimization
Numerical optimization is typically implemented to find a minimizers of a function rather than maximizers. The function being minimized is called an objective function. This is not a problem for maximum likelihood estimation; we simply define a negative log-likelihood as our objective function.
Sometimes, we have constraints on the allowed values for the parameters. In our case, both \(\alpha\) and \(\beta\) must be non-negative. So, the statement of the optimization problem to find the MLE is
\begin{align} \text{minimize } (-\ell(\alpha, \beta;\mathbf{n})) \text{ s.t. } \alpha, \beta > 0, \end{align}
where “s.t.” is read “subject to.” If we explicitly consider the constraints, we are performing a constrained optimization problem. Constrained optimization is often considerably more challenging than unconstrained optimization. There are ways around simple positivity constraints such as the ones here. We can instead define new variables \(\xi_\alpha = \ln \alpha\) and \(\xi_b = \ln b\), and write the log-likelihood in terms of these variables instead. We then find minimizing \(\xi_\alpha\) and \(\xi_b\) and convert them to \(\alpha\) and \(\beta\) by exponentiation after performing the minimization calculation.
Numerical optimization is implemented in the scipy.optimize
submodule (docs). Most of the functionality you need is in the scipy.optimize.minimize()
function. To use the function to find minimizers of an objective function, the standard call signature is
scipy.optimize.minimize(fun, x0, args=(), method='BFGS')
The fun
argument is a function with call signature fun(x, *args)
, where x
is the variables used in the optimization. In the case of MLE, the function is the negative log-likelihood function, x
is always an array of the parameter values we are trying to estimate, and the remaining arguments are additional arguments passed into the likelihood function, which always include the measured data. Importantly, we have to provide a guess as to which values of the parameters are optimal.
This is passed as an array x0
. The kwarg args
specifies which additional arguments are to be passed to fun()
. Note that args
must be a tuple. Finally, the method
keyword argument specifies which numerical optimization method to use, the default being the Broyden–Fletcher–Goldfarb–Shanno algorithm. This is a good algorithm but does compute derivatives, so it is only useful if the parameter
values can take on any real value.
I have omitted the bounds
keyword argument here because we will not usually use them, as we will either do the logarithm trick above, or use Powell’s method, which does not required calculation of derivatives (so we may therefore have discontinuities in the objective function and set the value of the objective function to be infinity for disallowed parameter values).
Guess at optimal parameters
We will use the method of moments to guess the optimal parameters. Referring to the Distribution Explorer for the first moment and variance, we can equate them with the plug-in estimates.
\begin{align} &\langle n \rangle = \frac{\alpha}{\beta} = \alpha b = \bar{n},\\[1em] &\sigma^2 = \frac{\alpha(1+\beta)}{\beta^2} = \alpha b(1+b) = \hat{\sigma}^2. \end{align}
Solving gives our method-of-moments estimates,
\begin{align} &\alpha_\mathrm{mom} = \frac{\bar{n}}{b} = \frac{\bar{x}^2}{\hat{\sigma}^2 - \bar{n}},\\[1em] &b_\mathrm{mom} = \frac{\hat{\sigma}^2}{\bar{n}} - 1. \end{align}
We can compute the estimates from the plug-in values.
[7]:
# Extract the values for Nanog
n = df.get_column('Nanog').to_numpy()
# Compute
b_mom = np.var(n) / np.mean(n) - 1
alpha_mom = np.mean(n) / b_mom
# Take a look
print(f"α_mom = {alpha_mom}\nb_mom = {b_mom}")
α_mom = 1.7700110004078886
b_mom = 49.569381904553985
Note that the method-of-moments is not the only way to come up with guesses. As an example, if the data are not actually generated from a Negative Binomial distribution (or even if they are and there are not too many data points), then we may have \(\hat{\sigma} < \bar{n}\), which would result in a negative estimate for both \(\alpha\) and \(b\), which are not even allowed values of the parameters. It turns out that in this case, the optimization works just fine if we arbitrarily pick something like \(\alpha \approx \beta \approx 3\).
Solving with the BFGS algorithm
We will now solve the minimization problem using the BFGS algorithm, specifying the parameters using logarithms to make sure that the problem is completely unconstrained. First, we have to write a function for the log-likelihood matching the required function signature of the input fun
to scipy.optimize.minimize()
. Note that we do not have to hand-code the log-likelihood. The scipy.stats
module has functions to compute the log-PDF/log-PMF for many distributions. We just need to check
the Distribution Explorer to ensure we use the parametrization that the scipy.stats
module requires. In this case, it expects parameters alpha
and 1/1+b
.
[8]:
def log_like_iid_nbinom_log_params(log_params, n):
"""Log likelihood for i.i.d. NBinom measurements with
input being logarithm of parameters.
Parameters
----------
log_params : array
Logarithm of the parameters alpha and b.
n : array
Array of counts.
Returns
-------
output : float
Log-likelihood.
"""
log_alpha, log_b = log_params
# Convert from log parameters
alpha = np.exp(log_alpha)
b = np.exp(log_b)
return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))
With the log likelihood specified, we simply use -log_like_iid_nbinom_params()
as our objective function, which we can succinctly code up as an anonymous (lambda) function. Let’s perform the optimization for the nanog gene and look at the result.
[9]:
# Solve, making sure to use log parameters
res = scipy.optimize.minimize(
fun=lambda log_params, n: -log_like_iid_nbinom_log_params(log_params, n),
x0=np.array([np.log(alpha_mom), np.log(b_mom)]),
args=(n,),
method='BFGS'
)
res
[9]:
message: Desired error not necessarily achieved due to precision loss.
success: False
status: 2
fun: 1524.9284357729396
x: [ 2.338e-01 4.241e+00]
nit: 11
jac: [ 3.204e-04 1.068e-04]
hess_inv: [[ 8.545e-09 2.968e-07]
[ 2.968e-07 1.043e-05]]
nfev: 308
njev: 96
The result returned by scipy.optimize.minimize()
is an OptimizeResult
object that has several attributes about how the optimization calculation went, including if it was successful. Importantly, the optimal log-parameter values are in the array x
. We can extract them and exponentiate them to get the MLE.
[10]:
alpha_mle, b_mle = np.exp(res.x)
print("α: ", alpha_mle)
print("b: ", b_mle)
α: 1.263433985420747
b: 69.44437003467432
So, the MLE for the burst frequency is about 1.25 inverse degradation times. The MLE for the burst size is about 70 transcripts per burst.
Solving using Powell’s method
As an alternative to the BFGS method, we can use Powell’s method. This has the advantage that we do not have to use derivatives in the optimization, so we do not have to use logarithms of the parameters. We do, however, need to specify that the log-likelihood is minus infinity for disallowed parameter values.
[11]:
def log_like_iid_nbinom(params, n):
"""Log likelihood for i.i.d. NBinom measurements."""
alpha, b = params
if alpha <= 0 or b <= 0:
return -np.inf
return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))
We take a similar approach to solving using Powell’s method. This time, we will catch warnings because the solver will stumble into regions where the log-likelihood is minus infinity. We know this to be the case, as we designed it that way, so we will suppress the warnings to keep our notebook clean. We will tighten the tolerance on the solver using the tol
keyword argument to insist on very small gradients when finding the minimum. This results in a more accurate MLE at the cost of more
computational effort.
[12]:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res = scipy.optimize.minimize(
fun=lambda params, n: -log_like_iid_nbinom(params, n),
x0=np.array([alpha_mom, b_mom]),
args=(n,),
method='Powell',
tol=1e-6,
)
if res.success:
alpha_mle, b_mle = res.x
else:
raise RuntimeError('Convergence failed with message', res.message)
print("α: ", alpha_mle)
print("b: ", b_mle)
α: 1.2634329776733069
b: 69.44440337333538
This differs from the result we got with BFGS in the third or fourth decimal place, due to inaccuracies in introducing the logarithms, but the difference is not big and also is small compared to the confidence interval we will for the MLEs of α and b in a subsequent lesson.
The likelihood function
To help give a picture of what the likelihood function looks like, and what the optimizer is doing, we can plot it. In this case, we have two parameters, so we can make a contour plot. We first compute the log-likelihood for various values of \(\alpha\) and \(b\).
[13]:
# alpha and b values for plotting
alpha = np.linspace(1, 1.5, 100)
b = np.linspace(50, 90, 100)
# Compute log-likelihood for each value
log_like = np.empty((100, 100))
for j, alpha_val in enumerate(alpha):
for i, b_val in enumerate(b):
log_like[i, j] = log_like_iid_nbinom((alpha_val, b_val), n)
Remember that the likelihood function is not a probability distribution, so it is not normalized. When we exponentiate the log-likelihood, we may get values close to zero, or very large. It is therefore a good idea to first subtract the maximal value of all computed log-likelihoods. This has the effect of multiplying the likelihood function by a constant.
[14]:
like = np.exp(log_like - log_like.max())
Now, we can make a contour plot using the bebi103.viz.contour()
function.
[15]:
p = bebi103.viz.contour(alpha, b, like, overlaid=True, x_axis_label="α*", y_axis_label="b*")
bokeh.io.show(p)
Graphically, we can see that we appropriately found the maximum.
A quick visualization
We can do a quick visualization of our MLE to see if the model holds up. We will talk more about graphical tests of model predictive accuracy in coming lessons, but for now, we will simply overlay the theoretical CDF parametrized by the MLE. We can conveniently use the scipy.stats
module to generate the CDF. It is probably overkill, since we have such a wide range of mRNA counts, but we will take care to make sure we plot the theoretical CDF as a staircase, since it is for a discrete
distribution.
[16]:
p = iqplot.ecdf(data=df['Nanog'], q='Nanog', conf_int=True)
n_theor = np.arange(0, df['Nanog'].max()+1)
cdf_theor = st.nbinom.cdf(n_theor, alpha_mle, 1/(1+b_mle))
# Weave together to make staircase for discrete distribution
n_plot = np.empty(2 * len(n_theor))
cdf_plot = np.empty(2 * len(n_theor))
cdf_plot[0] = 0
cdf_plot[1::2] = cdf_theor
cdf_plot[2::2] = cdf_theor[:-1]
n_plot[::2] = n_theor
n_plot[1::2] = n_theor
p.line(n_plot, cdf_plot, line_color='orange', line_width=2)
bokeh.io.show(p)
The MLE curve deviates from the nonparametric ECDF 95% confidence interval. This suggests we may be missing something in our model. We will cover this in more depth in future lessons.
Computing environment
[17]:
%load_ext watermark
%watermark -v -p numpy,scipy,polars,bokeh,iqplot,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.12.3
IPython version : 8.25.0
numpy : 1.26.4
scipy : 1.13.1
polars : 1.1.0
bokeh : 3.4.1
iqplot : 0.3.7
bebi103 : 0.1.21
jupyterlab: 4.0.13