Maximum likelihood estimation for variate-covariate models

In the last part of this lesson, we discussed how to write down a generative model for variate-covariate data. To obtain estimates for the parameters of the model, we can perform maximum likelihood estimation, precisely as we have done for other models; we maximize the log-likelihood function. The optimization is typically done numerically, as we have already seen in the Numerical maximum likelihood estimation lesson.

For some special cases of variate-covariate models, we can work out analytical results that greatly ease the optimization problem. Fortunately, these special cases arise quite frequently.

In what follows, we will use a general expression for the theoretical model. Let \(y(x;\theta)\) be the theoretical curve with parameters \(\theta = (\theta_1, \theta_2, \ldots)\), and let \((x_i,y_i)\) be a variate-covariate pair of measurements. We assume we have \(n\) of them. We will assume throughout that the data points are i.i.d.

To connect this more general notation to the problem of mitotic spindle size we have been working with, \((x_i, y_i) = (d_i, l_i)\), \(\theta = (\gamma, \phi)\), and

\[\begin{align} y(x;\theta) = l(d;\gamma, \phi) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}}. \end{align}\]
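For concreteness, this theoretical curve could be coded up as a short Python function. This is just a sketch; the name `spindle_length` is an illustrative choice, not one established elsewhere in this lesson.

```python
import numpy as np


def spindle_length(d, gamma, phi):
    """Theoretical spindle length l(d; gamma, phi) for droplet diameter d."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)
```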

Normally distributed residuals

The results we derive here apply to the case where we have Normally distributed (and possibly heteroscedastic) residuals. In the most general case, our model is

\[\begin{align} y_i \sim \text{Norm}(y(x_i;\theta), \sigma_i). \end{align}\]

Actually, this is not the most general case, since in general \(x_i\) may be multi-dimensional, but we are considering scalar \(x_i\) for simplicity. We can write the likelihood as

\[\begin{align} L(\theta, \{\sigma_i\}; \{x_i\}, \{y_i\}) = \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}}\,\exp\left[-\frac{(y_i - y(x_i;\theta))^2}{2\sigma_i^2}\right], \end{align}\]

where \(\{\sigma_i\}\), \(\{x_i\}\), and \(\{y_i\}\) represent the sets of \(\sigma\), \(x\), and \(y\) values, respectively. The log-likelihood is then

\[\begin{align} \ell(\theta, \{\sigma_i\}; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_i \ln \sigma_i - \frac{1}{2}\sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2. \end{align}\]

This is a general log-likelihood for Normally distributed residuals. We will now investigate special cases.
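Before doing so, here is a minimal sketch of how this general log-likelihood might be coded up, leaning on scipy.stats. The function and argument names are my own choices for illustration.

```python
import numpy as np
import scipy.stats


def log_likelihood(theta, sigma, x, y, model):
    """Log-likelihood for Normally distributed residuals.

    `model(x, *theta)` returns the theoretical curve y(x; theta), and
    `sigma` is an array of the sigma_i values (a scalar broadcasts to
    the homoscedastic case considered below).
    """
    return np.sum(scipy.stats.norm.logpdf(y, loc=model(x, *theta), scale=sigma))
```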

Known \(\sigma_i\) and the Levenberg-Marquardt algorithm

In some cases, we know the values of the \(\sigma_i\)’s a priori. That is, they are not parameters we need estimates for, but are already known. While I often see this in the biological literature, I seldom come across examples like this in my own work. Typically, a measurement for a given \(x_i\) is repeated multiple times and then \(\sigma_i\) is estimated. Nonetheless, treating this case is informative, as it introduces a useful algorithm.

Since the \(\sigma_i\)’s are known, they are no longer parameters of the log-likelihood, which is

\[\begin{align} \ell(\theta; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_i \ln \sigma_i - \frac{1}{2}\sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2. \end{align}\]

Since the first two terms in the above expression for the log-likelihood have no \(\theta\)-dependence, we may find the MLE by instead seeking the value of \(\theta\) that minimizes the quantity

\[\begin{align} \sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2. \end{align}\]

This has a convenient structure to utilize the Levenberg-Marquardt algorithm. It robustly and efficiently solves optimization problems of the form

\[\begin{align} \arg \min_z \sum_i \left(f_i(z)\right)^2, \end{align}\]

with each \(f_i\) being scalar-valued and \(z\) being scalar or vector-valued. 1 The maximum likelihood estimation problem is of this form, with \(z = \theta\) and

\[\begin{align} f_i(\theta) = \frac{y_i - y(x_i;\theta)}{\sigma_i}. \end{align}\]

So, to find the MLE for the parameters \(\theta\), we can directly employ the Levenberg-Marquardt algorithm. We will discuss implementation using Python-based tools in a forthcoming lesson.
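As a preview of that lesson, a sketch of an implementation using scipy.optimize.least_squares() might look like the following. The helper name `resid` and the placeholder `theta_guess` are hypothetical, and `spindle_length()` refers to the sketch of the theoretical curve above.

```python
import numpy as np
import scipy.optimize


def resid(theta, x, y, sigma, model):
    """Scaled residuals f_i(theta) = (y_i - y(x_i; theta)) / sigma_i."""
    return (y - model(x, *theta)) / sigma


# Hypothetical usage, assuming arrays x, y, and sigma are in hand;
# method="lm" selects SciPy's Levenberg-Marquardt solver.
#   res = scipy.optimize.least_squares(
#       resid, theta_guess, args=(x, y, sigma, spindle_length), method="lm"
#   )
#   theta_mle = res.x
```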

Note that if the \(\sigma_i\) values are not known, the MLE problem does not match the class of optimization problems for which the Levenberg-Marquardt algorithm can be used.

Homoscedastic Normal models

Now, consider a homoscedastic model,

\[\begin{align} y_i \sim \text{Norm}(y(x_i;\theta), \sigma) \;\forall i. \end{align}\]

Now, the log likelihood function is

\[\begin{align} \ell(\theta, \sigma; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2}\sum_i \left(y_i - y(x_i;\theta)\right)^2. \end{align}\]

Even though we do not know the value of \(\sigma\), we can still get a maximum likelihood estimate for \(\theta\) without considering \(\sigma\) because of the structure of the log-likelihood. Regardless of the value of \(\sigma\), the values of \(\theta\) that minimize

\[\begin{align} \text{RSS} \equiv \sum_i \left(y_i - y(x_i;\theta)\right)^2, \end{align}\]

a quantity known as the residual sum of squares, give the maximum likelihood estimate. The MLE for \(\theta\) can therefore be found using the Levenberg-Marquardt algorithm.

Assume for a moment we find the MLE \(\theta^*\), which gives us a residual sum of squares of \(\text{RSS}^*\). Then, the log-likelihood is

\[\begin{align} \ell(\theta^*, \sigma; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - n\ln\sigma - \frac{\text{RSS}^*}{2\sigma^2}. \end{align}\]

The value of \(\sigma\) that maximizes the log-likelihood is then found by differentiating with respect to \(\sigma\) and setting the derivative to zero.

\[\begin{align} \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\text{RSS}^*}{\sigma^3} = 0. \end{align}\]

This is solved to give \(\sigma^* = \sqrt{\text{RSS}^*/n}\).

So, we can split the optimization problem in two parts. First, find the values of parameters \(\theta\) that minimize the residual sum of squares, typically using the Levenberg-Marquardt algorithm. Then, using these values, compute the MLE for \(\sigma\) using the analytical result we have just derived.
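A sketch of this two-step procedure, again using scipy.optimize.least_squares() and assuming a model function like the `spindle_length()` sketch above (the function name `mle_homoscedastic` is my own choice for illustration):

```python
import numpy as np
import scipy.optimize


def mle_homoscedastic(theta_guess, x, y, model):
    """MLE for theta and sigma under a homoscedastic Normal model."""
    # Step 1: minimize the RSS; the residuals are y_i - y(x_i; theta)
    res = scipy.optimize.least_squares(
        lambda theta: y - model(x, *theta), theta_guess, method="lm"
    )

    # Residual sum of squares evaluated at the MLE, RSS*
    rss_star = np.sum(res.fun**2)

    # Step 2: analytical MLE for sigma, sigma* = sqrt(RSS* / n)
    sigma_mle = np.sqrt(rss_star / len(y))

    return res.x, sigma_mle
```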

A very important caveat

Though commonly encountered, the special case we have just considered is not always the variate-covariate model we might choose. That means that the Levenberg-Marquardt algorithm and the analytical result for the maximum likelihood estimate of the homoscedastic \(\sigma\) will not work in general. They apply only to this special case.

Theoretical functions that are linear in the parameters

Like other optimization algorithms we have seen so far, the Levenberg-Marquardt algorithm will only provide a local minimizer; it is not guaranteed to find a global minimum. That said, if the function \(y(x;\theta)\) is linear in the parameters \(\theta\), it can be shown that, under certain restrictions, the minimizer \(\theta^*\) of the RSS is unique, and therefore a local minimum is also a global minimum.

To show this, since \(y(x;\theta)\) is linear in \(\theta\), we can write it in the form \(y(x;\theta) = \mathbf{z} \cdot \theta\). 2 For example, if \(y(x;\theta)\) is a second order polynomial, \(y(x;\theta) = \theta_1 + \theta_2 x + \theta_3 x^2\), then \(\mathbf{z} = (1, x, x^2)\) and \(\theta = (\theta_1, \theta_2, \theta_3)^\mathsf{T}\). Residual \(i\) is \(y_i - \mathbf{z}_i \cdot \theta\), and the residual sum of squares is

\[\begin{align} \text{RSS} = \sum_i (y_i - \mathbf{z}_i \theta)^2, \end{align}\]

where \(\mathbf{z}_i = (1, x_i, x_i^2)\) for the second order polynomial example, and we have omitted the \(\cdot\) symbol, and will continue to do so for notational concision. We can write this in a more convenient form by defining \(\mathbf{y} = (y_1, y_2, \ldots, y_n)^\mathsf{T}\). Further, for each data point \(i\), there is a \(\mathbf{z}_i\). We define the matrix

\[\begin{split}\begin{align} \mathsf{Z} = \begin{pmatrix} \mathbf{z}_1\\ \mathbf{z}_2\\ \vdots\\ \mathbf{z}_n \end{pmatrix}. \end{align}\end{split}\]

The matrix \(\mathsf{Z}\) is sometimes referred to as a design matrix. The residual sum of squares is then

\[\begin{split}\begin{align} \text{RSS} &= (\mathbf{y} - \mathsf{Z} \theta)^\mathsf{T}(\mathbf{y} - \mathsf{Z} \theta) \\[1em] &= \mathbf{y}^\mathsf{T} \mathbf{y} - \mathbf{y}^\mathsf{T} \mathsf{Z} \theta - (\mathsf{Z} \theta)^\mathsf{T}\mathbf{y} + (\mathsf{Z}\theta)^\mathsf{T} (\mathsf{Z}\theta) \\[1em] &= \mathbf{y}^\mathsf{T} \mathbf{y} - 2\theta^\mathsf{T}\mathsf{Z}^\mathsf{T} \mathbf{y} + \theta^\mathsf{T}\mathsf{Z}^\mathsf{T} \mathsf{Z}\theta. \end{align}\end{split}\]

Differentiating the RSS with respect to \(\theta\) gives

\[\begin{align} \nabla_\theta \text{RSS} = -2\mathsf{Z}^\mathsf{T}\mathbf{y} + 2\mathsf{Z}^\mathsf{T} \mathsf{Z}\theta. \end{align}\]

Setting this equal to zero and solving for \(\theta^*\) (the value of \(\theta\) where \(\nabla_\theta \text{RSS} = \mathbf{0}\)) yields

\[\begin{align} \theta^* = (\mathsf{Z}^\mathsf{T} \mathsf{Z})^{-1}\mathsf{Z}^\mathsf{T}\mathbf{y}. \end{align}\]

Provided the matrix \(\mathsf{Z}^\mathsf{T}\mathsf{Z}\) is nonsingular, this linear system has a unique solution.

This is referred to as ordinary least squares, or OLS, and is a subject in many introductory linear algebra courses. Though \(\theta^*\) may be computed using matrix operations according to the above formula, in practice it is often simply calculated using the Levenberg-Marquardt algorithm, which is typically more numerically stable. Because of what we have just shown, whatever local minimum the algorithm finds is, in fact, a global minimum and therefore the unique MLE for \(\theta\).
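As an illustration of the formula, the following sketch computes \(\theta^*\) for the second order polynomial example. The data here are synthetic, generated only for demonstration.

```python
import numpy as np

# Synthetic data for illustration only; coefficients and noise level are made up
rng = np.random.default_rng(3252)
x = np.linspace(0.0, 1.0, 25)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=len(x))

# Design matrix Z with rows z_i = (1, x_i, x_i^2)
Z = np.column_stack((np.ones_like(x), x, x**2))

# theta* from the normal equations, Z^T Z theta* = Z^T y,
# solved as a linear system rather than by explicit matrix inversion
theta_star = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Equivalent, more numerically stable route via a least squares solver
theta_star_lstsq, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)

# MLE for sigma from the residual sum of squares
rss_star = np.sum((y - Z @ theta_star)**2)
sigma_star = np.sqrt(rss_star / len(y))
```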

Note that this is not generally true! There may be local minima of the RSS as a function of \(\theta\) (and therefore local maxima of the log likelihood) if \(y(x;\theta)\) is nonlinear in the parameters \(\theta\).

1

Note, though, in the implementation we will use in scipy.optimize.least_squares(), \(z\) must be entered as an array, even for a single parameter.

2

In most books and other literature on the subject, the vector we have defined as \(\mathbf{z}\) is denoted \(\mathbf{x}\). I find that variables named \(x\) abound and the symbol is overloaded, which is why I defined it as \(\mathbf{z}\) and the design matrix as \(\mathsf{Z}\) instead of \(\mathsf{X}\) (though the design matrix is often also denoted by \(\mathsf{A}\)).