Maximum likelihood estimation for variate-covariate models
In the last part of this lesson, we discussed how to write down a generative model for variate-covariate data. To obtain estimates for the parameters of the model, we can perform maximum likelihood estimation, precisely as we have done for other models; we maximize the log-likelihood function. The optimization is typically performed numerically, as we have already seen in the Numerical maximum likelihood estimation lesson.
For some special cases of variate-covariate models, we can work out analytical results that greatly ease the optimization problem. Fortunately, these special cases arise quite frequently.
In what follows, we will use a general expression for the theoretical model. Let \(y(x;\theta)\) be the theoretical curve with parameters \(\theta = (\theta_1, \theta_2, \ldots)\), and let \((x_i,y_i)\) be a variate-covariate pair of measurements. We assume we have \(n\) of them. Throughout, we will also assume the data points are i.i.d.
To connect this more general notation to the problem of mitotic spindle size we have been working with, \((x_i, y_i) = (d_i, l_i)\), \(\theta = (\gamma, \phi)\), and
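\begin{align}
y(x;\theta) = l(d;\gamma,\phi) = \frac{\gamma d}{\left(1 + (\gamma d/\phi)^3\right)^{1/3}},
\end{align}

the conserved tubulin mass model for spindle length as a function of droplet diameter that we used earlier.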
Normally distributed residuals
The results we derive here apply to the case of Normally distributed (possibly heteroscedastic) residuals. In the most general case, our model is
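\begin{align}
y_i \sim \text{Norm}\left(y(x_i;\theta), \sigma_i\right) \;\;\forall i.
\end{align}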
Actually, this is not the most general case, since in general \(x_i\) may be multi-dimensional, but we are considering scalar \(x_i\) for simplicity. We can write the likelihood as
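\begin{align}
L(\theta, \{\sigma_i\} ; \{x_i\}, \{y_i\}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}}\,\exp\left[-\frac{\left(y_i - y(x_i;\theta)\right)^2}{2\sigma_i^2}\right],
\end{align}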
where \(\{\sigma_i\}\), \(\{x_i\}\), and \(\{y_i\}\) represent the sets of \(\sigma\), \(x\), and \(y\) values, respectively. The log-likelihood is then
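\begin{align}
\ell(\theta, \{\sigma_i\} ; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_{i=1}^n \ln \sigma_i - \frac{1}{2}\sum_{i=1}^n \frac{\left(y_i - y(x_i;\theta)\right)^2}{\sigma_i^2}.
\end{align}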
This is a general log-likelihood for Normally distributed residuals. We will now investigate special cases.
Known \(\sigma_i\) and the Levenberg-Marquardt algorithm
In some cases, we know the values of the \(\sigma_i\)’s a priori. That is, they are not parameters we need to estimate, but are already known. While I often see this in the biological literature, I seldom come across examples like this in my own work. Typically, a measurement for a given \(x_i\) is repeated multiple times and then \(\sigma_i\) is estimated. Nonetheless, treating this case is informative, as it introduces us to a useful algorithm.
Since the \(\sigma_i\)’s are known, they are no longer parameters we need to estimate, and the log-likelihood is
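\begin{align}
\ell(\theta ; \{x_i\}, \{y_i\}, \{\sigma_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_{i=1}^n \ln \sigma_i - \frac{1}{2}\sum_{i=1}^n \frac{\left(y_i - y(x_i;\theta)\right)^2}{\sigma_i^2}.
\end{align}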
Since the first two terms in the above expression for the log-likelihood have no θ-dependence, we may instead seek a value of \(\theta\) that minimizes the quantity
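\begin{align}
\sum_{i=1}^n \frac{\left(y_i - y(x_i;\theta)\right)^2}{\sigma_i^2}.
\end{align}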
This quantity has a convenient structure for applying the Levenberg-Marquardt algorithm, which robustly and efficiently solves optimization problems of the form
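\begin{align}
\min_z \sum_i \left(f_i(z)\right)^2,
\end{align}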
with each \(f_i\) being scalar-valued and \(z\) being scalar or vector-valued. [1] The maximum likelihood estimation problem is of this form, with \(z = \theta\) and
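\begin{align}
f_i(\theta) = \frac{y_i - y(x_i;\theta)}{\sigma_i}.
\end{align}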
So, to find the MLE for the parameters \(\theta\), we can directly employ the Levenberg-Marquardt algorithm. We will discuss implementation using Python-based tools in a forthcoming lesson.
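As a rough preview, the Levenberg-Marquardt algorithm is available through SciPy. Below is a minimal sketch, assuming a made-up linear theoretical function and simulated data with known \(\sigma_i\) (none of which come from the spindle data set); it is meant only to illustrate the structure of the residual function, not the implementation we will use later.

```python
import numpy as np
import scipy.optimize


def model(x, a, b):
    """Hypothetical theoretical curve y(x; theta) with theta = (a, b)."""
    return a + b * x


def resid(theta, x, y, sigma):
    """Scaled residuals f_i(theta) = (y_i - y(x_i; theta)) / sigma_i."""
    return (y - model(x, *theta)) / sigma


# Simulated data with known sigma_i (illustrative only)
rng = np.random.default_rng(3252)
x = np.linspace(0.0, 10.0, 20)
sigma = 0.5 * np.ones_like(x)
y = model(x, 1.0, 2.0) + rng.normal(0.0, sigma)

# Levenberg-Marquardt via SciPy; res.x holds the MLE for theta
res = scipy.optimize.least_squares(
    resid, x0=np.array([1.0, 1.0]), args=(x, y, sigma), method="lm"
)
theta_mle = res.x
```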
Note that if the \(\sigma_i\) values are not known, the MLE problem does not match the class of optimization problems for which the Levenberg-Marquardt algorithm can be used.
Homoscedastic Normal models
Now, consider a homoscedastic model,
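\begin{align}
y_i \sim \text{Norm}\left(y(x_i;\theta), \sigma\right) \;\;\forall i.
\end{align}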
In this case, the log-likelihood function is
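\begin{align}
\ell(\theta, \sigma ; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - y(x_i;\theta)\right)^2.
\end{align}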
Even though we do not know the value of \(\sigma\), we can still obtain a maximum likelihood estimate for \(\theta\) without considering \(\sigma\) because of the structure of the log-likelihood. Regardless of the value of \(\sigma\), the values of \(\theta\) that minimize
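\begin{align}
\text{RSS} \equiv \sum_{i=1}^n \left(y_i - y(x_i;\theta)\right)^2,
\end{align}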
a quantity known as the residual sum of squares, give the maximum likelihood estimate. The MLE for \(\theta\) can therefore be found using the Levenberg-Marquardt algorithm.
Assume for a moment that we have found the MLE \(\theta^*\), which gives a residual sum of squares of \(\text{RSS}^*\). Then, the log-likelihood is
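\begin{align}
\ell = -\frac{n}{2}\,\ln 2\pi - n \ln \sigma - \frac{\text{RSS}^*}{2\sigma^2}.
\end{align}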
The value of \(\sigma\) that maximizes the log-likelihood is then found by differentiating and setting to zero.
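\begin{align}
\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\text{RSS}^*}{\sigma^3} = 0.
\end{align}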
This is solved to give \(\sigma^* = \sqrt{\text{RSS}^*/n}\).
So, we can split the optimization problem into two parts. First, find the values of the parameters \(\theta\) that minimize the residual sum of squares, typically using the Levenberg-Marquardt algorithm. Then, using these values, compute the MLE for \(\sigma\) using the analytical result we have just derived.
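As a sketch of how this two-step procedure might look in code (again using scipy.optimize.least_squares, with model, x, y, and theta_guess as hypothetical inputs rather than anything from this lesson's data set):

```python
import numpy as np
import scipy.optimize


def mle_homoscedastic(model, x, y, theta_guess):
    """MLE for a homoscedastic Normal variate-covariate model.

    Step 1: minimize the RSS using Levenberg-Marquardt.
    Step 2: compute sigma* = sqrt(RSS* / n) from the analytical result.
    """
    res = scipy.optimize.least_squares(
        lambda theta: y - model(x, *theta), theta_guess, method="lm"
    )

    # res.fun holds the residuals at the optimum, so their squared sum is RSS*
    rss = np.sum(res.fun**2)
    sigma_mle = np.sqrt(rss / len(y))

    return res.x, sigma_mle
```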
A very important caveat
Though commonly encountered, the special case we have just considered is not the only variate-covariate model we might choose. The convenience of the Levenberg-Marquardt algorithm and the analytical result for the maximum likelihood estimate of the homoscedastic \(\sigma\) do not hold in general; they apply only to this special case.
Theoretical functions that are linear in the parameters
Like other optimization algorithms we have seen so far, the Levenberg-Marquardt algorithm will only provide a local minimizer; it is not guaranteed to find a global minimum. That said, if the function \(y(x;\theta)\) is linear in the parameters \(\theta\), it can be shown that, under certain restrictions, the minimizer \(\theta^*\) of the RSS is unique, and therefore a local minimum is also a global minimum.
To show this, since \(y(x;\theta)\) is linear in \(\theta\), we can write it in the form \(y(x;\theta) = \mathbf{z} \cdot \theta\). [2] For example, if \(y(x;\theta)\) is a second order polynomial, \(y(x;\theta) = \theta_1 + \theta_2 x + \theta_3 x^2\), then \(\mathbf{z} = (1, x, x^2)\) and \(\theta = (\theta_1, \theta_2, \theta_3)^\mathsf{T}\). Residual \(i\) is \(y_i - \mathbf{z}_i \cdot \theta\), and the residual sum of squares is
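\begin{align}
\text{RSS} = \sum_{i=1}^n \left(y_i - \mathbf{z}_i \theta\right)^2,
\end{align}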
where \(\mathbf{z}_i = (1, x_i, x_i^2)\) for the second order polynomial example, and where we have omitted the \(\cdot\) symbol and will continue to do so for notational concision. We can write the RSS in a more convenient form by defining \(\mathbf{y} = (y_1, y_2, \ldots, y_n)^\mathsf{T}\). Further, for each data point \(i\), there is a \(\mathbf{z}_i\). We define the matrix
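\begin{align}
\mathsf{Z} = \begin{pmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \\ \vdots \\ \mathbf{z}_n \end{pmatrix}.
\end{align}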
The matrix \(\mathsf{Z}\) is sometimes referred to as a design matrix. The residual sum of squares is then
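\begin{align}
\text{RSS} = \left(\mathbf{y} - \mathsf{Z}\theta\right)^\mathsf{T}\left(\mathbf{y} - \mathsf{Z}\theta\right).
\end{align}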
Differentiating the RSS with respect to \(\theta\) gives
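\begin{align}
\nabla_\theta \text{RSS} = -2\,\mathsf{Z}^\mathsf{T}\left(\mathbf{y} - \mathsf{Z}\theta\right).
\end{align}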
Setting this equal to zero and solving for \(\theta^*\) (the value of \(\theta\) where \(\nabla_\theta \text{RSS} = \mathbf{0}\)) yields
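\begin{align}
\mathsf{Z}^\mathsf{T}\mathsf{Z}\,\theta^* = \mathsf{Z}^\mathsf{T}\mathbf{y}.
\end{align}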
Provided the matrix \(\mathsf{Z}^\mathsf{T}\mathsf{Z}\) is nonsingular, this linear system has a unique solution.
This is referred to as ordinary least squares, or OLS, and is a subject in many introductory linear algebra courses. Though \(\theta^*\) may be computed using matrix operations according to the above formula, in practice it is often simply calculated using the Levenberg-Marquardt algorithm, which is typically more numerically stable. Because of what we have just shown, whatever local minimum the algorithm finds is, in fact, a global minimum and therefore the unique MLE for \(\theta\).
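As a minimal sketch (using simulated data for the hypothetical second order polynomial example, not a prescribed implementation), the OLS estimate can be computed either from the normal equations above or with a standard least squares solver:

```python
import numpy as np

# Simulated data for the hypothetical second order polynomial example
rng = np.random.default_rng(3252)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, size=len(x))

# Design matrix with rows z_i = (1, x_i, x_i^2)
Z = np.column_stack((np.ones_like(x), x, x**2))

# theta* from the normal equations (fine when Z^T Z is well conditioned)
theta_normal_eq = np.linalg.solve(Z.T @ Z, Z.T @ y)

# theta* from a least squares solver, which is more numerically stable
theta_lstsq, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
```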
Note that this is not generally true! There may be multiple local minima of the RSS as a function of \(\theta\) (and therefore multiple local maxima of the log-likelihood) if \(y(x;\theta)\) is nonlinear in the parameters \(\theta\).