Maximum likelihood estimation for variate-covariate models
-----------------------------------------------------------

In the last part of this lesson, we discussed how to write down a generative model for variate-covariate models. To obtain estimates for the parameters of the model, we can perform maximum likelihood estimation, precisely as we have done for other models; we find the parameter values that maximize the log likelihood function. The optimization is typically done numerically, as we have already seen in the :ref:`Numerical maximum likelihood estimation` lesson. For some special cases of variate-covariate models, we can work out analytical results that can greatly ease the optimization problem. Fortunately, these special cases arise quite frequently.

In what follows, we will use a general expression for the theoretical model. Let :math:`y(x;\theta)` be the theoretical curve with parameters :math:`\theta = (\theta_1, \theta_2, \ldots)`, and let :math:`(x_i, y_i)` be a variate-covariate pair of measurements. We assume we have :math:`n` of them. We will also assume throughout that the data points are i.i.d. To connect this more general notation to the problem of mitotic spindle size we have been working with, :math:`(x_i, y_i) = (d_i, l_i)`, :math:`\theta = (\gamma, \phi)`, and

.. math::

    \begin{align}
    y(x;\theta) = l(d;\gamma, \phi) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}}.
    \end{align}


Normally distributed residuals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The results we derive here apply to the case where we have Normally distributed (possibly heteroscedastic) residuals. In the most general case, our model is

.. math::

    \begin{align}
    y_i \sim \text{Norm}(y(x_i;\theta), \sigma_i).
    \end{align}

Actually, this is not the most general case, since in general :math:`x_i` may be multi-dimensional, but we are considering scalar :math:`x_i` for simplicity. We can write the likelihood as

.. math::

    \begin{align}
    L(\theta, \{\sigma_i\}; \{x_i\}, \{y_i\}) = \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}}\,\exp\left[-\frac{(y_i - y(x_i;\theta))^2}{2\sigma_i^2}\right],
    \end{align}

where :math:`\{\sigma_i\}`, :math:`\{x_i\}`, and :math:`\{y_i\}` represent the sets of :math:`\sigma`, :math:`x`, and :math:`y` values, respectively. The log likelihood is then

.. math::

    \begin{align}
    \ell(\theta, \{\sigma_i\}; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_i \ln \sigma_i - \frac{1}{2}\sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2.
    \end{align}

This is a general log likelihood for Normally distributed residuals. We will now investigate special cases.
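Before turning to those special cases, it may help to see this general log likelihood expressed as code. Below is a minimal sketch for the spindle model, assuming the droplet diameters, spindle lengths, and :math:`\sigma_i` values are stored in Numpy arrays; the function names ``spindle_length`` and ``log_likelihood`` are illustrative choices, not part of any established API.

.. code-block:: python

    import numpy as np
    import scipy.stats

    def spindle_length(d, gamma, phi):
        """Theoretical spindle length l(d; gamma, phi)."""
        return gamma * d / (1 + (gamma * d / phi)**3)**(1 / 3)

    def log_likelihood(params, d, ell, sigma):
        """Log likelihood for Normally distributed residuals.

        `sigma` may be a scalar (homoscedastic) or an array of
        per-measurement values (heteroscedastic).
        """
        gamma, phi = params
        mu = spindle_length(d, gamma, phi)
        return np.sum(scipy.stats.norm.logpdf(ell, mu, sigma))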
Known :math:`\sigma_i` and the Levenberg-Marquardt algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In some cases, we know the values of the :math:`\sigma_i`'s a priori. That is, they are not parameters we need estimates for, but are already known. While I often see this in the biological literature, I seldom come across examples like this in my own work. Typically, a measurement for a given :math:`x_i` is repeated multiple times and then :math:`\sigma_i` is estimated. Nonetheless, treating this case is informative, as it introduces a useful algorithm.

The log likelihood no longer depends on the :math:`\sigma_i`'s (since they are known), and is

.. math::

    \begin{align}
    \ell(\theta; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - \sum_i \ln \sigma_i - \frac{1}{2}\sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2.
    \end{align}

Since the first two terms in the above expression for the log likelihood have no :math:`\theta`-dependence, we may instead seek a value of :math:`\theta` that *minimizes* the quantity

.. math::

    \begin{align}
    \sum_i \left(\frac{y_i - y(x_i;\theta)}{\sigma_i}\right)^2.
    \end{align}

This has a convenient structure to utilize the **Levenberg-Marquardt algorithm**. It robustly and efficiently solves optimization problems of the form

.. math::

    \begin{align}
    \arg \min_z \sum_i \left(f_i(z)\right)^2,
    \end{align}

with each :math:`f_i` being scalar-valued and :math:`z` being scalar or vector-valued. [#f1]_ The maximum likelihood estimation problem is of this form, with :math:`z = \theta` and

.. math::

    \begin{align}
    f_i(\theta) = \frac{y_i - y(x_i;\theta)}{\sigma_i}.
    \end{align}

So, to find the MLE for the parameters :math:`\theta`, we can directly employ the Levenberg-Marquardt algorithm. We will discuss implementation using Python-based tools in a forthcoming lesson. Note that if the :math:`\sigma_i` values are not known, the MLE problem does not match the class of optimization problems for which the Levenberg-Marquardt algorithm can be used.


Homoscedastic Normal models
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, consider a homoscedastic model,

.. math::

    \begin{align}
    y_i \sim \text{Norm}(y(x_i;\theta), \sigma) \;\forall i.
    \end{align}

In this case, the log likelihood function is

.. math::

    \begin{align}
    \ell(\theta, \sigma; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2}\sum_i \left(y_i - y(x_i;\theta)\right)^2.
    \end{align}

Even though we do not know the value of :math:`\sigma`, we can still get a maximum likelihood estimate for :math:`\theta` without considering :math:`\sigma` because of the structure of the log likelihood. Regardless of the value of :math:`\sigma`, the values of :math:`\theta` that minimize

.. math::

    \begin{align}
    \text{RSS} \equiv \sum_i \left(y_i - y(x_i;\theta)\right)^2,
    \end{align}

a quantity known as the **residual sum of squares**, give the maximum likelihood estimate. The MLE for :math:`\theta` can therefore be found using the Levenberg-Marquardt algorithm.

Assume for a moment we find the MLE :math:`\theta^*`, which gives us a residual sum of squares of :math:`\text{RSS}^*`. Then, the log likelihood is

.. math::

    \begin{align}
    \ell(\theta^*, \sigma; \{x_i\}, \{y_i\}) = -\frac{n}{2}\,\ln 2\pi - n\ln\sigma - \frac{\text{RSS}^*}{2\sigma^2}.
    \end{align}

The value of :math:`\sigma` that maximizes the log likelihood is then found by differentiating and setting to zero,

.. math::

    \begin{align}
    \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\text{RSS}^*}{\sigma^3} = 0.
    \end{align}

This is solved to give :math:`\sigma^* = \sqrt{\text{RSS}^*/n}`. So, we can split the optimization problem into two parts. First, find the values of the parameters :math:`\theta` that minimize the residual sum of squares, typically using the Levenberg-Marquardt algorithm. Then, using these values, compute the MLE for :math:`\sigma` using the analytical result we have just derived.
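To make this two-step procedure concrete, here is a minimal sketch for the spindle model using ``scipy.optimize.least_squares()``, which we will discuss in more detail in a forthcoming lesson. The function names and the initial guess ``params_0`` are illustrative assumptions, not prescribed choices.

.. code-block:: python

    import numpy as np
    import scipy.optimize

    def resid(params, d, ell):
        """Residuals y_i - y(x_i; theta) for the spindle model."""
        gamma, phi = params
        return ell - gamma * d / (1 + (gamma * d / phi)**3)**(1 / 3)

    def fit_spindle_model(d, ell, params_0=(0.9, 35.0)):
        """Two-step MLE: minimize the RSS, then compute sigma* analytically."""
        # Step 1: Levenberg-Marquardt minimization of the residual sum of squares
        res = scipy.optimize.least_squares(
            resid, params_0, args=(d, ell), method="lm"
        )

        # Residual sum of squares at the minimum (res.fun holds the residuals)
        rss_star = np.sum(res.fun**2)

        # Step 2: analytical MLE for sigma in the homoscedastic model
        sigma_star = np.sqrt(rss_star / len(d))

        return res.x, sigma_star

The returned ``res.x`` holds the MLE :math:`(\gamma^*, \phi^*)`, and ``sigma_star`` is the analytical MLE for :math:`\sigma` derived above.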
A very important caveat
++++++++++++++++++++++++

Though commonly encountered, the special case we have just considered is not always the variate-covariate model we might choose. The convenience of the Levenberg-Marquardt algorithm and the analytical result for the maximum likelihood estimate of the homoscedastic :math:`\sigma` therefore do not carry over in general; they apply only to this special case.


Theoretical functions that are linear in the parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Like other optimization algorithms we have seen so far, the Levenberg-Marquardt algorithm will only provide a *local* minimizer; it is not guaranteed to find a global minimum. That said, if the function :math:`y(x;\theta)` is linear in the parameters :math:`\theta`, it can be shown that, under certain restrictions, the minimizer :math:`\theta^*` of the RSS is unique, and therefore a local minimum is also a global minimum.

To show this, since :math:`y(x;\theta)` is linear in :math:`\theta`, we can write it in the form :math:`y(x;\theta) = \mathbf{z} \cdot \theta`. [#f2]_ For example, if :math:`y(x;\theta)` is a second order polynomial, :math:`y(x;\theta) = \theta_1 + \theta_2 x + \theta_3 x^2`, then :math:`\mathbf{z} = (1, x, x^2)` and :math:`\theta = (\theta_1, \theta_2, \theta_3)^\mathsf{T}`. Residual :math:`i` is :math:`y_i - \mathbf{z}_i \cdot \theta`, and the residual sum of squares is

.. math::

    \begin{align}
    \text{RSS} = \sum_i (y_i - \mathbf{z}_i \theta)^2,
    \end{align}

where :math:`\mathbf{z}_i = (1, x_i, x_i^2)` for the second order polynomial example, and we have omitted the :math:`\cdot` symbol (and will continue to do so) for notational concision.

We can write this in a more convenient form by defining :math:`\mathbf{y} = (y_1, y_2, \ldots, y_n)^\mathsf{T}`. Further, for each data point :math:`i`, there is a :math:`\mathbf{z}_i`. We define the matrix

.. math::

    \begin{align}
    \mathsf{Z} = \begin{pmatrix}
    \mathbf{z}_1\\
    \mathbf{z}_2\\
    \vdots\\
    \mathbf{z}_n
    \end{pmatrix}.
    \end{align}

The matrix :math:`\mathsf{Z}` is sometimes referred to as a **design matrix**. The residual sum of squares is then

.. math::

    \begin{align}
    \text{RSS} &= (\mathbf{y} - \mathsf{Z} \theta)^\mathsf{T}(\mathbf{y} - \mathsf{Z} \theta) \\[1em]
    &= \mathbf{y}^\mathsf{T} \mathbf{y} - \mathbf{y}^\mathsf{T} \mathsf{Z} \theta - (\mathsf{Z} \theta)^\mathsf{T}\mathbf{y} + (\mathsf{Z}\theta)^\mathsf{T} (\mathsf{Z}\theta) \\[1em]
    &= \mathbf{y}^\mathsf{T} \mathbf{y} - 2\theta^\mathsf{T}\mathsf{Z}^\mathsf{T} \mathbf{y} + \theta^\mathsf{T}\mathsf{Z}^\mathsf{T} \mathsf{Z}\theta.
    \end{align}

Differentiating the RSS with respect to :math:`\theta` gives

.. math::

    \begin{align}
    \nabla_\theta \text{RSS} = -2\mathsf{Z}^\mathsf{T}\mathbf{y} + 2\mathsf{Z}^\mathsf{T} \mathsf{Z}\theta.
    \end{align}

Setting this equal to zero and solving for :math:`\theta^*` (the value of :math:`\theta` where :math:`\nabla_\theta \text{RSS} = \mathbf{0}`) yields

.. math::

    \begin{align}
    \theta^* = (\mathsf{Z}^\mathsf{T} \mathsf{Z})^{-1}\mathsf{Z}^\mathsf{T}\mathbf{y}.
    \end{align}

Provided the matrix :math:`\mathsf{Z}^\mathsf{T}\mathsf{Z}` is nonsingular, this linear system has a unique solution. This is referred to as **ordinary least squares**, or OLS, and is a subject in many introductory linear algebra courses. Though :math:`\theta^*` may be computed using matrix operations according to the above formula, in practice it is often simply calculated using the Levenberg-Marquardt algorithm, which is typically more numerically stable. Because of what we have just shown, whatever local minimum the algorithm finds is, in fact, a global minimum and therefore the unique MLE for :math:`\theta`. Note that this is not generally true! There may be local minima of the RSS as a function of :math:`\theta` (and therefore local maxima of the log likelihood) if :math:`y(x;\theta)` is nonlinear in the parameters :math:`\theta`.
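As an illustration of OLS in practice, here is a minimal sketch for the second order polynomial example, using NumPy's least squares solver rather than an explicit matrix inverse (the function name ``ols_quadratic`` is an illustrative choice).

.. code-block:: python

    import numpy as np

    def ols_quadratic(x, y):
        """OLS estimate for y(x; theta) = theta_1 + theta_2 x + theta_3 x^2."""
        # Design matrix Z with rows z_i = (1, x_i, x_i**2)
        Z = np.column_stack((np.ones_like(x), x, x**2))

        # Solve the least squares problem; this is equivalent to
        # theta* = (Z^T Z)^{-1} Z^T y when Z^T Z is nonsingular, but
        # avoids forming the matrix inverse explicitly.
        theta_star, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)

        return theta_star

Because the model is linear in :math:`\theta`, the :math:`\theta^*` returned here is the unique global minimizer of the RSS, provided :math:`\mathsf{Z}^\mathsf{T}\mathsf{Z}` is nonsingular.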
.. [#f1] Note, though, that in the implementation we will use, ``scipy.optimize.least_squares()``, :math:`z` must be entered as an array, even for a single parameter.

.. [#f2] In most books and other literature on the subject, the vector we have defined as :math:`\mathbf{z}` is denoted :math:`\mathbf{x}`. I find that variables named :math:`x` abound and the symbol is overloaded, which is why I defined it as :math:`\mathbf{z}` and the design matrix as :math:`\mathsf{Z}` instead of :math:`\mathsf{X}` (though the design matrix is often also denoted by :math:`\mathsf{A}`).