46  Principal component analysis


46.1 The problem of PCA

We often obtain high-dimensional data sets, in which each observation has many dimensions. In the parlance of practitioners of machine learning, each multidimensional observation is called a sample and each dimension is called a feature. We will use the observation/dimension and sample/feature nomenclature interchangeably. Considering the Palmer penguins data set, each observation (sample) is a penguin, and the dimensions (features) are bill depth, bill length, flipper length, and body mass.
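As a concrete look at the samples and features, here is a minimal snippet that loads the data set. I assume here that the copy bundled with Seaborn is used, though any source of the Palmer penguins measurements will do.

```python
import seaborn as sns

# Load the Palmer penguins data set (using Seaborn's bundled copy) and drop
# penguins with missing measurements
penguins = sns.load_dataset("penguins").dropna()

# Each row is a sample (a penguin); each of these columns is a feature
features = ["bill_depth_mm", "bill_length_mm", "flipper_length_mm", "body_mass_g"]
y = penguins[features].values

# Shape is (number of samples N, number of features D), with D = 4
print(y.shape)
```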

The features, though many, may be related. We can imagine that there are unobservable variables, called latent variables, at play that influence many features. For example, consider the four features of the penguin. They could all scale with the caloric intake of the penguins, in which case the latent variable of caloric intake could be used to predict all four of the features; it is responsible for most of the variation in what is observed.

Let us now formalize this. Let \(\mathbf{y}\in \mathbb{R}^D\) be a set of observable features. There are \(D\) of them. In the case of the penguins, \(D = 4\). Let \(\mathbf{z} \in \mathbb{R}^L\) be the set of latent variables that influence the observations. Typically, \(L < D\) and often \(L \ll D\), and, as such, if we can infer the latent variable \(\mathbf{z}\), we say we have achieved dimensionality reduction. Specifically, we define a model for \(\mathbf{y}\), \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters. So, to achieve dimensionality reduction, we want to find a function \(h(\mathbf{z};\theta)\) that most faithfully reproduces the observed \(\mathbf{y}\).

46.2 Toward building a PCA model: Factor analysis

As we build our model of how the data \(\mathbf{y}\) are generated from the latent variables \(\mathbf{z}\), we make a series of modeling decisions, many of which are intentionally chosen to be simplifying.

46.2.1 Linear \(h(\mathbf{z};\theta)\)

As is commonly done, we will restrict the functions \(h(\mathbf{z};\theta)\) we choose to be linear functions. Further, we stipulate that \(h(\mathbf{z};\theta)\) serves to orthogonally project \(\mathbf{z}\) onto \(\mathbf{y}\).

\[\begin{align} h(\mathbf{z};\mathsf{W}, \boldsymbol{\mu}) = \mathsf{W}\cdot \mathbf{z} + \boldsymbol{\mu}, \end{align} \]

where the parameters \(\theta\) are given by the \(D \times L\) matrix \(\mathsf{W}\), commonly referred to as a loading matrix, and \(\boldsymbol{\mu}\), which is an additive term that sets the value of \(\mathbf{y}\) when \(\mathbf{z} = 0\).

46.2.2 i.i.d. Normally distributed data points

In addition to choosing a linear function \(h(\mathbf{z};\mathsf{W}, \boldsymbol{\mu})\), we assume an i.i.d. Normal generative model,

\[\begin{align} \mathbf{y}_i \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\;\forall i, \end{align} \]

where \(\mathsf{\Psi}\) is a \(D\times D\) covariance matrix describing how the observed data vary from the linear model.

46.2.3 A diagonal covariance matrix

We typically take \(\mathsf{\Psi}\) to be diagonal; the residuals of the observations are modeled to be uncorrelated (though the observations are correlated through the latent variables \(\mathbf{z}\)). We therefore will define the vector \(\boldsymbol{\sigma}\) to be the diagonal elements of \(\mathsf{\Psi}\) such that \(\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2)\), where \(\boldsymbol{\sigma}^2\) denotes element-wise squaring of the elements of \(\boldsymbol{\sigma}\).

46.2.4 Modeling the latent variables

The latent variables, though unmeasured, are still random variables, as they vary meaningfully from experiment to experiment. As such, we need to specify a generative model for them. We take the latent variables to be \(L\)-variate Normally distributed.   \[\begin{align} \mathbf{z}_i \sim \text{Norm}(\boldsymbol{\mu}_0, \mathsf{\Sigma}_0)\;\forall i. \end{align}\]

Note that since we are using a linear model and \(\mathbf{z}_i\) gets transformed according to \(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}\), we can absorb \(\boldsymbol{\mu}_0\) and \(\mathsf{\Sigma}_0\) into \(\mathsf{W}\) and \(\boldsymbol{\mu}\). Absorbing these as described, we can write

\[\begin{align} \mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \end{align} \]

where \(\mathsf{I}\) is the \(L\times L\) identity matrix. Our model is then

\[\begin{align} &\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2) \\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\; \forall i \end{align} \]
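To make the generative model concrete, below is a minimal sketch of how data could be simulated from it. The specific values of \(D\), \(L\), \(N\), and the parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical dimensions and parameters, for illustration only
D, L, N = 4, 2, 200
W = rng.normal(size=(D, L))          # loading matrix
mu = rng.normal(size=D)              # location parameter
sigma = np.abs(rng.normal(size=D))   # residual standard deviation per feature
Psi = np.diag(sigma**2)              # diagonal covariance matrix

# z_i ~ Norm(0, I) for each of the N data points
z = rng.standard_normal(size=(N, L))

# y_i ~ Norm(W . z_i + mu, Psi)
y = z @ W.T + mu + rng.multivariate_normal(np.zeros(D), Psi, size=N)

print(y.shape)   # (N, D)
```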

46.2.5 Restrictions on the loading matrix

We will not go into too much detail here, but in general the loading matrix is nonidentifiable. There are several approaches to deal with the nonidentifiability, with a popular one being to enforce that the loading matrix \(\mathsf{W}\) is semiorthogonal and unitary such that the columns are orthonormal vectors and \(\mathsf{W}^\mathsf{T}\cdot \mathsf{W} = \mathsf{I}\), the identity matrix. If \(\mathbf{w}_j\) is a column of \(\mathsf{W}\), then

\[\begin{align} \mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_k = \delta_{jk}, \end{align} \]

where \(\delta_{jk}\) is the Kronecker delta, which is equal to one when \(j = k\) and zero otherwise.
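To make the semiorthogonality constraint concrete, here is a quick numerical check. Using a QR decomposition to build a hypothetical semiorthogonal matrix is just one convenient way to generate an example.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)
D, L = 4, 2

# One convenient way to build a hypothetical semiorthogonal loading matrix:
# take the Q factor of a QR decomposition of a random D x L matrix
W, _ = np.linalg.qr(rng.normal(size=(D, L)))

# The columns are orthonormal, so W^T . W is the L x L identity matrix
print(np.allclose(W.T @ W, np.eye(L)))   # True
```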

For clarity and for reference, the dimensionalities of the variables are shown in the table below. The index \(i\) indexes data points, and there are \(N\) of them.

| Variable | Description | Dimension |
|---|---|---|
| \(\boldsymbol{\mu}\) | location parameter of measurements | \(D\)-vector |
| \(\mathsf{W}\) | loading matrix | \(D \times L\) semiorthogonal unitary matrix |
| \(\boldsymbol{\sigma}\) | standard deviation for each measurement | positive \(D\)-vector |
| \(\mathsf{\Psi}\) | matrix representation of \(\boldsymbol{\sigma}^2\) | \(D \times D\) nonnegative diagonal matrix |
| \(\mathbf{z}_i\) | latent variable for datum \(i\) | \(L\)-vector |
| \(\mathbf{y}_i\) | datum \(i\) | \(D\)-vector |

The model described above is often referred to as a factor analysis model. We have already made simplifications to a latent variable model, namely that the dependence of the observations on the latent variables is linear with a semiorthogonal unitary loading matrix, that both the observations and the latent variables are Normally distributed, and that the multivariate Normal distribution of the observations has a diagonal covariance matrix.

46.3 Probabilistic PCA

We consider first the special case where each entry in \(\boldsymbol{\sigma}\) is the same, say \(\sigma\). In that case, \(\mathsf{\Psi} = \sigma^2\mathsf{I}\). We further restrict \(\mathsf{W}\) to have orthonormal columns. The assumption here is that each data point is drawn from a Normal distribution with the same variance, but with mean \(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}\). This special case is called probabilistic principal component analysis (PPCA).

The full model is

\[\begin{align} &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \sigma^2 \mathsf{I})\; \forall i \end{align} \]

Probabilistic PCA is a reasonable model; it assumes that each data point results from a linear combination of the latent variables and is uncorrelated with all other data points (except through the latent variables) with homoscedasticity.

46.3.1 Maximum likelihood for PPCA

PPCA has the added advantage that the maximum likelihood estimates for \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\sigma\) may be computed analytically. The result is

\[\begin{align} &\boldsymbol{\mu}_\mathrm{MLE} = \bar{\mathbf{y}},\\[1em] &\mathsf{W}_\mathrm{MLE} = \mathsf{U}\cdot(\mathsf{\Lambda} - \sigma^2\mathsf{I})^{\frac{1}{2}} \\[1em] &\sigma_\mathrm{MLE}^2 = \frac{1}{D - L}\sum_{j=L+1}^D \lambda_j, \end{align} \tag{46.1}\]

where \(\lambda_1, \ldots, \lambda_D\) are the eigenvalues of the empirical covariance matrix (the plug-in estimate for the covariance matrix),

\[\begin{align} \hat{\mathsf{\Sigma}} = \frac{1}{N}\sum_{i=1}^N(\mathbf{y}_i - \bar{\mathbf{y}})\cdot(\mathbf{y}_i - \bar{\mathbf{y}})^\mathsf{T}, \end{align}\]

ordered from largest to smallest, \(\mathsf{\Lambda} = \text{diag}((\lambda_1, \ldots, \lambda_L))\), and \(\mathsf{U}\) is a matrix whose columns are the eigenvectors corresponding to eigenvalues \(\lambda_1, \ldots, \lambda_L\).
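As a sketch of how these MLEs could be computed with NumPy, assuming a data matrix `y` of shape (N, D) and a chosen number of latent dimensions `L` (the function name is my own, for illustration):

```python
import numpy as np

def ppca_mle(y, L):
    """Maximum likelihood estimates for PPCA (sketch of Eq. 46.1)."""
    N, D = y.shape
    mu_mle = y.mean(axis=0)

    # Empirical covariance matrix (plug-in estimate)
    Sigma_hat = (y - mu_mle).T @ (y - mu_mle) / N

    # Eigendecomposition; np.linalg.eigh returns eigenvalues in ascending
    # order, so reverse to get them from largest to smallest
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # MLE for sigma^2: average of the D - L smallest eigenvalues
    sigma2_mle = eigvals[L:].sum() / (D - L)

    # MLE for the loading matrix
    U = eigvecs[:, :L]
    Lam = np.diag(eigvals[:L])
    W_mle = U @ np.sqrt(Lam - sigma2_mle * np.eye(L))

    return mu_mle, W_mle, sigma2_mle
```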

46.3.2 Recognition distribution for PPCA

Though we have MLEs for \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\sigma\), we still do not know how these relate to the latent variables \(\mathbf{z}\). We seek a recognition distribution for \(\mathbf{z}_i\), \(f(\mathbf{z}_i \mid \mathbf{y}_i)\). This is the distribution of the latent parameters conditioned on the high-dimensional measurements \(\mathbf{y}_i\). To find it, we can use Bayes’s theorem.

\[\begin{align} f(\mathbf{z}_i \mid \mathbf{y}_i) = \frac{f(\mathbf{y}_i\mid \mathbf{z}_i)\,f(\mathbf{z}_i)}{f(\mathbf{y}_i)}. \end{align} \]

Because the denominator has no \(\mathbf{z}_i\)-dependence, it is a normalization constant. The recognition distribution is then proportional to a product of two Normal distributions, since

\[\begin{align} &\mathbf{y}_i \mid \mathbf{z}_i \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi}),\\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I}). \end{align} \]

After some algebraic grunge, we can work out that the recognition distribution parametrized by the MLE values of \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\sigma\) is given by

\[\begin{align} \mathbf{z}_i\mid \mathbf{y}_i \sim \text{Norm}(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}, \mathsf{\Sigma}_{\mathbf{z}}), \end{align} \tag{46.2}\]

where

\[\begin{align} \mathsf{\Sigma}_{\mathbf{z}} = \sigma_\mathrm{MLE}^2\,(\sigma_\mathrm{MLE}^{2}\,\mathsf{I} + \mathsf{W}_\mathrm{MLE}^\mathsf{T}\cdot\mathsf{W}_\mathrm{MLE})^{-1} \end{align} \tag{46.3}\]

and

\[\begin{align} \boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = (\sigma_\mathrm{MLE}^{2}\,\mathsf{I} + \mathsf{W}_\mathrm{MLE}^\mathsf{T}\cdot\mathsf{W}_\mathrm{MLE})^{-1}\cdot\mathsf{W}_\mathrm{MLE}^\mathsf{T}\cdot(\mathbf{y}_i - \boldsymbol{\mu}_\mathrm{MLE}). \end{align} \tag{46.4}\]

The “probabilistic” part of the name probabilistic PCA comes from the fact that we write a probability distribution for the latent parameters, which is given by equations 46.2, 46.3, and 46.4, above.
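As a sketch, the parameters of the recognition distribution can be computed from the MLEs as follows. The function name, and the hypothetical `ppca_mle` outputs it takes as inputs, are my own.

```python
import numpy as np

def recognition_params(y_i, mu_mle, W_mle, sigma2_mle):
    """Location and covariance of z_i | y_i (sketch of Eqs. 46.3 and 46.4)."""
    L = W_mle.shape[1]

    # Common matrix inverse appearing in both Eq. 46.3 and Eq. 46.4
    M_inv = np.linalg.inv(sigma2_mle * np.eye(L) + W_mle.T @ W_mle)

    Sigma_z = sigma2_mle * M_inv
    mu_z_given_y = M_inv @ W_mle.T @ (y_i - mu_mle)

    return mu_z_given_y, Sigma_z
```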

46.4 PCA as a limit of PPCA

Consider now the limit of PPCA where \(\sigma\) goes to zero. That is, we consider the limit where there is no noise in the measured data. The scale parameter of the recognition distribution vanishes and the location parameter becomes

\[\begin{align} \boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = (\mathsf{W}_\mathrm{MLE}^\mathsf{T}\cdot\mathsf{W}_\mathrm{MLE})^{-1}\cdot\mathsf{W}_\mathrm{MLE}^\mathsf{T}\cdot(\mathbf{y}_i - \bar{\mathbf{y}}), \end{align} \tag{46.5}\]

with the MLE for the loading matrix given by

\[\begin{align} \mathsf{W}_\mathrm{MLE} = \mathsf{U}\cdot\mathsf{\Lambda}^{\frac{1}{2}}. \end{align} \]

Note that because the scale parameter of the recognition distribution is artificially set to zero in the PCA model, \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}\) is our estimate for \(\mathbf{z}_i\). In this setting, we refer to the estimate of the latent variables \(\mathbf{z}_i\) as the principal components. We could get confidence regions for the principal components by bootstrapping, but that is seldom done. After all, we artificially set \(\sigma = 0\) to get the principal components. Instead of bootstrapping, we could just perform PPCA and get the distribution of the latent parameter values. It is important to note, though, that these are themselves functions of the MLEs of the parameters \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\sigma\).

As a final note, looking at the form of our estimate of the latent variables given in Equation 46.5, we see that it is the solution to the least squares problem

\[\begin{align} \mathbf{z}_i = \text{arg min}_{\mathbf{z}_i} \left\lVert (\mathbf{y}_i - \bar{\mathbf{y}}) - \mathsf{W}_\mathrm{MLE}\cdot\mathbf{z}_i\right\rVert_2^2. \end{align} \]

The squared two-norm in the above equation is sometimes referred to as a loss function. The goal of PCA is to find \(\mathsf{W}_\mathrm{MLE}\) and \(\mathbf{z}_i\) that minimize it, as we have just shown. The idea is that the optimal projection of \(\mathbf{z}_i\) onto \(\mathbf{y}_i\) is the one that minimizes the difference (with the mean subtracted off).
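To see this equivalence numerically, one could check that NumPy's least squares solver gives the same result as the closed form of Equation 46.5. The loading matrix and centered data point below are hypothetical, generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical loading matrix and centered data point
W_mle = rng.normal(size=(4, 2))
y_centered = rng.normal(size=4)

# Solve the least squares problem min_z || y_centered - W_mle . z ||_2^2
z_lstsq, *_ = np.linalg.lstsq(W_mle, y_centered, rcond=None)

# Closed-form solution of Equation 46.5
z_closed = np.linalg.solve(W_mle.T @ W_mle, W_mle.T @ y_centered)

print(np.allclose(z_lstsq, z_closed))   # True
```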

46.4.1 Centering and scaling, the interpretation of \(\hat{\mathsf{\Sigma}}\), and maximal variance

Assume for a moment that the mean of \(\mathbf{y}\) is zero,

\[\begin{align} \bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i = \mathbf{0}, \end{align} \]

and the variance of each of the \(D\) entries in \(\mathbf{y}\) is one,

\[\begin{align} s_j = \frac{1}{N}\sum_{i=1}^N (y_{i,j} - \bar{y}_j)^2 = \frac{1}{N}\sum_{i=1}^N y_{i,j}^2 = 1\;\forall j\in[1, D]. \end{align} \]

We can always center and scale our data such that they have a mean of 0 and a variance of 1 by subtracting the plug-in estimate for the mean from every data point and dividing the result by the plug-in estimate for the standard deviation.

In that case, the entries of \(\hat{\mathsf{\Sigma}}\) are

\[\begin{align} \hat{\Sigma}_{jk} = \frac{1}{N}\sum_{i=1}^N y_{ij}\, y_{ik} = \frac{1}{N}\sum_{i=1}^N (y_{ij} - \bar{y_j})(y_{ik} - \bar{y_k}), \end{align} \tag{46.6}\]

which we recognize as the elements of the empirical covariance matrix (or, equivalently, since the diagonal entries are all one, an empirical correlation matrix). Thus, \(\hat{\mathsf{\Sigma}}\) is the plug-in estimate for the covariance matrix when using centered-and-scaled data.
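As a quick numerical check of this interpretation (with a hypothetical data matrix standing in for real measurements), the empirical covariance matrix of centered-and-scaled data coincides with the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)
y = rng.normal(size=(200, 4))   # hypothetical data matrix, shape (N, D)

# Center and scale using the plug-in (1/N) estimates
y_cs = (y - y.mean(axis=0)) / y.std(axis=0)

# Empirical covariance of the centered-and-scaled data...
Sigma_hat = y_cs.T @ y_cs / len(y_cs)

# ...matches the correlation matrix of the original data
print(np.allclose(Sigma_hat, np.corrcoef(y, rowvar=False)))   # True
```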

Consider a principal component \(z_{ij}\). The plug-in estimate for the variance of this is

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \frac{1}{N}\sum_{i=1}^N z_{ij}^2 - \left(\frac{1}{N}\sum_{i=1}^Nz_{ij}\right)^2. \end{align} \]

Let us compute the first moment.

\[\begin{align} \frac{1}{N}\sum_{i=1}^Nz_{ij} = \frac{1}{N}\sum_{i=1}^N \mathbf{w}_{j}^\mathsf{T} \cdot \mathbf{y}_i = \mathbf{w}_{j}^\mathsf{T}\cdot\left(\frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\right) = 0, \end{align}\]

where we have recognized the average in parentheses to be zero since the data are centered. Therefore, the plug-in estimate for the variance is

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \frac{1}{N}\sum_{i=1}^N z_{ij}^2 = \frac{1}{N}\sum_{i=1}^N(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i)^2. \end{align}\]

This is a quadratic form involving the covariance matrix \(\hat{\mathsf{\Sigma}}\).

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j. \end{align}\]
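This identity is easy to verify numerically; below is a small check with a hypothetical centered-and-scaled data matrix and an arbitrary unit vector \(\mathbf{w}_j\).

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical centered-and-scaled data and an arbitrary unit vector w_j
y_cs = rng.normal(size=(200, 4))
y_cs = (y_cs - y_cs.mean(axis=0)) / y_cs.std(axis=0)
w_j = np.array([0.5, 0.5, 0.5, 0.5])

Sigma_hat = y_cs.T @ y_cs / len(y_cs)
z_j = y_cs @ w_j   # projections z_ij = w_j^T . y_i

# Plug-in variance of the projection equals the quadratic form
print(np.allclose(np.mean(z_j**2), w_j @ Sigma_hat @ w_j))   # True
```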

Now recall the loss function. Averaging it over data points, substituting \(\mathbf{z}_i = \mathsf{W}^\mathsf{T}\cdot\mathbf{y}_i\), and using the orthonormality of the columns of \(\mathsf{W}\) (and the fact that the data are centered), we can write it as

\[\begin{align} \text{loss}(\mathsf{W}) = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \text{constant} - \sum_{j=1}^L \widehat{\mathbb{V}(z_{ij})}. \end{align}\]

Therefore, finding the transformation \(\mathbf{w}_j\) that minimizes the loss function is the same as finding the transformation that maximizes the variance of the projection. This is why the principal components are often referred to as the directions that have maximal variance.

As a final note, I mention that it is important to center and scale the data set before performing PCA because a given direction may have high variance simply because of the scale of the measurement. This is generally not the variance we wish to interpret.

46.4.2 Prescription for PCA

Following the above analysis, we arrive at a prescription for PCA. After choosing the number of principal components \(L\), do the following (a code sketch implementing these steps follows the list).

  1. Center and scale the data set by subtracting the mean and dividing by the standard deviation along each dimension.
  2. Compute the empirical covariance matrix.
  3. Compute the first \(L\) eigenvalues and eigenvectors.
  4. Construct the loading matrix \(\mathsf{W}\) from the eigenvectors corresponding to the \(L\) largest eigenvalues.
  5. The principal components are \(\mathbf{z}_i = \mathsf{W}^\mathsf{T} \cdot \mathbf{y}_i \;\forall i\), computed using the centered and scaled data.
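The sketch below implements this prescription with NumPy; the function name and the choice to return both the principal components and the loading matrix are mine, for illustration.

```python
import numpy as np

def pca(y, L):
    """PCA following the prescription above (sketch). y has shape (N, D)."""
    # 1. Center and scale each feature
    y_cs = (y - y.mean(axis=0)) / y.std(axis=0)

    # 2. Empirical covariance matrix
    Sigma_hat = y_cs.T @ y_cs / len(y_cs)

    # 3.-4. Eigendecomposition; keep the eigenvectors of the L largest
    #       eigenvalues as the columns of the loading matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
    W = eigvecs[:, ::-1][:, :L]

    # 5. Principal components: z_i = W^T . y_i for each centered-and-scaled datum
    z = y_cs @ W

    return z, W
```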

46.4.3 Nonidentifiability of PCA

The loading matrix \(\mathsf{W}\) is composed of the eigenvectors of the empirical covariance matrix, and therefore its columns are orthogonal, since the covariance matrix is symmetric. However, any of these eigenvectors can be multiplied by a scalar (which is true in general for an eigenvector; if \(\mathbf{v}\) is an eigenvector of \(\mathsf{A}\), then so is \(\alpha \mathbf{v}\) for real \(\alpha \ne 0\)). We can insist that the columns of \(\mathsf{W}\) form an orthonormal basis, such that the magnitude of each eigenvector is 1, which is what is commonly done. However, any given column of \(\mathsf{W}\) may still be multiplied by \(-1\), resulting in a nonidentifiability. We therefore cannot uniquely determine the loading matrix \(\mathsf{W}\), nor, therefore, a unique set of principal components.
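As a small numerical illustration of the sign nonidentifiability (using hypothetical data), flipping the sign of a column of \(\mathsf{W}\) leaves the projection onto the principal subspace, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical centered-and-scaled data
y_cs = rng.normal(size=(200, 4))
y_cs = (y_cs - y_cs.mean(axis=0)) / y_cs.std(axis=0)

# Loading matrix from the top two eigenvectors of the empirical covariance
Sigma_hat = y_cs.T @ y_cs / len(y_cs)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
W = eigvecs[:, ::-1][:, :2]

# Flip the sign of the second column
W_flipped = W * np.array([1.0, -1.0])

# The projection onto the principal subspace is unchanged, so both loading
# matrices describe the data equally well; only the signs of the second set
# of principal components flip.
print(np.allclose(W @ W.T, W_flipped @ W_flipped.T))   # True
```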