In the last few posts^{[1][2][3]} we have explored the subject of eigensystems, the set of vectors \(\mathbf{v}_i\) and associated numbers \(\lambda_i\) that for a given matrix \(\mathbf{M}\) satisfy
A common issue in the statistical analysis of data is that of redundancy, when some measurements are more or less entirely dependent upon others. For example, if we took a survey of the age, height and shoe size of a population, we should expect that the age of an individual would largely determine their height and shoe size since children tend to be shorter and have smaller feet than adults.
If we can identify such redundancies we can be sure to concentrate our analysis upon the truly important features of the data. Perhaps surprisingly, the rather abstract concept of eigensystems enables us to do exactly that!
A slightly less trivial, but enormously more useful, property of the covariance matrix is the fact that its eigenvectors define a new coordinate system in which the covariance matrix is simply the diagonal matrix of the eigenvalues, as proven in derivation 2. This means that the eigenvector with the largest eigenvalue is the direction that accounts for the largest part of the variation of the sample, the direction in which the data are most spread out.
Now, since \(\mathbf{\Sigma}\) is real and symmetric, the eigenvectors are orthogonal and so the one with the second largest eigenvalue is the direction in which the data are most spread out after we remove any variation in the first direction. If we keep on in the same fashion, we find that the eigensystem yields a set of axes ranked from the greatest to the least variation in the sample data.
We can consequently deal with redundancies in our data by simply ignoring directions in which there is relatively little variation, effectively compressing it onto the most descriptive subset of the axes.
This is known as principal component analysis, or PCA, on account of the ranked axes being known as the principal components of the data.
To calculate the coordinates of a vector \(\mathbf{w}_k\) along the principal components we simply subtract the mean \(\mathbf{\bar{w}}\) and calculate the vector product of the result with their eigenvectors \(\mathbf{v}_i\)
To transform those coordinates back to the original ones, we multiply them by the relevant eigenvectors and add the results to the mean
Note that since the data are centred at zero in both dimensions we do not need to subtract the mean from them when calculating the covariance matrix
Program 2 illustrates how we can use PCA to eliminate redundancy in this way by projecting a sample of data, marked with black crosses, onto the principal component that accounts for most of its variation, marked with red crosses.
Now I'm reasonably confident that this shows that PCA can be used to do exactly what I've claimed, but for a truly compelling example of its utility we need to apply it to some real data. Before we do so, however, there are a couple of other subjects that I'd like to dip into.
[2] Ascending The Eigen, www.thusspakeak.com, 2014.
[3] Conquering The Eigen, www.thusspakeak.com, 2014.
[4] Define Normal, www.thusspakeak.com, 2014.
\[
\mathbf{M} \times \mathbf{v}_i = \lambda_i \times \mathbf{v}_i
\]
and in this post we shall take a look at one way that we might actually use them.A common issue in the statistical analysis of data is that of redundancy, when some measurements are more or less entirely dependent upon others. For example, if we took a survey of the age, height and shoe size of a population, we should expect that the age of an individual would largely determine their height and shoe size since children tend to be shorter and have smaller feet than adults.
If we can identify such redundancies we can be sure to concentrate our analysis upon the truly important features of the data. Perhaps surprisingly, the rather abstract concept of eigensystems enables us to do exactly that!
Principal Component Analysis
We can represent our data as vectors, \(\mathbf{w}_k\), where each element corresponds to a particular measurement. In our earlier example we'd have one for each individual in which the first element equals the age, the second the height and the third the shoe size. We can then calculate the mean of these vectors in the same way that we would for numbers to get a vector representing the theoretically average member of the population of \(n\) individuals
\[
\mathbf{\bar{w}} = \frac{1}{n}\sum_{k=1}^n \mathbf{w}_k
\]
where \(\sum\) is the summation sign. We can measure how spread out the data are in the \(i^{th}\) element with the variance
\[
\frac{1}{n} \sum_{k=1}^n \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i^2
\]
which is simply its mean squared difference from its average value. Similarly, we can calculate the interdependence of the \(i^{th}\) and \(j^{th}\) elements with the covariance
\[
\frac{1}{n} \sum_{k=1}^n \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i \times \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_j
\]
Noting that the covariance is equal to the variance when \(i\) and \(j\) are equal, we can represent them both with a single matrix \(\mathbf{\Sigma}\) with elements
Derivation 1: Calculating The Covariance Matrix With W
Noting that
\[
\mathbf{W}^\mathrm{T}_{ik} = \mathbf{W}_{ki}
\]
by the rules of matrix multiplication, we have
\[
\begin{align*}
\left(\mathbf{W}^\mathrm{T}\times\mathbf{W}\right)_{ij} &= \sum_{k=1}^n \mathbf{W}^\mathrm{T}_{ik} \times \mathbf{W}_{kj}\\
&= \sum_{k=1}^n \mathbf{W}_{ki} \times \mathbf{W}_{kj}\\
&= \sum_{k=1}^n \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i \times \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_j\\
&= n \times \mathbf{\Sigma}_{ij}
\end{align*}
\]
and hence
\[
\mathbf{\Sigma} = \frac{1}{n} \left(\mathbf{W}^\mathrm{T} \times \mathbf{W}\right)
\]
\[
\mathbf{\Sigma}_{ij} = \frac{1}{n} \sum_{k=1}^n \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i \times \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_j
\]
known as the covariance matrix. An interesting point to note is that if we construct a matrix \(\mathbf{W}\) whose rows are the vectors \(\mathbf{w}_k  \mathbf{\bar{w}}\)
\[
\mathbf{W}_{ki} = \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i
\]
then together with its transpose \(\mathbf{W}^\mathrm{T}\), which is the matrix that results from swapping the rows and colums of \(\mathbf{W}\), we have
\[
\mathbf{\Sigma} = \frac{1}{n} \left(\mathbf{W}^\mathrm{T} \times \mathbf{W}\right)
\]
as shown in derivation 1. A trivial consequence of this is that \(\mathbf{\Sigma}\) must be positive semidefinite, meaning that for every vector \(\mathbf{u}\)
\[
\mathbf{u} \times \mathbf{\Sigma} \times \mathbf{u} \geqslant 0
\]
since, on rearranging, we have
\[
\mathbf{u} \times \mathbf{\Sigma} \times \mathbf{u}
= \mathbf{u} \times \frac{1}{n} \left(\mathbf{W}^\mathrm{T} \times \mathbf{W}\right) \times \mathbf{u}
= \frac{1}{n} \times \left(\mathbf{u} \times \mathbf{W}^\mathrm{T}\right) \times \left(\mathbf{W} \times \mathbf{u}\right)
= \frac{1}{n} \times \mathbf{u}^\prime \times \mathbf{u}^\prime
\geqslant 0
\]
for some vector \(\mathbf{u}^\prime\). Note that this also implies that the eigenvalues of \(\mathbf{\Sigma}\) cannot be negative since
\[
0 \leqslant \mathbf{v}_i \times \mathbf{\Sigma} \times \mathbf{v}_i
= \mathbf{v}_i \times \left(\mathbf{\Sigma} \times \mathbf{v}_i\right)
= \mathbf{v}_i \times \left(\lambda_i \times \mathbf{v}_i\right)
= \lambda_i \times \mathbf{v}_i \times \mathbf{v}_i
= \lambda_i
\]
assuming that we have normalised the eigenvectors so that they are of unit length.
Derivation 2: Covariance On The Eigenvectors
To project the vector \(\left(\mathbf{w}_k  \mathbf{\bar{w}}\right)\) onto the axis defined by a unit eigenvector \(\mathbf{v}_i\) of \(\mathbf{\Sigma}\) we simply take their product
\[
\begin{align*}
\left(\mathbf{w}_k  \mathbf{\bar{w}}\right) \times \mathbf{v}_i &= \sum_l \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_l \times \left(\mathbf{v}_i\right)_l\\
&= \sum_l \mathbf{W}_{kl} \times \left(\mathbf{v}_i\right)_l
\end{align*}
\]
The covariance along \(\mathbf{v}_i\) and \(\mathbf{v}_j\) is therefore
\[
\begin{align*}
\mathbf{\Sigma}^\prime_{ij}
&= \frac{1}{n} \sum_{k=1}^n \left(\sum_l \mathbf{W}_{kl} \times \left(\mathbf{v}_i\right)_l\right) \times \left(\sum_m \mathbf{W}_{km} \times \left(\mathbf{v}_j\right)_m\right)\\
&= \frac{1}{n} \sum_{k=1}^n \left(\sum_l \left(\mathbf{v}_i\right)_l \times \mathbf{W}^\mathrm{T}_{lk}\right) \times \left(\sum_m \mathbf{W}_{km} \times \left(\mathbf{v}_j\right)_m\right)\\
&= \sum_l \sum_m \left(\mathbf{v}_i\right)_l \times \left(\frac{1}{n} \sum_{k=1}^n \mathbf{W}^\mathrm{T}_{lk} \times \mathbf{W}_{km}\right) \times \left(\mathbf{v}_j\right)_m\\
&= \sum_l \sum_m \left(\mathbf{v}_i\right)_l \times \mathbf{\Sigma}_{lm} \times \left(\mathbf{v}_j\right)_m\\
&= \mathbf{v}_i \times \mathbf{\Sigma} \times \mathbf{v}_j
= \mathbf{v}_i \times \lambda_j \times \mathbf{v}_j
= \lambda_j \times \mathbf{v}_i \times \mathbf{v}_j
\end{align*}
\]
Since \(\mathbf{\Sigma}\) is real and symmetric, the eigenvectors are orthogonal and, assuming that we have normalised them to unit length, we have
\[
\mathbf{\Sigma}^\prime_{ij} =
\begin{cases}
\lambda_j, &i = j\\
0, &i \neq j
\end{cases}
\]
Now, since \(\mathbf{\Sigma}\) is real and symmetric, the eigenvectors are orthogonal and so the one with the second largest eigenvalue is the direction in which the data are most spread out after we remove any variation in the first direction. If we keep on in the same fashion, we find that the eigensystem yields a set of axes ranked from the greatest to the least variation in the sample data.
We can consequently deal with redundancies in our data by simply ignoring directions in which there is relatively little variation, effectively compressing it onto the most descriptive subset of the axes.
This is known as principal component analysis, or PCA, on account of the ranked axes being known as the principal components of the data.
To calculate the coordinates of a vector \(\mathbf{w}_k\) along the principal components we simply subtract the mean \(\mathbf{\bar{w}}\) and calculate the vector product of the result with their eigenvectors \(\mathbf{v}_i\)
\[
\left(\mathbf{y}_k\right)_j = \left(\mathbf{w}_k  \mathbf{\bar{w}}\right) \times \mathbf{v}_j = \sum_i \left(\mathbf{w}_k  \mathbf{\bar{w}}\right)_i \times \left(\mathbf{v}_j\right)_i
\]
Of course if our goal is to eliminate redundancy we'll only be interested in the principal components with the largest eigenvalues and consequently \(\mathbf{y}_k\) will have fewer dimensions than \(\mathbf{w}_k\).To transform those coordinates back to the original ones, we multiply them by the relevant eigenvectors and add the results to the mean
\[
\mathbf{w}_k \approx \mathbf{\bar{w}} + \sum_j \left(\mathbf{y}_k\right)_j \times \mathbf{v}_j
\]
Naturally, if we have discarded the principal components with the least variation, this will only approximately recover the original vector; hence the wavy equals sign.
A Simple Demonstration Of PCA
The easiest way to demonstrate PCA is to apply it to a set of two dimensional data points that are spread out in directions that are not aligned with the \(x\) and \(y\) axes. We can create such a set of data with linear combinations of normally distributed random numbers
\[
\begin{align*}
z_0 &\sim N(0, 1)\\
z_1 &\sim N(0, 1)\\
\\
x &= a \times z_0 + b \times z_1\\
y &= c \times z_0 + d \times z_1
\end{align*}
\]
and as luck would have it we've already implemented a normal random number generator, ak.normalRnd
^{[4]}, as used in program 1.
Program 1: PCA In Action



Note that since the data are centred at zero in both dimensions we do not need to subtract the mean from them when calculating the covariance matrix
s
, from which we derive the principal components with ak.jacobiDecomposition
. We multiply them by the square root of their associated eigenvalues before plotting them over the scatter plot of the data to illustrate the standard deviation of the variation in their directions, which I think clearly demonstrates that the PCA has correctly identified the directions that best account for the variation in the data.Program 2 illustrates how we can use PCA to eliminate redundancy in this way by projecting a sample of data, marked with black crosses, onto the principal component that accounts for most of its variation, marked with red crosses.
Program 2: Eliminating Redundancy



Now I'm reasonably confident that this shows that PCA can be used to do exactly what I've claimed, but for a truly compelling example of its utility we need to apply it to some real data. Before we do so, however, there are a couple of other subjects that I'd like to dip into.
References
[1] At The Foot Of The Eigen, www.thusspakeak.com, 2014.[2] Ascending The Eigen, www.thusspakeak.com, 2014.
[3] Conquering The Eigen, www.thusspakeak.com, 2014.
[4] Define Normal, www.thusspakeak.com, 2014.
Leave a comment