Try Another K

In our exploration of the curious subject of cluster analysis[1], in which the goal is to classify a set of data into subsets of similar data without having a rigorous mathematical definition of what that actually means, we have covered the \(k\) means algorithm[2], which implicitly defines a clustering as minimising the sum of squared distances of the members of the clusters from their means, and have proposed that we might compare clusterings by the amount of the variance in the data that they account for[3], given by
\[ \frac{1}{2n^2}\sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \notin C_i} \left|x_{j_1} - x_{j_2}\right|^2 \]
where \(\sum\) is the summation sign, \(C_i\) is the \(i^\mathrm{th}\) cluster and \(x_j\) is the \(j^\mathrm{th}\) data point, \(\in\) means within, \(\notin\) means not within and the vertical bars stand for the magnitude of the value they enclose, in this case equalling the distance between the data points in the subtraction.
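To make this concrete, here's a minimal sketch that computes the formula directly for plain JavaScript arrays of points and an array of cluster labels; the accountedVariance and squaredDistance functions and their arguments are purely illustrative stand-ins rather than part of the ak library.

function squaredDistance(x, y) {
 var d = 0, i;
 for(i=0;i<x.length;++i) d += (x[i]-y[i])*(x[i]-y[i]);
 return d;
}

// the variance accounted for by a clustering, summing the squared distances
// between every ordered pair of points that lie in different clusters and
// dividing by twice the square of the number of points
function accountedVariance(points, labels) {
 var n = points.length;
 var sum = 0, j1, j2;
 for(j1=0;j1<n;++j1) {
  for(j2=0;j2<n;++j2) {
   if(labels[j1]!==labels[j2]) sum += squaredDistance(points[j1], points[j2]);
  }
 }
 return sum/(2*n*n);
}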

Much as we might use principal component analysis[4] to choose a subset of the axes that best accounts for the variance in a set of data, compressing it onto fewer dimensions without losing too much information, we tried to identify optimal clusterings by considering how much of the variance they accounted for.
The simplest scheme was to predefine some fraction of the variance to be accounted for and choose the first value of \(k\) that met it. Unfortunately, it's rather difficult to decide what fraction of the variance is the right one, as demonstrated by program 1.

Program 1: The Right Fraction?

To see the problem, first try running the program with different values of r until the \(k\) means algorithm correctly identifies the five clusters. Then, by clicking on the canvas, add a new cluster that is somewhat closer to one of the originals than they are to each other, say the one in the top right hand corner. Finally, run the program a few more times.

Annoying, isn't it?

To avoid this problem we tried looking at the rate at which the accounted for variance increased with \(k\), the idea being that if increasing the value of \(k\) resulted in a relatively small increase in that variance then we'd probably have already found a good clustering and should stop looking. Program 2 illustrates this idea by plotting how much more variance of program 1's data would be accounted for by increasing \(k\) by one as a proportion of the total variance of that data.

Program 2: The Rate Of Change Of Variance

Whilst this is undoubtedly more robust than guessing at the share of the variance that the correct clustering should account for, the choice of how small an increase is small enough is still a rather subjective one and so this scheme is not an ideal candidate for automation.
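By way of illustration, given an array in which the \(k^\mathrm{th}\) element holds the variance accounted for by the \(k\) cluster clustering, a minimal sketch of such a stopping rule, with an explicit threshold argument standing in for that subjective choice of how small is small enough, might be

// returns the first k for which increasing it by one adds less than threshold
// times the total variance to the accounted for variance; accounted[k] is
// assumed to hold the accounted for variance of the k cluster clustering
function firstSmallIncrease(accounted, total, threshold) {
 var k;
 for(k=1;k<accounted.length-1;++k) {
  if((accounted[k+1]-accounted[k])/total<threshold) return k;
 }
 return accounted.length-1;
}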

An Informed Decision

We've already noted that a \(k\) means clustering can be thought of as compressing a set of data into a set of cluster representatives, so it's rather tempting to consider an information theoretic measure of the error of that compression when choosing a clustering.

Information theory is, as its name suggests, concerned with the study of information; its definition and how to measure, transmit and store it. The key insight that kick-started the entire subject was that information is not the same thing as data[5], since data often contains redundancies that can be removed without affecting its information content, as dmnstrtd by th fct tht u cn undrstnd ths.
The information in a set of data is essentially what is left after it has been compressed as much as possible without loss of meaning and consequently compression is a fundamental concept in information theory. Now for data like a text message we should typically require it to be reproduced exactly when it is uncompressed, but this is not always the case. For example, data like a telephone signal can still be understood if the compression introduces a degree of noise, or distortion. We might naively measure this distortion with the expected mean squared difference between the original data and its reconstruction, but that fails to take into account that not all distortion is created equal. For telephone conversations, distortion at frequencies beyond the range of human hearing is irrelevant and so should be ignored by any reasonable measure of it.
Given that we can't recreate the clustered data from the set of cluster representatives and the number of the data that belong to each of their clusters, this being the obvious way to compress a set of data with a \(k\) means clustering, it clearly has more in common with a distorted telephone signal than it has with a perfectly reconstructed text message. We might therefore try to find a measure of cluster distortion that reflects the implicit definition of what constitutes a cluster according to the \(k\) means algorithm.
Since the \(k\) means algorithm seeks to minimise the sum of the squared distances between the data and the representatives of the clusters that they are members of, it naturally favours globular, or in other words approximately spherical, clusters. If we also assume that the clusters are approximately the same size and shape, not unreasonably if they are the result of some consistent source of variability around truly representative data points, then there's a relatively simple measure of distortion that accentuates them based on, would you believe it, principal component analysis.

Principal Components Again

You will no doubt recall that the principal components of a set of data are the axes, centred on its mean, that most effectively account for its variance, as illustrated by program 3 in which the red lines show the directions of the principal components and their lengths show the standard deviations of the data in their directions.

Program 3: The Principal Components

If we rescale the data along each principal component until its variances in their directions are all equal to one then the squares of their distances from the mean represent normalised errors in which each source of variance is given equal weight.

Derivation 1: The Mean Of The Vector Products
The mean of the products of the vectors \(\mathbf{x}_i\) with the vector \(\mathbf{y}\) is simply
\[ \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \times \mathbf{y} \]
Denoting the \(j^\mathrm{th}\) element of a vector \(\mathbf{v}\) with \(v_j\) and that of a vector \(\mathbf{v}_i\) with \(v_{ij}\), then by the definition of the vector product we have
\[ \begin{align*} \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \times \mathbf{y} &= \frac{1}{n} \sum_{i=1}^n \sum_j x_{ij} \times y_j\\ &= \sum_j \left(\frac{1}{n}\sum_{i=1}^n x_{ij}\right) \times y_j\\ &= \sum_j \bar{x}_j \times y_j\\ &= \mathbf{\bar{x}} \times \mathbf{y} \end{align*} \]

Derivation 2: The Transformed Variance
If a set of numbers \(x_i\) has a mean of zero then so does the set of their products with a finite constant \(c\), since
\[ \frac{1}{n} \sum_{i=1}^n c \times x_i = c \times \frac{1}{n} \sum_{i=1}^n x_i = c \times 0 = 0 \]
The variance of those products is therefore given by
\[ \begin{align*} \frac{1}{n} \sum_{i=1}^n \left(c \times x_i - 0\right)^2 &= \frac{1}{n} \sum_{i=1}^n c^2 \times x_i^2\\ &= c^2 \times \frac{1}{n} \sum_{i=1}^n x_i^2\\ &= c^2 \times \frac{1}{n} \sum_{i=1}^n \left(x_i-0\right)^2\\ &= c^2 \lambda \end{align*} \]
and so setting \(c\) to \(\dfrac{1}{\sqrt{\lambda}}\) yields a variance of one.
We have already seen that we can find how far a vector \(\mathbf{x}\) extends in the direction of a vector \(\mathbf{y}\) with their vector product[6], defined by
\[ \mathbf{x} \times \mathbf{y} = \sum_j x_j \times y_j \]
Note that if a set of vectors \(\mathbf{x}_i\) has a mean of \(\mathbf{\bar{x}}\) then the set of their vector products with \(\mathbf{y}\) has a mean of \(\mathbf{\bar{x}} \times \mathbf{y}\), as proven by derivation 1.
Now, if a set of numbers has a mean of zero and a variance of \(\lambda\) then we can transform it into a set of numbers with a mean of zero and a variance of one by dividing each of them by their standard deviation \(\sqrt{\lambda}\), as proven by derivation 2.
Finally, it is trivially the case that the vectors \(\left(\mathbf{x}_i-\mathbf{\bar{x}}\right)\) have a mean of zero.

Putting these observations together, and assuming that the data are vectors, if the \(i^\mathrm{th}\) principal component is \(\mathbf{v}_i\) along which the data has variance \(\lambda_i\) then for a given datum \(\mathbf{x}\) the aforementioned normalised error is given by
\[ \sum_i \left(\frac{\left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i}{\sqrt{\lambda_i}}\right)^2 \]
The principal components are simply the eigenvectors of the covariance matrix \(\mathbf{\Sigma}\) whose elements are the covariances of the relevant elements of the data
\[ \Sigma_{ij} = \frac{1}{n} \sum_{k=1}^n \left(x_{ki} - \bar{x}_i\right) \times \left(x_{kj} - \bar{x}_j\right) \]
where \(x_{ki}\) is the \(i^\mathrm{th}\) element of the \(k^\mathrm{th}\) datum and \(\bar{x}_i\) is the \(i^\mathrm{th}\) element of their mean. Furthermore, the variances of the data along the principal components are the associated eigenvalues.
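For example, if the principal components happen to lie along the coordinate axes then \(\mathbf{\Sigma}\) is diagonal with the variances \(\lambda_i\) on its diagonal, the eigenvectors are the coordinate axes themselves and the normalised error of a datum \(\mathbf{x}\) reduces to
\[ \sum_i \left(\frac{x_i-\bar{x}_i}{\sqrt{\lambda_i}}\right)^2 \]
so that each element's squared deviation from the mean is simply divided by the variance in its direction.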

Eigensystems Again

Derivation 3: Two Properties Of Eigensystems
We have already proven that
\[ \mathbf{V}^\mathrm{T} \times \mathbf{V} = \mathbf{V} \times \mathbf{V}^\mathrm{T} = \mathbf{I} \]
where \(\mathbf{I}\) is the identity matrix, and we shall take it as given here. Since the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{M}\) and the diagonal elements of \(\mathbf{\Lambda}\) are their eigenvalues we also have
\[ \mathbf{M} \times \mathbf{V} = \mathbf{V} \times \mathbf{\Lambda} \]
and consequently
\[ \mathbf{M} = \mathbf{M} \times \mathbf{V} \times \mathbf{V}^\mathrm{T} = \mathbf{V} \times \mathbf{\Lambda} \times \mathbf{V}^\mathrm{T} \]
Next, we have
\[ \begin{align*} &\left(\mathbf{V} \times \mathbf{\Lambda} \times \mathbf{V}^\mathrm{T}\right) \times \left(\mathbf{V} \times \mathbf{\Lambda}^{-1} \times \mathbf{V}^\mathrm{T}\right)\\ &\quad= \mathbf{V} \times \mathbf{\Lambda} \times \left(\mathbf{V}^\mathrm{T} \times \mathbf{V}\right) \times \mathbf{\Lambda}^{-1} \times \mathbf{V}^\mathrm{T}\\ &\quad= \mathbf{V} \times \mathbf{\Lambda} \times \mathbf{I} \times \mathbf{\Lambda}^{-1} \times \mathbf{V}^\mathrm{T}\\ &\quad= \mathbf{V} \times \mathbf{\Lambda} \times \mathbf{\Lambda}^{-1} \times \mathbf{V}^\mathrm{T}\\ &\quad= \mathbf{V} \times \left(\mathbf{\Lambda} \times \mathbf{\Lambda}^{-1}\right) \times \mathbf{V}^\mathrm{T}\\ &\quad= \mathbf{V} \times \mathbf{I} \times \mathbf{V}^\mathrm{T} = \mathbf{V} \times \mathbf{V}^\mathrm{T} = \mathbf{I} \end{align*} \]

Derivation 4: The Normalised Error
Since the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{\Sigma}\) we have
\[ \begin{align*} \mathbf{y}_1 &= \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{V}\\ y_{1i} &= \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i\\ \\ \mathbf{y}_2 &= \mathbf{V}^\mathrm{T} \times \left(\mathbf{x}-\mathbf{\bar{x}}\right)\\ y_{2i} &= \mathbf{v}_i \times \left(\mathbf{x}-\mathbf{\bar{x}}\right) = \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i \end{align*} \]
Since \(\mathbf{\Lambda}\) is diagonal, then so is \(\mathbf{\Lambda}^{-1}\) and its elements are simply the reciprocals of the eigenvalues of \(\mathbf{\Sigma}\) and so
\[ \begin{align*} \mathbf{y}_3 &= \mathbf{\Lambda}^{-1} \times \mathbf{y}_2\\ y_{3i} &= \frac{\left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i}{\lambda_i} \end{align*} \]
Finally, we have
\[ \begin{align*} \mathbf{y}_1 \times \mathbf{y}_3 &= \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{\Sigma}^{-1} \times \left(\mathbf{x}-\mathbf{\bar{x}}\right)\\ &= \sum_i y_{1i} \times y_{3i}\\ &= \sum_i \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i \times \frac{\left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i}{\lambda_i}\\ &= \sum_i \left(\frac{\left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{v}_i}{\sqrt{\lambda_i}}\right)^2 \end{align*} \]
As previously described[7], the eigenvectors \(\mathbf{v}_i\) and eigenvalues \(\lambda_i\) of a matrix \(\mathbf{M}\) are the non-zero vectors and possibly zero numbers satisfying the equation
\[ \mathbf{M} \times \mathbf{v}_i = \lambda_i \times \mathbf{v}_i \]
We have proven that if the matrix is both real, as opposed to complex, and symmetric, so that its elements satisfy
\[ M_{ij}=M_{ji} \]
then its eigenvectors and eigenvalues are also real[8]. Note that we typically require that the eigenvectors have a magnitude of one since multiplying them by a constant has no effect on their satisfying the equation.

Now if we construct a matrix \(\mathbf{V}\) whose columns are those unit eigenvectors
\[ V_{ij}=v_{ji} \]
where \(v_{ji}\) is the \(i^\mathrm{th}\) element of the \(j^\mathrm{th}\) eigenvector, and another \(\mathbf{\Lambda}\) that is zero everywhere except on the diagonal where it equals the eigenvalues
\[ \Lambda_{ii} = \lambda_i \]
then we can recover the original matrix with
\[ \mathbf{M} = \mathbf{V} \times \mathbf{\Lambda} \times \mathbf{V}^\mathrm{T} \]
where \(\mathbf{M}^\mathrm{T}\) is the transpose of \(\mathbf{M}\) in which the rows and columns are swapped
\[ M^\mathrm{T}_{ij} = M_{ji} \]
We can also find its inverse with
\[ \mathbf{M}^{-1} = \mathbf{V} \times \mathbf{\Lambda}^{-1} \times \mathbf{V}^\mathrm{T} \]
both of which are proven in derivation 3.
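To see these identities in action, the following sketch, which uses plain JavaScript arrays of arrays rather than the ak library's matrix objects, reconstructs the symmetric matrix whose rows are (2, 1) and (1, 2), together with its inverse, from its unit eigenvectors and eigenvalues; the mul and transpose helpers are, naturally, just for the purposes of illustration.

function mul(a, b) {
 var rows = a.length, cols = b[0].length, inner = b.length;
 var c = [], i, j, k;
 for(i=0;i<rows;++i) {
  c.push([]);
  for(j=0;j<cols;++j) {
   c[i].push(0);
   for(k=0;k<inner;++k) c[i][j] += a[i][k]*b[k][j];
  }
 }
 return c;
}

function transpose(a) {
 var t = [], i, j;
 for(j=0;j<a[0].length;++j) {
  t.push([]);
  for(i=0;i<a.length;++i) t[j].push(a[i][j]);
 }
 return t;
}

var r = Math.sqrt(0.5);
var v = [[r, r], [r, -r]];   // columns are the unit eigenvectors (1,1)/√2 and (1,-1)/√2
var l = [[3, 0], [0, 1]];    // their eigenvalues 3 and 1 on the diagonal
var li = [[1/3, 0], [0, 1]]; // the reciprocals of those eigenvalues

console.log(mul(mul(v, l), transpose(v)));  // approximately [[2,1],[1,2]]
console.log(mul(mul(v, li), transpose(v))); // approximately [[2/3,-1/3],[-1/3,2/3]], its inverse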

The upshot of this, admittedly somewhat tedious, refresher in linear algebra is that the principal component analysis inspired normalised error is simply
\[ \left(\mathbf{x}-\mathbf{\bar{x}}\right) \times \mathbf{\Sigma}^{-1} \times \left(\mathbf{x}-\mathbf{\bar{x}}\right) \]
as proven by derivation 4.
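As a minimal sketch of this expression for plain JavaScript arrays, taking the inverse covariance matrix as given rather than computing it from an eigendecomposition as we shall do below, we might write something along the lines of

// the normalised error of the datum x given the mean xbar and the inverse
// covariance matrix sigmaInv, computed as the quadratic form above
function normalisedError(x, xbar, sigmaInv) {
 var n = x.length;
 var off = [], err = 0, i, j;
 for(i=0;i<n;++i) off.push(x[i]-xbar[i]);
 for(i=0;i<n;++i) {
  for(j=0;j<n;++j) err += off[i]*sigmaInv[i][j]*off[j];
 }
 return err;
}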

Distortion At Last

Under the aforementioned assumption that the true clusters in our data are roughly spherical and approximately the same size and shape, it's not unreasonable to suppose that there's a single set of principal components that lie in nearly the same directions as those that we might separately find for each cluster.
If that single set of principal components has an implied covariance matrix of \(\mathbf{\Sigma}\) then we can define the cluster distortion of a set of \(n\) clusters \(C_i\) with means \(\mathbf{\bar{x}}_i\) of a data set \(\mathbf{x}_j\) in terms of the normalised error with
\[ \sum_{i=1}^n \sum_{j \in C_i} \left(\mathbf{x}_j-\mathbf{\bar{x}}_i\right) \times \mathbf{\Sigma}^{-1} \times \left(\mathbf{x}_j-\mathbf{\bar{x}}_i\right) \]
The trick, of course, is to find a way to calculate \(\mathbf{\Sigma}\) for any given clustering that yields a measure of cluster distortion that we can use to effectively choose between them.

One way to do it is to calculate the covariance matrices for each cluster, let's call them \(\mathbf{\Sigma}_i\), and take their average
\[ \mathbf{\bar{\Sigma}} = \frac{1}{n} \sum_{i=1}^n \mathbf{\Sigma}_i \]
Now if you've mentally raised an eyebrow and thought to yourself, hang on a minute, that's not how you take averages across multiple sets of data, you need to weight them by their sizes \(n_i\)
\[ \mathbf{\bar{\Sigma}} = \frac{1}{\sum_{i=1}^n n_i} \sum_{i=1}^n n_i \mathbf{\Sigma}_i \]
then well done, you are quite correct!
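To see the difference in practice, here's a small sketch, once again using plain JavaScript arrays of arrays rather than the ak library's matrix objects, that computes both averages for a pair of illustrative per cluster covariance matrices and sizes; the averageCovariance function is a made up helper rather than part of the library.

// the element by element weighted sum of a set of covariance matrices
function averageCovariance(covs, weights) {
 var d = covs[0].length;
 var avg = [], i, j, k;
 for(i=0;i<d;++i) {
  avg.push([]);
  for(j=0;j<d;++j) avg[i].push(0);
 }
 for(k=0;k<covs.length;++k) {
  for(i=0;i<d;++i) {
   for(j=0;j<d;++j) avg[i][j] += weights[k]*covs[k][i][j];
  }
 }
 return avg;
}

var cov1 = [[1.0, 0.2], [0.2, 1.0]]; // the covariance of a cluster of 10 points
var cov2 = [[0.2, 0.0], [0.0, 0.2]]; // the covariance of a cluster of 2 points

var unweighted = averageCovariance([cov1, cov2], [1/2, 1/2]);
var weighted   = averageCovariance([cov1, cov2], [10/12, 2/12]);

The unweighted average gives the smaller cluster's covariance the same say as the larger's, dragging the average down further than the size weighted one does.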

But...

The point here is to create a measure of distortion that captures our presupposition that the clusters should be roughly the same size and the undue weight given to smaller clusters by the wrong way of calculating the average covariance does exactly that.
To see why, imagine that we have found the correct clustering with \(k\) clusters using the \(k\) means algorithm. If we were to reduce the number of clusters by one by joining two of them together then that joint cluster's normalised error would be amplified by the fact that its covariance would be larger than average. Furthermore, if we were to increase the number of clusters by one by splitting one of them in two then they would cause a reduction in our average covariance, increasing the normalised error of the rest of the clusters.
If, as we increase \(k\), we see a minimum in this normalised error then it would be fair to assume that we have found the correct clustering, or at the very least a pretty good one. The great advantage over our previous schemes for choosing clusterings is that it's completely objective; there's no need to choose some arbitrary level of the normalised error in order to spot a local minimum in it!

Derivation 5: The Covariance As A Matrix Product
If we define a matrix \(\mathbf{M}\) with
\[ M_{ij} = x_{ij} - \bar{x}_j \]
where \(x_{ij}\) is the \(j^\mathrm{th}\) element of the \(i^\mathrm{th}\) datum and \(\bar{x}_j\) is the \(j^\mathrm{th}\) element of their mean then, by the rules of matrix multiplication, we have
\[ \begin{align*} \left(\mathbf{M}^\mathrm{T} \times \mathbf{M}\right)_{ij} &= \sum_{k=1}^n M^\mathrm{T}_{ik} \times M_{kj}\\ &= \sum_{k=1}^n M_{ki} \times M_{kj}\\ &= \sum_{k=1}^n \left(x_{ki} - \bar{x}_i\right) \times \left(x_{kj} - \bar{x}_j\right)\\ &= n \times \Sigma_{ij} \end{align*} \]

ak.clusterDistortion

In implementing our measure of cluster distortion we shall exploit a trick of linear algebra, calculating the covariance matrix from the product of the transpose of a matrix whose rows are the differences between the data and their mean with that matrix itself, as illustrated by derivation 5.
Those differences are calculated by the offsets function given in listing 1.

Listing 1: The offsets Function
function offsets(c, d) {
 var n = c.size();
 var offs = new Array(n);
 var m, i;

 m = d.at(c.at(0));
 if(ak.type(m)!==ak.VECTOR_T) throw new Error('invalid argument in ak.clusterDistortion');
 for(i=1;i<n;++i) m = ak.add(m, d.at(c.at(i)));
 m = ak.div(m, n);

 for(i=0;i<n;++i) offs[i] = ak.sub(d.at(c.at(i)), m);
 return offs;
}

Here, c is expected to be an ak.cluster object and d an ak.clusterData object containing ak.vector objects, and the function simply calculates the mean of those data that are members of the cluster and then their differences from it.
We can convert the array of ak.vector objects into an ak.matrix object[9] whose rows are those vectors by first converting them into arrays with their toArray methods and using the resulting array of arrays to construct it, as done in listing 2.

Listing 2: ak.clusterDistortion
ak.clusterDistortion = function(c) {
 var dist = 0;
 var clusters, data, nc, nd, i, j, d, offs, cov, offi, inv;

 if(ak.type(c)!==ak.CLUSTERING_T || ak.type(c.data)===ak.UNDEFINED_T) {
  throw new Error('invalid argument in ak.clusterDistortion');
 }

 clusters = c.clusters;
 data = c.data;
 nc = clusters.size();
 nd = data.size();

 if(nd>1) {
  d = data.at(0).dims();
  offs = [];
  cov = ak.matrix(d, d, 0);

  for(i=0;i<nc;++i) {
   offi = offsets(clusters.at(i), data);
   offs.push.apply(offs, offi);
   for(j=0;j<offi.length;++j) offi[j] = offi[j].toArray();
   offi = ak.matrix(offi);
   cov = ak.add(cov, ak.div(ak.mul(ak.transpose(offi), offi), offi.rows()));
  }
  cov = ak.div(cov, nc);
  inv = ak.inv(ak.jacobiDecomposition(cov));

  for(i=0;i<nd;++i) dist += ak.mul(offs[i], ak.mul(inv, offs[i]));
  dist /= nd*d;
 }
 return dist;
};

Given that our average covariance matrix cov is both real and symmetric it makes sense to invert it using its eigensystem which we find with the Jacobi decomposition[10], as we did in program 3. Finally we add up the normalised errors of the vector offsets from their cluster means stored in offs using our average covariance and return the sum divided by both the number of data and their dimension so as to yield the average distortion per dimension.

Program 4 plots the distortion against \(k\) for clusterings of the data from program 1 and, if you run it several times, you will see that the first local minimum frequently, but not always, occurs for \(k\) equal to five, the correct number of clusters.

Program 4: The Cluster Distortion

That the first local minimum doesn't always occur at the correct number of clusters is rather frustrating, but at least the choice is unambiguous, which is frankly a damn sight better than we've managed so far.
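A minimal sketch of how we might pick out that first local minimum automatically, given an array in which dist[k] is assumed to hold the distortion of the \(k\) cluster clustering with the first entry at \(k\) equal to one, is

// returns the first k at which the distortion is smaller than it is for both
// of its neighbours, falling back to the last k if there is no such minimum
function firstLocalMinimum(dist) {
 var k;
 for(k=2;k<dist.length-1;++k) {
  if(dist[k]<dist[k-1] && dist[k]<dist[k+1]) return k;
 }
 return dist.length-1;
}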
Better yet would be to find a clustering without specifying in advance how many clusters there should be, circumventing the problem entirely, and we shall attempt to do so in a future post.

References

[1] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.

[2] K Means Business, www.thusspakeak.com, 2015.

[3] A Good K To Try, www.thusspakeak.com, 2015.

[4] The Principals Of The Things, www.thusspakeak.com, 2014.

[5] Shannon, C., A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, 1948.

[6] What's Our Vector, Victor?, www.thusspakeak.com, 2014.

[7] At The Foot Of The Eigen, www.thusspakeak.com, 2014.

[8] Ascending The Eigen, www.thusspakeak.com, 2014.

[9] The Matrix Recoded, www.thusspakeak.com, 2014.

[10] Conquering The Eigen, www.thusspakeak.com, 2014.
