A Good K To Try


We have seen how the \(k\) means algorithm[1] can classify a set of data into \(k\) subsets of mutually similar data with the simple iterative scheme of placing each datum into the cluster whose representative it is closest to and then replacing those representatives with the means of the data in each cluster. Whilst this has a reasonably intuitive implicit definition of similarity, it also has the unfortunate problem that we need to know how many clusters there are in the data if we are to have any hope of correctly identifying them, as evidenced by the rather unintuitive results of program 1.

Program 1: The Wrong k

Note that clicking on the chart area will add data to the points variable and the Clear button next to the Reset button will empty it so you can experiment with your own inputs to ak.kMeansClustering.
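
To make the scheme concrete, a single step of it might be sketched in plain JavaScript for two dimensional points represented as [x, y] arrays, as below; this is purely an illustration and not the ak library's implementation of ak.kMeansClustering.

// a single k means step for points given as [x, y] arrays: assign each point
// to its nearest representative and then replace each representative with the
// mean of its cluster's members (an illustrative sketch only)
function kMeansStep(points, reps) {
 var clusters = reps.map(function() {return [];});
 var i, r, d, best, bestD;

 for(i=0;i<points.length;++i) {
  best = 0;
  bestD = Number.POSITIVE_INFINITY;
  for(r=0;r<reps.length;++r) {
   d = Math.pow(points[i][0]-reps[r][0], 2) + Math.pow(points[i][1]-reps[r][1], 2);
   if(d<bestD) {best = r; bestD = d;}
  }
  clusters[best].push(points[i]);
 }

 return clusters.map(function(c, r) {
  var sx = 0, sy = 0, j;
  if(c.length===0) return reps[r];
  for(j=0;j<c.length;++j) {sx += c[j][0]; sy += c[j][1];}
  return [sx/c.length, sy/c.length];
 });
}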

Comparing Clusterings

Last time we proved that each step of the \(k\) means algorithm must result in the sum of squared distances between each cluster representative and the members of its cluster growing no larger; in this sense it is a kind of local minimisation algorithm for that sum, akin to the polytope algorithm that we covered in some previous posts[2][3].
It's rather tempting, therefore, to see if we can exploit the relationship between different values of \(k\) and the value of the sum upon which the algorithm converges for each of them to decide what an appropriate value of the former might be.

Now, we'll need to be able to make a like-for-like comparison for different values of \(k\), which we can do by considering the total sum of squared distances between every pair of members of a set
\[ \sum_{i=1}^n \sum_{j=1}^n \left|x_i - x_j\right|^2 \]
where \(\sum\) is the summation sign and the vertical bars represent the magnitude of the value between them, in this case the distance from \(x_i\) to \(x_j\). The advantage of this over the original sum is that, for \(k\) clusters \(C_i\) of \(n\) data \(x_j\), we trivially have
\[ \sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \in C_i} \left|x_{j_1} - x_{j_2}\right|^2 + \sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \notin C_i} \left|x_{j_1} - x_{j_2}\right|^2 = \sum_{j_1=1}^n \sum_{j_2=1}^n \left|x_{j_1} - x_{j_2}\right|^2 \]
where \(\in\) means within and \(\notin\) means not within, since, for each point in a given cluster, every other point is either in that same cluster or in a different one.
The sum of squared distances between every pair of points and the sum of squared distances from their mean are related by
\[ \sum_{i=1}^n \sum_{j=1}^n \left|x_i - x_j\right|^2 = 2n \times \sum_{i=1}^n \left|x_i - \bar{x}\right|^2 \]
where \(\bar{x}\) is the mean, as proven in derivation 1.

Derivation 1: Sums Of Squared Distances
For all arithmetic types the notion of straight line distance is equivalent to that of vectors so we need only consider them, for which
\[ |\mathbf{x}-\mathbf{y}|^2 = (\mathbf{x}-\mathbf{y}) \times (\mathbf{x}-\mathbf{y}) = \mathbf{x} \times \mathbf{x} - 2 \times \mathbf{x} \times \mathbf{y} + \mathbf{y} \times \mathbf{y} \]
Applying this equivalence to the sums of squared distances we have
\[ \begin{align*} \sum_{i=1}^n \left|x_i - \bar{x}\right|^2 & = \sum_{i=1}^n \left(x_i^2 - 2 x_i \bar{x} + \bar{x}^2\right) = \sum_{i=1}^n x_i^2 - 2\bar{x} \sum_{i=1}^n x_i + \sum_{i=1}^n \bar{x}^2\\ & = \sum_{i=1}^n x_i^2 - 2\bar{x} \times n\bar{x} + n \bar{x}^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 \end{align*} \]
and
\[ \begin{align*} \sum_{i=1}^n \sum_{j=1}^n \left|x_i - x_j\right|^2 & = \sum_{i=1}^n \sum_{j=1}^n \left(x_i^2 - 2 x_i x_j + x_j^2\right) = \sum_{i=1}^n \sum_{j=1}^n x_i^2 - \sum_{i=1}^n \sum_{j=1}^n 2 x_i x_j + \sum_{i=1}^n \sum_{j=1}^n x_j^2\\ & = n\sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i \sum_{j=1}^n x_j + n \sum_{j=1}^n x_j^2 = 2n\sum_{i=1}^n x_i^2 - 2 \left(\sum_{i=1}^n x_i\right)^2\\ & = 2n\sum_{i=1}^n x_i^2 - 2n^2 \bar{x}^2 = 2n\left(\sum_{i=1}^n x_i^2 - n \bar{x}^2\right) = 2n \sum_{i=1}^n \left|x_i - \bar{x}\right|^2 \end{align*} \]
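
As a quick numerical sanity check of this last identity, the following few lines of plain JavaScript compare both sides for a handful of numbers and, up to floating point rounding, they come out equal.

// compare the pairwise sum of squared distances with 2n times the sum of
// squared distances from the mean for a few plain numbers
var x = [1, 2, 4, 8, 16];
var n = x.length;
var mean = 0, pairwise = 0, fromMean = 0;
var i, j;

for(i=0;i<n;++i) mean += x[i];
mean /= n;

for(i=0;i<n;++i) {
 fromMean += Math.pow(x[i]-mean, 2);
 for(j=0;j<n;++j) pairwise += Math.pow(x[i]-x[j], 2);
}

console.log(pairwise, 2*n*fromMean); // the two agree up to rounding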

This implies that
\[ \frac{1}{2n^2}\sum_{i=1}^n \sum_{j=1}^n \left|x_i - x_j\right|^2 \]
is a measure of the variance of the data and that
\[ \frac{1}{2n^2}\sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \notin C_i} \left|x_{j_1} - x_{j_2}\right|^2 \]
is the amount of it that is accounted for by the clustering.

Now this is rather evocative of how we used principal component analysis[4] to find axes that accounted for the most to the least variance in a set of data and that similarity suggests a way that we might settle upon a suitable value for \(k\); just as we could project the data onto a subset of the principal components to represent it with fewer dimensions whilst preserving a sufficient fraction of its variance to still be useful, so we can decide upon how much of the variance we want the clustering to account for and search for a \(k\) that meets that requirement. That such a \(k\) exists follows from the fact that a clustering with just one cluster accounts for none of the variance and one in which every datum is in its own cluster accounts for all of it.

Of course, before we can use the cluster variance to judge the effectiveness of a clustering we'll need to implement a function to measure it. So let's get on with it!

Calculating The Cluster Variance

On the face of it calculating the variance that a clustering accounts for would appear to be an order \(n^2\) operation. With a little thought, however, we can bring it down to a much more appealing order \(n\).
To do this we first express it as the difference between the total variance and the variance that the clustering doesn't account for
\[ \frac{1}{2n^2} \sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \notin C_i} \left|x_{j_1} - x_{j_2}\right|^2 = \frac{1}{2n^2} \left(\sum_{j_1=1}^n \sum_{j_2=1}^n \left|x_{j_1} - x_{j_2}\right|^2 - \sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \in C_i} \left|x_{j_1} - x_{j_2}\right|^2\right) \]
The next thing to note is that we have proven that the total sum of squared distances between the members of a set of data is related to the sum of squared distances from their mean with
\[ \sum_{i=1}^n \sum_{j=1}^n \left|x_i - x_j\right|^2 = 2n \sum_{i=1}^n \left|x_i - \bar{x}\right|^2 \]
Now the original data and each of the clusters are trivially sets so we can express their total sums of squared distances in this way and can consequently reduce the right hand side of the formula to an order \(n\) operation with
\[ \sum_{j_1=1}^n \sum_{j_2=1}^n \left|x_{j_1} - x_{j_2}\right|^2 - \sum_{i=1}^k \sum_{j_1 \in C_i} \sum_{j_2 \in C_i} \left|x_{j_1} - x_{j_2}\right|^2 = 2n \sum_{j=1}^n \left|x_j - \bar{x}\right|^2 - \sum_{i=1}^k 2n_i \sum_{j \in C_i} \left|x_j - \bar{x}_i\right|^2 \]
where \(n_i\) is the number of members of the \(i^\mathrm{th}\) cluster and \(\bar{x}_i\) is its mean. Listing 1 provides a pair of helper functions to calculate these sums of squared distances.

Listing 1: Sums Of Squared Distances To Means
function totalSum2(d) {
 var s2 = 0;
 var n = d.size();
 var m, i;

 if(n) {
  m = d.at(0);
  for(i=1;i<n;++i) m = ak.add(m, d.at(i));
  m = ak.div(m, n);

  for(i=0;i<n;++i) s2 += Math.pow(ak.dist(m, d.at(i)), 2);
 }
 return s2;
}

function clusterSum2(c, d) {
 var s2 = 0;
 var n = c.size();
 var m, i;

 if(n) {
  m = d.at(c.at(0));
  for(i=1;i<n;++i) m = ak.add(m, d.at(c.at(i)));
  m = ak.div(m, n);

  for(i=0;i<n;++i) s2 += Math.pow(ak.dist(m, d.at(c.at(i))), 2);
 }
 return s2;
}

Here the argument d is expected to be the data member of an ak.clustering object[5] and the argument c an element of its clusters member. Note that since we are using our overloading mechanism for the arithmetic operations, these functions will work for any compatible data types such as ak.vector[6] objects with equal dimensions or a combination of ak.complex objects[7] and ordinary JavaScript numbers.
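
For example, assuming that ak.vector objects can be constructed from arrays of numbers, as covered in [6], we might exercise totalSum2 by wrapping a plain array in a minimal object exposing the size and at methods that it expects, as sketched below.

// an illustrative sketch: wrap a plain array in the size/at interface that
// totalSum2 expects
function arrayData(a) {
 return {size: function() {return a.length;}, at: function(i) {return a[i];}};
}

// assumes that ak.vector accepts an array of numbers, as described in [6]
var d = arrayData([ak.vector([0, 0]), ak.vector([2, 0]), ak.vector([0, 2])]);
console.log(totalSum2(d)); // the sum of squared distances from the data's mean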

The variance accounted for by a clustering is calculated with the ak.betweenClusterVariance function given in listing 2.

Listing 2: The Between Cluster Variance
ak.betweenClusterVariance = function(c) {
 var clusters, data, n, s2, i;

 if(ak.type(c)!==ak.CLUSTERING_T || ak.type(c.data)===ak.UNDEFINED_T) {
  throw new Error('invalid argument in ak.betweenClusterVariance');
 }

 clusters = c.clusters;
 data = c.data;
 n = clusters.size();

 s2 = data.size() * totalSum2(data);
 for(i=0;i<n;++i) s2 -= clusters.at(i).size() * clusterSum2(clusters.at(i), data);
 return s2 / Math.pow(data.size(), 2);
};

First we check that the argument c is an ak.clustering object that's been initialised with the data that was clustered so that the calls to helper functions will have the arguments they expect. We then multiply each sum of squares by the number of data used to calculate them so that we correctly accumulate the result and finally divide by the square of the number of data points to transform it into the variance. Note that we've cancelled out the factor of two that appears in both the multiplications and the final division of our formula.

The unaccounted-for variance is also a useful figure and so the ak.withinClusterVariance function is provided to calculate it, as shown in listing 3.

Listing 3: The Within Cluster Variance
ak.withinClusterVariance = function(c) {
 var clusters, data, n, s2, i;

 if(ak.type(c)!==ak.CLUSTERING_T || ak.type(c.data)===ak.UNDEFINED_T) {
  throw new Error('invalid argument in ak.withinClusterVariance');
 }

 clusters = c.clusters;
 data = c.data;
 n = clusters.size();

 s2 = 0;
 for(i=0;i<n;++i) s2 += clusters.at(i).size() * clusterSum2(clusters.at(i), data);
 return s2 / Math.pow(data.size(), 2);
};

Both of these functions can be found in ClusterVariance.js.
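
Note that, by construction, these two variances should sum to the total variance of the data, which gives us a simple way to sanity check them for any ak.clustering object that has been initialised with its data; a minimal sketch of such a check is given below.

// a sanity check sketch: the accounted for and unaccounted for variances of a
// clustering c should sum to the total variance of its data
function checkVariances(c) {
 var total = totalSum2(c.data) / c.data.size();
 var sum = ak.betweenClusterVariance(c) + ak.withinClusterVariance(c);
 return Math.abs(sum-total) <= 1e-10*(1+Math.abs(total));
}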

Using The Variance To Select A Clustering

Once we have decided on how much of the variance we want the clustering to account for it is a simple matter to iterate over values for \(k\) until we find one that does so, as demonstrated by program 2.

Program 2: Searching For k
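
In outline, the search might look something like the following sketch; note that the (data, k) argument order that I've used for ak.kMeansClustering is an assumption on my part rather than something established here, so treat the call as illustrative.

// a sketch of the search: increase k until the clustering accounts for at
// least a proportion r of the total variance of the data
// n.b. the (data, k) argument order of ak.kMeansClustering is an assumption
function searchK(data, r, kmax) {
 var k, c, total;

 for(k=1;k<=kmax;++k) {
  c = ak.kMeansClustering(data, k);
  if(k===1) total = totalSum2(c.data) / c.data.size();
  if(ak.betweenClusterVariance(c) >= r*total) return c;
 }
 return c;
}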

Now it's certainly the case that we could do better, in general, than search for the value of \(k\) by iterating, say by using an integer variant of the bisection method[8], but that is by far the least of the problems that this approach has.
Much more troubling is that we no more know how much of the variance should be accounted for by the optimal clustering than we know how many clusters it should have. In program 1 I deliberately chose the data to have extremely well defined clusters but I still had to experiment before I found a proportion of the variance that more or less consistently identified the correct clustering.
To see this for yourself, try changing the target proportion r or adding another cluster of points that's not so well separated from those that I chose and note how it's not entirely obvious what value r should be set to.

A Change For The Better

Given that fixing the share of the variance isn't a reliable way to correctly identify the clusters, it's not entirely unreasonable to suppose that we might do better by choosing it based upon some property of the data itself. Now we can't very well do so by direct observation of the data since, if we could, we should be able to identify the correct clustering in the same way. Fortunately we have another source of information: the variance itself!
Specifically, we can look for some clue that we've found the optimal clustering in the way that it changes as we increase \(k\).

Program 3 plots the proportion of the variance accounted for by a \(k\) means clustering of the default data of program 1 against \(k\) and, if you run it several times, you should notice that the rate at which it increases drops significantly at \(k\) equal to five, the correct number of clusters.

Program 3: Variance Versus k

The variability in the outcome of program 3 is a consequence of the randomised nature of the \(k\) means algorithm, which we can smooth out by applying it multiple times for each \(k\) and averaging the results, as demonstrated by program 4.

Program 4: Average Variance Versus k
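
Under the same assumption about ak.kMeansClustering's arguments, the averaging might be sketched as follows, yielding for each \(k\) the average proportion of the variance accounted for.

// a sketch of the smoothing: cluster the data several times for each k and
// average the proportion of variance accounted for over the runs
// n.b. the (data, k) argument order of ak.kMeansClustering is an assumption
function averagedProportions(data, kmax, runs) {
 var v = [];
 var k, run, c, total, sum;

 for(k=1;k<=kmax;++k) {
  sum = 0;
  for(run=0;run<runs;++run) {
   c = ak.kMeansClustering(data, k);
   if(total===undefined) total = totalSum2(c.data) / c.data.size();
   sum += ak.betweenClusterVariance(c) / total;
  }
  v.push(sum/runs); // v[k-1] is the average proportion for k clusters
 }
 return v;
}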

We can exaggerate the drop in the rate of change by plotting the gradient of the graph
\[ v^\prime_k = \frac{v_{k+1}-v_k}{k+1 - k} = v_{k+1}-v_k \]
as program 5 does, expressed as a proportion of the total variance.

Program 5: Average Variance Gradient Versus k
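
Given an array of averaged proportions like the one produced by the previous sketch, the gradient is simply the sequence of forward differences, as in this short sketch.

// the forward differences of the averaged proportions of variance; g[k-1]
// approximates the rate of change in moving from k to k+1 clusters
function gradients(v) {
 var g = [];
 var k;
 for(k=0;k+1<v.length;++k) g.push(v[k+1]-v[k]);
 return g;
}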

Now this is certainly better than guessing what proportion of the variance the clustering should account for, but deciding when the rate of change is small enough is still a rather subjective business and consequently rather difficult to automate.

References

[1] K Means Business, www.thusspakeak.com, 2015.

[2] The Tripods Are Coming!, www.thusspakeak.com, 2015.

[3] The Tripods Are Here!, www.thusspakeak.com, 2015.

[4] The Principals Of The Things, www.thusspakeak.com, 2014.

[5] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.

[6] What's Our Vector, Victor?, www.thusspakeak.com, 2014.

[7] Our Imaginary Friends, www.thusspakeak.com, 2014.

[8] Rooting Around For Answers, www.thusspakeak.com, 2014.
