Last time^{[1]} we took a first brief look at cluster analysis in which we seek to algorithmically differentiate sets of data into subsets of similar data. The difficulty in doing so stems from the fact that similarity is a rather poorly defined concept and so clustering algorithms typically proceed by updating cluster memberships according to some rule of thumb, or heuristic, that is designed to reflect some intuitive notion of similarity. As a consequence, clusters have the rather unusually circular definition of that which are identified by a clustering algorithm!
We then took a rather tedious detour into the implementation of a class to represent the results of a clustering, both as a mapping from member IDs to cluster IDs and as a mapping from cluster IDs to sets of member IDs, deferring the definition of our first clustering algorithm to this post.
The K Means Algorithm
Perhaps the simplest clustering heuristic is the one employed by the \(k\) means algorithm; pick \(k\) points at random as representatives of clusters and assign each point to the cluster whose representative it is closest to. The representatives are then replaced by the average of their cluster's members and the process is repeated until the memberships stop changing.

Congratulations if you spotted the gigantic unstated assumption that I slipped into that description; that the memberships will stop changing!
Why It Converges
Derivation 1: Minimising The Sum Of Squared Differences
Given that we can set \(x\) to be as much greater as we like than the greatest of the \(x_i\), the sum of the squared differences between the former and the latter cannot have a maximum value.
It must therefore be the case that the minimum of \(f\) occurs where its derivative equals zero.
\[
\frac{\mathrm{d}}{\mathrm{d}x} \sum_{i=1}^n \left(x-x_i\right)^2 = 2 \sum_{i=1}^n \left(x-x_i\right) = 0
\]
Rearranging yields
\[
\begin{align*}
2 \sum_{i=1}^n x &= 2 \sum_{i=1}^n x_i\\
nx &= \sum_{i=1}^n x_i\\
x &= \frac{1}{n} \sum_{i=1}^n x_i
\end{align*}
\]
The first thing to note is that if we take a set of \(n\) numbers \(x_i\) and define a function \(f(x)\) as the sum of the squared differences between \(x\) and those numbers
\[
f(x) = \sum_{i=1}^n \left(x-x_i\right)^2
\]
where \(\sum\) is the summation sign, then that function takes its minimum value at their mean
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
\]
as proven in derivation 1.

The next thing to note is that the distance between two vectors is equal to the square root of the sum of the squares of the differences between their elements. This follows from the fact that the length of a vector, denoted by vertical bars on either side of it, is equal to the square root of its product with itself
\[
\left|\mathbf{x}\right| = \sqrt{\mathbf{x} \times \mathbf{x}} = \sqrt{\sum_i x_i \times x_i} = \sqrt{\sum_i x_i^2}
\]
and that the distance between two vectors is the length of their vector difference, given by taking the differences between each element. Squaring this distance, we have
\[
\left|\mathbf{x} - \mathbf{y}\right|^2 = \sum_i \left(x_i - y_i\right)^2
\]
and, given that each term in the sum clearly has no impact upon any of the others, the vector that has the smallest sum of squared distances from a set of vectors must be the one whose elements have the smallest sums of squared distances from the respective elements of those vectors, or in other words is their mean.

The upshot of these observations is that when a datum is moved from its current cluster to one with a closer representative the total sum of the squared distances between the data and their cluster representatives must decrease; initially because that datum now trivially has a smaller squared distance to its cluster representative, and thereafter because the updated representatives of the two affected clusters must minimise the sums of the squared distances to their new sets of members.
Now if this sounds somewhat familiar, it should; the \(k\) means heuristic is simply a local minimisation algorithm just like the polytope method that we implemented just a few posts ago^{[2][3]}.
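Since everything above rests upon the observation that the mean minimises the sum of squared differences, it's easily checked numerically. The following sketch uses plain numbers rather than ak.vector objects, and sumSquaredDiffs is a hypothetical helper introduced purely for illustration.

```javascript
// A quick numerical check that the mean minimises the sum of squared
// differences f(x). Plain numbers stand in for vectors; sumSquaredDiffs
// is an illustrative helper, not part of the ak library.
function sumSquaredDiffs(x, xs) {
 var s = 0;
 for (var i = 0; i < xs.length; ++i) s += (x - xs[i]) * (x - xs[i]);
 return s;
}

var xs = [1, 2, 4, 9];
var mean = (1 + 2 + 4 + 9) / 4; // 4

console.log(sumSquaredDiffs(mean, xs)); // 38

// Nudging x away from the mean in either direction increases f.
console.log(sumSquaredDiffs(mean, xs) < sumSquaredDiffs(mean - 0.1, xs)); // true
console.log(sumSquaredDiffs(mean, xs) < sumSquaredDiffs(mean + 0.1, xs)); // true
```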
I Wouldn't Start From Here If I Were You
One of the weaknesses of local minimisation algorithms is that the results upon which they converge are highly dependent upon their starting points. An unfortunate choice of the initial cluster representatives might therefore result in a clustering that has far from the smallest possible sum of squared distances between the data and their cluster representatives.

Fortunately, we have a significant advantage that we didn't have before; we know what the problem we're trying to solve is and so we can make an educated guess as to roughly where a good solution might be found. Specifically, if our data have well separated clusters then we should expect their representatives to be well separated too and so we can simply choose initial representatives from our data that are, well, well separated.
It so happens that we can do this in a way that provably ensures that the \(k\) means algorithm has a good probability of converging upon a near optimal clustering, in the sense of minimising the sum of squared distances between the representatives and members of the final clusters, and furthermore will generally do so in relatively few steps^{[4]}.
The trick is to pick the first cluster representative from our data at random and thereafter base the probability of selecting a datum as an initial representative upon its distance to the nearest of those already chosen, as compared to such distances of the other data, with those that have the greater distances having the greater probabilities of selection.
Specifically, if the distance from the \(i^\mathrm{th}\) datum to the nearest of the existing cluster representatives is \(d_i\) then its probability of being selected to be the next of them is
\[
\frac{d_i}{\sum_{j=1}^n d_j}
\]
To pick a new initial representative with that probability we simply generate a standard uniform random number and iterate through the data subtracting their selection probabilities from it until it falls below zero. The datum at which it does so is then added to the initial cluster representatives and the process is repeated until we have \(k\) of them. Note that the data that have already been selected must have themselves as their nearest representative and, having a distance to themselves of zero, trivially cannot be selected again.
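The selection-by-subtraction step described above can be sketched in isolation with plain numbers standing in for the minimum distances; pickIndex is a hypothetical helper, not part of the ak library.

```javascript
// Choose an index with probability proportional to its weight by
// subtracting weights from a uniform random fraction of their sum
// until it falls below zero, as described above.
function pickIndex(weights, rnd) {
 var sum = 0, i;
 for (i = 0; i < weights.length; ++i) sum += weights[i];
 var r = rnd() * sum;
 for (i = 0; i < weights.length && r >= weights[i]; ++i) r -= weights[i];
 // Guard against rounding errors carrying us off the end of the array by
 // rewinding to the last index with a non-zero weight.
 if (i === weights.length) while (weights[--i] === 0);
 return i;
}

// A datum with zero distance, i.e. one that is already a representative,
// can never be selected.
console.log(pickIndex([0, 2, 0, 1], Math.random)); // 1 or 3, never 0 or 2
```

Note that, just as in the text, the zero-weight entries are skipped over without decrementing the running total, so only the data with non-zero minimum distances can be chosen.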
ak.kMeansClustering
Our implementation of the \(k\) means algorithm, ak.kMeansClustering, follows our usual scheme for implementing non-overloaded numerical functions of explicitly checking the validity of the arguments before deferring to helper functions to do the actual work, as illustrated by listing 1.

Listing 1: ak.kMeansClustering
ak.kMeansClustering = function(data, k, rnd) {
 var ids = [];
 var i, j;

 if(ak.nativeType(data)!==ak.ARRAY_T) {
  throw new Error('invalid data in ak.kMeansClustering');
 }
 for(i=0;i<data.length;++i) {
  for(j=0;j<i;++j) ak.dist(data[i], data[j]);
  if(isFinite(ak.abs(data[i]))) ids.push(i);
 }

 if(ak.nativeType(rnd)===ak.UNDEFINED_T) rnd = Math.random;
 else if(ak.nativeType(rnd)!==ak.FUNCTION_T) {
  throw new Error('invalid random number generator in ak.kMeansClustering');
 }

 if(ak.nativeType(k)===ak.NUMBER_T) {
  if(k!==ak.floor(k) || k<0) {
   throw new Error('invalid initial means in ak.kMeansClustering');
  }
  k = initialMeans(data, ids, k, rnd);
 }
 else if(ak.nativeType(k)!==ak.ARRAY_T) {
  throw new Error('invalid initial means in ak.kMeansClustering');
 }
 else {
  k = k.slice(0);
  i = 0;
  while(i<k.length) {
   for(j=0;j<data.length;++j) ak.dist(k[i], data[j]);
   if(isFinite(ak.abs(k[i]))) {
    ++i;
   }
   else {
    k[i] = k[k.length-1];
    k.pop();
   }
  }
 }
 return ak.clustering(kMeans(data, ids, k), data);
};
So the first thing that we do is check that data is an array of compatible numeric objects by calling ak.dist with each pair of them and relying upon it to throw an exception if they're not. Typically they shall all be ak.vector objects^{[5]} of equal dimension, but we might also like to represent the points with ak.complex objects^{[6]} which we can freely mix with ordinary numbers. We also keep track of those that have finite magnitude in the ids array.

Next, if k is a number then we check that it is a non-negative integer before replacing it with the result of the initialMeans helper function which, as its name suggests, implements the algorithm described above for the selection of the initial cluster representatives, using rnd as the random number generator and defaulting to Math.random if it's undefined.

If k is not a number then we require it to be an array of numeric objects that are compatible with the data. Whilst we check that this is the case, we remove any of them that do not have a finite magnitude.

Finally, we construct an ak.clustering with the results of the kMeans helper function that implements the \(k\) means algorithm itself.

The implementation of initialMeans is given in listing 2.

Listing 2: The initialMeans Helper Function
function initialMeans(data, ids, k, rnd) {
 var n = ids.length;
 var m = [];
 var dist, r, s, i, j;

 if(n===0 || k===0) return m;

 dist = new Array(n);

 r = rnd();
 if(!(r>=0) || r>=1) {
  throw new Error('invalid random number generator in ak.kMeansClustering');
 }
 m.push(data[ids[ak.floor(r*n)]]);

 for(i=1;i<k;++i) {
  s = 0;
  for(j=0;j<n;++j) {
   dist[j] = minDist(data[ids[j]], m);
   s += dist[j];
  }
  if(s===0) return m;

  r = rnd();
  if(!(r>=0) || r>=1) {
   throw new Error('invalid random number generator in ak.kMeansClustering');
  }

  s *= r;
  for(j=0;j<n && s>=dist[j];++j) s -= dist[j];
  if(j===n) while(dist[--j]===0);
  m.push(data[ids[j]]);
 }
 return m;
}
Whilst the algorithm for selecting the initial representatives isn't particularly complicated, there are some corner cases that we must be careful to correctly handle.

Firstly, if there are no finite data points from which to select them or if k is zero we simply return an empty array, otherwise we pick one of the finite data at random as the initial representative.

Next, for each subsequent representative, we loop through the finite data calculating their minimum distances to any of the already chosen representatives with the minDist helper function. If the sum of these distances is zero then the chosen representatives must already contain every distinct finite point and so we return them immediately.

Finally, when iterating through the data and subtracting their minimum distances from the random fraction of their sum it is possible that, due to rounding errors, we will reach the end of the loop without it falling below zero, albeit with a tiny probability unless their sum was close to underflow and so only accurate to a fraction of its digits. In such cases we simply rewind to the last of them that had a non-zero minimum distance, which we know must exist since that sum would have been zero otherwise and we would have already returned from the function.

These corner cases dealt with, listing 3 gives the implementation of the minDist helper function.

Listing 3: The minDist Helper Function
function minDist(v, k) {
 var n = k.length;
 var dc = ak.dist(v, k[0]);
 var i, di;

 for(i=1;i<n;++i) {
  di = ak.dist(v, k[i]);
  if(!(di>dc)) dc = di;
 }
 return dc;
}
Having done everything necessary to set up the initial cluster representatives we're ready to implement the \(k\) means algorithm itself with the kMeans helper function given in listing 4.

Listing 4: The kMeans Helper Function
function kMeans(data, ids, k) {
 var nd = data.length;
 var ni = ids.length;
 var nk = k.length;
 var md = new Array(nd);
 var ck = new Array(nk);
 var i, id, diff, mi, j;

 for(i=0;i<nd;++i) md[i] = nk;
 if(nk===0) return md;

 do {
  diff = false;
  for(i=0;i<ni;++i) {
   id = ids[i];
   mi = nearest(data[id], k);
   if(mi!==md[id]) {
    diff = true;
    md[id] = mi;
   }
  }

  if(diff) {
   for(i=0;i<nk;++i) ck[i] = 0;
   for(i=0;i<ni;++i) {
    id = ids[i];
    j = md[id];
    k[j] = ck[j]!==0 ? ak.add(k[j], data[id]) : data[id];
    ++ck[j];
   }
   for(j=0;j<nk;++j) k[j] = ak.div(k[j], ck[j]);
  }
 } while(diff);

 return md;
}
This is a reasonably straightforward implementation of the algorithm, but there are still a couple of subtleties that bear mention.
Firstly, we need to handle the non-finite data points that are not referenced by the ids array. We do this by first assigning every member to a cluster whose ID equals the number of representatives in k, and is consequently greater than that of any of the real clusters, so that it will contain all of the points that were not clustered. Note that if k is empty we return it immediately since none of them would be clustered.

Next, for each of the finite data points we find the nearest representative with the helper function nearest and, if it's different to its previous nearest representative, we mark the clustering as having changed.

If it has changed then we iterate through the data updating the representatives, using ck to keep track of the number of members in each cluster. Note that in the unlikely event that a cluster ends up with no members, its updated representative will be the result of a division by zero and will consequently involve either infinities, NaNs or both. The resulting possibility that the distance between a data point and a cluster representative might be NaN is the reason why the nearest helper function uses a not greater than comparison rather than a less than comparison, ensuring that non-NaN distances, of which there must be at least one since if a cluster ends up empty then another must have captured its members, will be preferred to NaN distances.

Finally, if it hasn't changed we simply stop and return the cluster memberships.
The last piece of the puzzle is the implementation of nearest, which is pretty similar to minDist, as illustrated by listing 5.

Listing 5: The nearest Helper Function
function nearest(v, k) {
 var n = k.length;
 var c = 0;
 var dc = ak.dist(v, k[0]);
 var i, di;

 for(i=1;i<n;++i) {
  di = ak.dist(v, k[i]);
  if(!(di>dc)) {
   c = i;
   dc = di;
  }
 }
 return c;
}
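The not greater than comparison can be seen doing its job in a plain-number version of the search, with Math.abs standing in for ak.dist; this is purely an illustrative sketch, not part of the ak library.

```javascript
// A plain-number sketch of the nearest representative search, showing
// that a NaN running minimum is replaced by any subsequent distance
// because every comparison with NaN is false, making !(di>dc) true.
function nearestSketch(v, k) {
 var c = 0;
 var dc = Math.abs(v - k[0]);
 for (var i = 1; i < k.length; ++i) {
  var di = Math.abs(v - k[i]);
  if (!(di > dc)) { c = i; dc = di; }
 }
 return c;
}

console.log(nearestSketch(1, [NaN, 10, 2])); // 2: the NaN representative is passed over
console.log(nearestSketch(1, [3, 10, 2]));   // 2: the ordinary nearest case
```

Had the comparison been di<dc instead, a NaN first distance would never have been replaced and the search would always have returned zero.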
And with that in place we're finally ready to take a look at the \(k\) means algorithm in action!
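Before turning to the interactive programs, the whole scheme can be summarised with a stripped-down sketch on plain numbers, with Math.abs standing in for ak.dist. Note that kMeansSketch is a hypothetical illustration rather than the ak implementation; in particular it simply leaves an empty cluster's representative where it is instead of relying upon the NaN handling described above.

```javascript
// A minimal sketch of the k means loop on plain numbers: assign each
// datum to its nearest representative, replace each representative with
// its cluster's mean and repeat until the memberships stop changing.
function kMeansSketch(data, reps) {
 var members = new Array(data.length);
 var changed = true;
 while (changed) {
  changed = false;

  // Assign each datum to the cluster of its nearest representative.
  for (var i = 0; i < data.length; ++i) {
   var c = 0;
   for (var j = 1; j < reps.length; ++j) {
    if (Math.abs(data[i] - reps[j]) < Math.abs(data[i] - reps[c])) c = j;
   }
   if (members[i] !== c) { members[i] = c; changed = true; }
  }

  // Replace each non-empty cluster's representative with its mean.
  if (changed) {
   for (var j = 0; j < reps.length; ++j) {
    var sum = 0, n = 0;
    for (var i = 0; i < data.length; ++i) {
     if (members[i] === j) { sum += data[i]; ++n; }
    }
    if (n > 0) reps[j] = sum / n;
   }
  }
 }
 return members;
}

console.log(kMeansSketch([1, 2, 3, 10, 11, 12], [2, 10])); // [0, 0, 0, 1, 1, 1]
```

Even with poorly placed initial representatives such as [1, 3] the two well separated groups are recovered, which is precisely the behaviour that the convergence argument above promises for data of this kind.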
Using ak.kMeansClustering
Program 1 has a predefined variable points containing an array of ak.vector data points to be clustered which are plotted in black before the program is run. It uses ak.kMeansClustering to cluster the data and then plots them in colours identifying which cluster they are members of.
Program 1: Globular Clusters Good



Now these data have been specifically chosen to have two clusters that are easily identified by the \(k\) means algorithm and you'll note that they are consistently identified in multiple runs of the program.
To see one of the ways that the algorithm can go wrong, set k to three or more rather than two and then run the program a few times. Not only are the resulting clusters rather unintuitive, but they are also inconsistently identified.

Another problem with the \(k\) means algorithm is that its implicit definition of similarity is one that minimises the distances between points within the same clusters versus the distances between points in different clusters. As noted in the last post, this means that if we stretch the points out diagonally then we shall fail to correctly identify the clusters, as illustrated by program 2.
Program 2: Linear Clusters Bad



Note that in both of these programs you can add data points to the points array by clicking on the chart area and empty it by clicking on the Clear button to the left of the Reset button. To simplify the implementation of this, the bounds property access method of the scatter object has been set to read only, so you won't be able to change the ranges of the axes from their defaults of zero to one.

Whilst there's not very much that we can do to improve the \(k\) means algorithm's treatment of non-globular clusters, there are several schemes for determining an optimal choice for \(k\), for various definitions of optimal, and we shall take a look at some of them in the next few posts. Until then you can find the implementation of ak.kMeansClustering in KMeansClustering.js.
References
[1] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.
[2] The Tripods Are Coming!, www.thusspakeak.com, 2015.
[3] The Tripods Are Here!, www.thusspakeak.com, 2015.
[4] Arthur, D. & Vassilvitskii, S., k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 1027-1035, 2007.
[5] What's Our Vector, Victor?, www.thusspakeak.com, 2014.
[6] Our Imaginary Friends, www.thusspakeak.com, 2014.