Last time we saw how we could use the list of the nearest neighbours of a datum in a set of clustered data to calculate its strengths of association with their clusters^{[1]}. Specifically, we used the \(k\) nearest neighbours algorithm to define those strengths as the proportions of its \(k\) nearest neighbours that were members of each cluster, or with a generalisation of it that assigned weights to the neighbours according to their positions in the list.
This time we shall take a look at a clustering algorithm that uses nearest neighbours to identify clusters, contrasting it with the \(k\) means clustering algorithm that we covered about four years ago^{[2]}.
Recall that the \(k\) means algorithm constructs a clustering iteratively by first picking \(k\) cluster exemplars and assigning each datum to the cluster whose exemplar it's closest to. The exemplars are then replaced by the means of their clusters and the process is repeated until the cluster memberships stop changing.
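As an aside, that iteration can be sketched in a few lines of self-contained JavaScript. Note that this toy picks the first \(k\) data as the initial exemplars and uses squared Euclidean distance in the plane; the name and those choices are illustrative assumptions, and it is emphatically not the ak.kMeansClustering implementation.

```javascript
// A minimal k means sketch for points in the plane, using squared
// Euclidean distance; the function name and the choice of the first k
// data as the initial exemplars are illustrative assumptions, and this
// is not the ak.kMeansClustering implementation.
function kMeans(data, k) {
  var exemplars = data.slice(0, k).map(function(p) { return p.slice(); });
  var memberships = new Array(data.length);
  var changed = true;

  // index of the exemplar closest to the point p
  function nearest(p) {
    var best = 0, bestD = Infinity, i, dx, dy, d;
    for(i=0;i<exemplars.length;++i) {
      dx = p[0]-exemplars[i][0]; dy = p[1]-exemplars[i][1];
      d = dx*dx + dy*dy;
      if(d<bestD) { bestD = d; best = i; }
    }
    return best;
  }

  while(changed) {
    changed = false;

    // assign each datum to the cluster whose exemplar it's closest to
    data.forEach(function(p, i) {
      var c = nearest(p);
      if(memberships[i]!==c) { memberships[i] = c; changed = true; }
    });

    // replace each exemplar with the mean of its cluster's members
    exemplars.forEach(function(e, c) {
      var sx = 0, sy = 0, m = 0;
      data.forEach(function(p, i) {
        if(memberships[i]===c) { sx += p[0]; sy += p[1]; ++m; }
      });
      if(m>0) { e[0] = sx/m; e[1] = sy/m; }
    });
  }
  return memberships;
}
```

For a pair of well separated globular clusters it quickly settles upon the obvious memberships.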
If we have a good idea of how many clusters the data have then this can be an effective and efficient way to identify them, as demonstrated by program 1 for a pair of globular clusters using our ak.kMeansClustering implementation.
Program 1: Globular K Means
Note that you can add new points to the data by clicking on the chart and that you can clear them with the button to the left of the reset button.
Unfortunately it's not so effective if we don't know how many clusters the data have, which you can confirm by setting k to three or four in program 1, splitting up one or both of the clusters. Furthermore, if you then run it several times you will see that they aren't even consistently identified.

We tried to address this by defining measures of what constituted a good clustering^{[3][4]} and then searching for a good value for \(k\), but this was easily frustrated by nonglobular clusters like those in program 2.
Program 2: Linear K Means
Since the \(k\) means algorithm determines cluster memberships by the distance from each datum to the cluster exemplars, it cannot possibly identify clusters like these, no matter how clearly we can see them; the points at the lower left ends of the two clusters are closer to each other than they are to the points at their respective upper right ends.
Mutually Assured Clustering
We can use the nearest neighbours of each datum in a set of data to define a more abstract measure of similarity, that of being in the same neighbourhood; specifically, one datum is similar to another if they are both in each other's \(k\) nearest neighbours. This has the attractive property of being scale free in the sense that densely populated regions of a space will have small neighbourhoods and sparsely populated regions will have large ones. We can therefore use it to construct clusterings that reflect the local distribution of the data rather than being tied down to its global distribution.

To do this we further define similarity to be transitive, so that if \(a\) is similar to \(b\) and \(b\) is similar to \(c\) then we consider \(a\) to be similar to \(c\). We can then put each datum in its own cluster and iterate over them, moving each datum that comes before the current one in the data set, together with any members of its cluster, to the latter's cluster if they are within each other's \(k\) nearest neighbours, as done by the mutualNeighboursClustering function in listing 1.

Listing 1: mutualNeighboursClustering

function mutualNeighboursClustering(data, k, f) {
 var n, neighbours, memberships, i, j;

 if(ak.nativeType(data)!==ak.ARRAY_T) {
  throw new Error('invalid data in mutualNeighboursClustering');
 }
 n = data.length;

 if(ak.nativeType(k)!==ak.NUMBER_T || ak.floor(k)!==k || k<0 || k>n) {
  throw new Error('invalid neighbourhood size in mutualNeighboursClustering');
 }

 neighbours = ak.nearestNeighbours(data, data, k, f);
 memberships = new Array(n);

 for(i=0;i<n;++i) {
  memberships[i] = i;
  neighbours[i].sort(ak.numberCompare);

  for(j=0;j<i;++j) {
   if(memberships[j]!==i && mutual(neighbours, i, j)) {
    replace(memberships, memberships[j], i);
   }
  }
 }
 return ak.clustering(memberships, data);
}
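The ak.nearestNeighbours function itself isn't shown here; under the assumption that it returns, for each datum, an array of the indices of its \(k\) nearest neighbours under the distance function f, a brute force stand-in might look like the following, with an illustrative name.

```javascript
// A brute force stand-in for the assumed behaviour of
// ak.nearestNeighbours: for each datum, the indices of the k other
// data that are closest to it under the distance function f.
function nearestNeighbours(data, k, f) {
  return data.map(function(p, i) {
    var indices = [], j;
    for(j=0;j<data.length;++j) if(j!==i) indices.push(j);

    // sort the other data's indices by their distance from the ith datum
    indices.sort(function(a, b) { return f(p, data[a]) - f(p, data[b]); });
    return indices.slice(0, k);
  });
}
```

For example, with the data 0, 1, 2 and 10 on the line and absolute difference for f, the neighbours of the first datum are the indices 1 and 2.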
Note that if a datum has already been moved to the current datum's cluster then we don't need to do anything, and that otherwise we're using the mutual and replace helper functions to determine whether the pair of data are neighbours and to update the cluster memberships respectively, both of which are given in listing 2.

Listing 2: The mutual And replace Helper Functions

function mutual(neighbours, i, j) {
 return ak.binarySearch(neighbours[i], j, ak.numberCompare)
     && ak.binarySearch(neighbours[j], i, ak.numberCompare);
}

function replace(memberships, j, i) {
 var k;
 for(k=0;k<=j;++k) if(memberships[k]===j) memberships[k] = i;
}
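To see the whole scheme in action outside of the ak library, the following self-contained sketch applies the same pass to points on a line, substituting a brute force neighbour search for ak.nearestNeighbours and a plain indexOf membership test for ak.binarySearch; the names are illustrative and not part of ak.

```javascript
// A self-contained sketch of the mutual neighbours pass for points on
// a line, substituting a brute force neighbour search for
// ak.nearestNeighbours and indexOf for ak.binarySearch; illustrative
// only, not the ak library's code.
function cluster(data, k) {
  var n = data.length;

  // the indices of each datum's k nearest neighbours, by brute force
  var neighbours = data.map(function(p, i) {
    var indices = [], j;
    for(j=0;j<n;++j) if(j!==i) indices.push(j);
    indices.sort(function(a, b) {
      return Math.abs(p-data[a]) - Math.abs(p-data[b]);
    });
    return indices.slice(0, k);
  });

  var memberships = new Array(n), i, j, m, old;
  for(i=0;i<n;++i) {
    memberships[i] = i;
    for(j=0;j<i;++j) {
      if(memberships[j]!==i
      && neighbours[i].indexOf(j)!==-1
      && neighbours[j].indexOf(i)!==-1) {
        // move the members of j's cluster to i's, as replace does
        old = memberships[j];
        for(m=0;m<=j;++m) if(memberships[m]===old) memberships[m] = i;
      }
    }
  }
  return memberships;
}
```

Applied to two triplets of points at opposite ends of the line, it gathers each triplet into a single cluster.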
Here we are able to use our ak.binarySearch algorithm^{[5]} to check whether the pairs of data are in each other's nearest neighbours because we had previously sorted them in mutualNeighboursClustering. If we hadn't then, although we would have saved the \(O\left(k \ln k\right)\) cost of sorting the neighbours of each of the \(n\) data, we would have paid for it with an \(O(k)\) versus an \(O(\ln k)\) cost of checking whether each of the \(O\left(n^2\right)\) pairs of data are neighbours. Since \(k\) cannot be greater than \(n\) this means that the overall \(O\left(n^2 \ln k\right)\) cost of working with sorted neighbourhoods should be preferred to the \(O\left(n^2 k\right)\) cost of working without them.

Program 3 demonstrates the results of mutualNeighboursClustering for the same pair of globular clusters as were defined in program 1.
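For illustration, the membership test on a sorted neighbourhood can be written as a plain binary search along the following lines; this is a sketch of the assumed behaviour rather than ak.binarySearch itself.

```javascript
// Tests whether x appears in a sorted array of neighbour indices by
// binary search, halving the search interval at each step for an
// O(ln k) membership check.
function contains(sorted, x) {
  var lo = 0, hi = sorted.length-1, mid;
  while(lo<=hi) {
    mid = (lo+hi)>>1;
    if(sorted[mid]===x) return true;
    if(sorted[mid]<x) lo = mid+1;
    else hi = mid-1;
  }
  return false;
}
```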
Program 3: Globular Mutual Neighbours
I recommend that you add clusters at the bottom left and top right by clicking on the chart to add new points and then run the program again so that you can see that it can identify multiple clusters without requiring their number to be specified in advance, unlike ak.kMeansClustering, which you can confirm by doing the same in program 1.

Now, since our definition of similarity is transitive, members of a cluster can be arbitrarily distant from each other if there is a sequence of points that connects them, and so this algorithm can identify nonglobular clusters, as demonstrated by program 4.
Program 4: Linear Mutual Neighbours
Shared And Shared Alike
The mutual neighbours clustering algorithm is actually a simplification of Jarvis and Patrick's shared near neighbours algorithm^{[6]}, which requires that two data must also share some threshold \(t\) of their \(k\) nearest neighbours to be identified as belonging to the same cluster.

The ak.sharedNeighboursClustering function given in listing 3 adds this condition with the shared helper function.

Listing 3: ak.sharedNeighboursClustering

ak.sharedNeighboursClustering = function(data, k, t, f) {
 var n, neighbours, memberships, i, j;

 if(ak.nativeType(data)!==ak.ARRAY_T) {
  throw new Error('invalid data in ak.sharedNeighboursClustering');
 }
 n = data.length;

 if(ak.nativeType(k)!==ak.NUMBER_T || ak.floor(k)!==k || k<0 || k>n) {
  throw new Error('invalid neighbourhood size in ak.sharedNeighboursClustering');
 }

 if(ak.nativeType(t)!==ak.NUMBER_T || ak.floor(t)!==t || t<0 || t>k) {
  throw new Error('invalid threshold in ak.sharedNeighboursClustering');
 }

 neighbours = ak.nearestNeighbours(data, data, k, f);
 memberships = new Array(n);

 for(i=0;i<n;++i) {
  memberships[i] = i;
  neighbours[i].sort(ak.numberCompare);

  for(j=0;j<i;++j) {
   if(memberships[j]!==i && mutual(neighbours, i, j)
                         && shared(neighbours, t, i, j)) {
    replace(memberships, memberships[j], i);
   }
  }
 }
 return ak.clustering(memberships, data);
};
That function is implemented in listing 4 and exploits the fact that the neighbours of each datum are sorted to count how many they share as efficiently as possible.
Listing 4: The shared Helper Function

function shared(neighbours, t, i, j) {
 var ni = neighbours[i];
 var nj = neighbours[j];
 var n = ni.length;
 var c;

 i = j = 0;
 while(i<n && j<n && t>0) {
  c = ni[i]-nj[j];
  if(c<=0) ++i;
  if(c>=0) ++j;
  if(c===0) --t;
 }
 return t===0;
}
Specifically, if the \(i^\mathrm{th}\) neighbour of one datum has a lower index than the \(j^\mathrm{th}\) neighbour of the other's then we must increment \(i\) and continue the count, if it has a higher index then we must increment \(j\), and if they are equal then we can decrement the target threshold \(t\) and increment both \(i\) and \(j\), stopping if we reach the end of either's neighbours or the threshold falls to zero, consequently giving the algorithm an overall cost of \(O\left(n^2 k\right)\). Note that this means that if we set the target threshold to zero then the loop terminates immediately and we'll get the same results as we would have had from the mutual neighbours algorithm, with the same \(O\left(n^2 \ln k\right)\) cost as it had.
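The counting loop can also be tried in isolation; the following self-contained version of the merge style comparison assumes that the two neighbourhoods are plain sorted arrays of indices and uses an illustrative name rather than the listing's.

```javascript
// Returns whether two sorted arrays of neighbour indices share at
// least t elements, stepping through both in a single merge style pass.
function sharesAtLeast(ni, nj, t) {
  var i = 0, j = 0, c;
  while(i<ni.length && j<nj.length && t>0) {
    c = ni[i] - nj[j];
    if(c<=0) ++i;   // ni's index isn't the greater, so move past it
    if(c>=0) ++j;   // nj's index isn't the greater, so move past it
    if(c===0) --t;  // a shared neighbour has been found
  }
  return t<=0;
}
```

For example, the neighbourhoods [1,3,5,7] and [2,3,6,7] share the neighbours 3 and 7, so a threshold of two is met.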
Program 5 shows that ak.sharedNeighboursClustering classifies points in globular clusters as expected.
Program 5: Globular Shared Neighbours
Once again, I'd suggest that you add some more clusters by clicking on the chart area to confirm that the algorithm can identify them.
Program 6 applies it to linear clusters and shows that it's more discriminatory than the mutual near neighbours algorithm, identifying one more cluster in the data.
Program 6: Linear Shared Neighbours
You should definitely experiment with different values of \(k\) and \(t\) to get a feel for how they affect the results; for example, setting the former to eight and leaving the latter at four further highlights the gaps in the two lines of points.
You can also take a look at the implementation of ak.sharedNeighboursClustering in SharedNeighboursClustering.js.
References
[1] In The Neighbourhood, www.thusspakeak.com, 2019.
[2] K Means Business, www.thusspakeak.com, 2015.
[3] A Good K To Try, www.thusspakeak.com, 2015.
[4] Try Another K, www.thusspakeak.com, 2015.
[5] I Still Haven't Found What I'm Looking For, www.thusspakeak.com, 2017.
[6] Jarvis, R. and Patrick, E., Clustering Using a Similarity Measure Based on Shared Near Neighbors, IEEE Transactions on Computers, Vol. C22, No. 11, 1973.