In the previous post^{[1]} we saw how to identify subsets of a set of data that are in some sense similar to each other, known as clusters, by constructing hierarchical clusterings; that is, sequences of clusterings that start with each datum in its own cluster and end with all of the data in the same cluster, subject to the constraint that if a pair of data are in the same cluster in one clustering then they must also be in the same cluster in the next.
We did this by selecting the closest pairs of clusters in one clustering and merging them to create the next, using one of three different measures of the distance between a pair of clusters: the average distance between their members, the distance between their nearest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively.
Unfortunately our implementation came in at a rather costly \(O\left(n^3\right)\) operations and so in this post we shall look at how we can improve its performance.
We shall still need functions to calculate cluster distances according to those linkages and so we shall begin with a recap of how we implemented them last time.
The Linkage Distance Measures
We can calculate the average linkage distance measure for a pair of clusters \(X_i\) and \(X_j\) with
\[
\Delta_\mu\left(X_i, X_j\right) = \frac{\sum_{i{}^\prime \in X_i} \sum_{j{}^\prime \in X_j} d_{i{}^\prime j{}^\prime}}{\left|X_i\right| \times \left|X_j\right|}
\]
where \(\sum\) is the summation sign, \(\in\) means within and the vertical bars stand for the magnitude of the term between them, which in this case is the number of elements in the cluster. Note that the elements of the clusters are the indices of the data that they contain in the set \(X\) and that the \(d_{ij}\) are elements of a distance matrix^{[2]} \(\mathbf{D}\) and are defined as
\[
d_{ij} = d_{ji} = \delta\left(x_i, x_j\right)
\]
where \(\delta\) is a function that calculates the distance between a pair of data \(x_i\) and \(x_j\) which usually has the properties
\[
\begin{align*}
\delta\left(y, x\right) &= \delta\left(x, y\right)\\
\delta\left(x, x\right) &= 0\\
\delta\left(x, y\right) &> 0
\end{align*}
\]
for distinct \(x\) and \(y\), and
\[
\delta\left(x, z\right) \leqslant \delta\left(x, y\right) + \delta\left(y, z\right)
\]
for any \(x\), \(y\) and \(z\), although we only assumed that the first of these must be true.

Listing 1 shows our implementation of \(\Delta_\mu\) where the distance matrix d is expected to be an ak.matrix object and the clusters xi and xj are expected to be ak.cluster objects^{[3]}.

Listing 1: Average Linkage Distance
function averageLinkage(d) {
 var n;

 if(ak.type(d)!==ak.MATRIX_T || d.rows()!==d.cols()) {
  throw new Error('invalid distance matrix in averageLinkage');
 }
 n = d.rows();

 return function(xi, xj) {
  var ni = xi.size();
  var nj = xj.size();
  var i, j, s;

  i = 0;
  while(i<ni && xi.at(i)<n) ++i;
  if(i<ni) throw new Error('invalid distance matrix in averageLinkage');

  j = 0;
  while(j<nj && xj.at(j)<n) ++j;
  if(j<nj) throw new Error('invalid distance matrix in averageLinkage');

  s = 0;
  for(i=0;i<ni;++i) for(j=0;j<nj;++j) s += d.at(xi.at(i), xj.at(j));
  return s/(ni*nj);
 };
}
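For readers without the ak library to hand, the same calculation can be sketched with clusters held as plain arrays of indices into an ordinary two dimensional distance array; averageLinkageSketch is a hypothetical name and this is merely an illustration, not the ak implementation.

```javascript
// An illustrative sketch of the average linkage distance for clusters given
// as plain arrays of indices into a two dimensional distance array d.
function averageLinkageSketch(d) {
 return function(xi, xj) {
  var s = 0;
  for (var i = 0; i < xi.length; ++i)
   for (var j = 0; j < xj.length; ++j) s += d[xi[i]][xj[j]];
  // divide by |Xi| times |Xj|, the number of pairs of members
  return s / (xi.length * xj.length);
 };
}

// distances between the scalars 1, 2 and 4
var dAvg = [
 [0, 1, 3],
 [1, 0, 2],
 [3, 2, 0]
];
var linkage = averageLinkageSketch(dAvg);
console.log(linkage([0], [1, 2])); // (1+3)/2 = 2
```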
Similarly, the single linkage distance function is defined as
\[
\Delta_\alpha\left(X_i, X_j\right) = \min_{i{}^\prime \in X_i \wedge j{}^\prime \in X_j} d_{i{}^\prime j{}^\prime}
\]
where \(\wedge\) means and, and listing 2 gives our implementation of it.

Listing 2: Single Linkage Distance
function singleLinkage(d) {
 var n;

 if(ak.type(d)!==ak.MATRIX_T || d.rows()!==d.cols()) {
  throw new Error('invalid distance matrix in singleLinkage');
 }
 n = d.rows();

 return function(xi, xj) {
  var ni = xi.size();
  var nj = xj.size();
  var i, j, s;

  i = 0;
  while(i<ni && xi.at(i)<n) ++i;
  if(i<ni) throw new Error('invalid distance matrix in singleLinkage');

  j = 0;
  while(j<nj && xj.at(j)<n) ++j;
  if(j<nj) throw new Error('invalid distance matrix in singleLinkage');

  s = ak.INFINITY;
  for(i=0;i<ni;++i) for(j=0;j<nj;++j) s = Math.min(s, d.at(xi.at(i), xj.at(j)));
  return s;
 };
}
Finally, the complete linkage distance is given by
\[
\Delta_\omega\left(X_i, X_j\right) = \max_{i{}^\prime \in X_i \wedge j{}^\prime \in X_j} d_{i{}^\prime j{}^\prime}
\]
as implemented in listing 3.

Listing 3: Complete Linkage Distance
function completeLinkage(d) {
 var n;

 if(ak.type(d)!==ak.MATRIX_T || d.rows()!==d.cols()) {
  throw new Error('invalid distance matrix in completeLinkage');
 }
 n = d.rows();

 return function(xi, xj) {
  var ni = xi.size();
  var nj = xj.size();
  var i, j, s;

  i = 0;
  while(i<ni && xi.at(i)<n) ++i;
  if(i<ni) throw new Error('invalid distance matrix in completeLinkage');

  j = 0;
  while(j<nj && xj.at(j)<n) ++j;
  if(j<nj) throw new Error('invalid distance matrix in completeLinkage');

  s = -ak.INFINITY;
  for(i=0;i<ni;++i) for(j=0;j<nj;++j) s = Math.max(s, d.at(xi.at(i), xj.at(j)));
  return s;
 };
}
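Single and complete linkage admit equally small sketches under the same plain-array representation of clusters; again the names are hypothetical and these are illustrations rather than the ak implementations.

```javascript
// Illustrative sketches of the single and complete linkage distances for
// clusters given as plain arrays of indices into a distance array d.
function singleLinkageSketch(d) {
 return function(xi, xj) {
  var s = Infinity;
  for (var i = 0; i < xi.length; ++i)
   for (var j = 0; j < xj.length; ++j) s = Math.min(s, d[xi[i]][xj[j]]);
  return s;
 };
}

function completeLinkageSketch(d) {
 return function(xi, xj) {
  var s = -Infinity;
  for (var i = 0; i < xi.length; ++i)
   for (var j = 0; j < xj.length; ++j) s = Math.max(s, d[xi[i]][xj[j]]);
  return s;
 };
}

// distances between the scalars 1, 2 and 4
var dSketch = [
 [0, 1, 3],
 [1, 0, 2],
 [3, 2, 0]
];
var single = singleLinkageSketch(dSketch);
var complete = completeLinkageSketch(dSketch);
console.log(single([0], [1, 2]), complete([0], [1, 2])); // 1 3
```

Note that the running minimum starts at positive infinity and the running maximum at negative infinity so that any distance will replace them at the first comparison.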
Our Naive Implementation
Our first implementation of an algorithm for generating hierarchical clusterings worked by iterating over the pairs of clusters in an ak.clustering, keeping a list of the closest pairs according to the linkage distance measure and merging the clusters in that list to generate the next clustering in the hierarchy. We took care to update the indices of the pairs that we had yet to merge in the list as we updated the memberships of the data so that we would correctly merge multiple clusters should there be chains of them in which each is separated from the next by the minimum distance, as given in listing 4.

Listing 4: Constructing The Hierarchical Clusterings
function nextHierarchical(clustering, linkage) {
 var clusters = clustering.clusters;
 var n = clusters.size();
 var min = ak.NaN;
 var mappings = [];
 var members, i, j, d, c;

 if(n<=1) return clustering;
 members = clustering.memberships.toArray();

 if(n===2) {
  for(i=0;i<members.length;++i) members[i] = 0;
 }
 else {
  for(i=0;i<n-1;++i) for(j=i+1;j<n;++j) {
   d = linkage(clusters.at(i), clusters.at(j));
   c = ak.floatCompare(d, min);
   if(c<0) {mappings = [[i,j]]; min = d;}
   else if(c===0) mappings.push([i,j]);
  }

  for(i=0;i<mappings.length-1;++i) for(j=i+1;j<mappings.length;++j) {
   if(mappings[j][0]===mappings[i][1]) mappings[j][0] = mappings[i][0];
   if(mappings[j][1]===mappings[i][1]) mappings[j][1] = mappings[i][0];
  }

  for(i=0;i<mappings.length;++i) for(j=0;j<members.length;++j) {
   if(members[j]===mappings[i][1]) members[j] = mappings[i][0];
  }
 }
 return ak.clustering(members);
}
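The care taken over chains of merges can be seen in isolation with a small sketch of the index updating scheme; compressSketch is a hypothetical name for an illustration of the nested loop over the mappings above.

```javascript
// A sketch of the merge mapping compression used above: if cluster 1 merges
// into cluster 0 and cluster 2 merges into cluster 1, then cluster 2 must
// end up merged into cluster 0.
function compressSketch(mappings) {
 for (var i = 0; i < mappings.length - 1; ++i) {
  for (var j = i + 1; j < mappings.length; ++j) {
   // redirect any later reference to a cluster that has been merged away
   if (mappings[j][0] === mappings[i][1]) mappings[j][0] = mappings[i][0];
   if (mappings[j][1] === mappings[i][1]) mappings[j][1] = mappings[i][0];
  }
 }
 return mappings;
}

console.log(compressSketch([[0, 1], [1, 2]])); // [[0, 1], [0, 2]]
```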
We demonstrated its results for the set of scalars
\[
X = \{17, 2, 8, 4, 5, 14, 10, 1\}
\]
using our ak.distanceMatrix
function to calculate their distance matrix
\[
\mathbf{D} =
\begin{pmatrix}
0 & 15 & 9 & 13 & 12 & 3 & 7 & 16 \\
15 & 0 & 6 & 2 & 3 & 12 & 8 & 1 \\
9 & 6 & 0 & 4 & 3 & 6 & 2 & 7 \\
13 & 2 & 4 & 0 & 1 & 10 & 6 & 3 \\
12 & 3 & 3 & 1 & 0 & 9 & 5 & 4 \\
3 & 12 & 6 & 10 & 9 & 0 & 4 & 13 \\
7 & 8 & 2 & 6 & 5 & 4 & 0 & 9 \\
16 & 1 & 7 & 3 & 4 & 13 & 9 & 0
\end{pmatrix}
\]
by printing the cluster memberships of each datum in each clustering in the hierarchies resulting from each of the linkage distance measures, which is repeated here by program 1.
Program 1: The Hierarchical Clusterings



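In case it's helpful, the matrix \(\mathbf{D}\) can be reproduced with a plain two dimensional array if, as its entries suggest, \(\delta\) was the absolute difference; distanceMatrixSketch is a hypothetical stand-in for ak.distanceMatrix, given purely as an illustration.

```javascript
// A sketch of building a distance matrix as a plain two dimensional array
// from a data set x and a distance function delta.
function distanceMatrixSketch(x, delta) {
 var n = x.length;
 var d = [];
 for (var i = 0; i < n; ++i) {
  d.push([]);
  for (var j = 0; j < n; ++j) d[i].push(delta(x[i], x[j]));
 }
 return d;
}

var xs = [17, 2, 8, 4, 5, 14, 10, 1];
var dm = distanceMatrixSketch(xs, function(a, b) { return Math.abs(a - b); });
console.log(dm[0]); // [0, 15, 9, 13, 12, 3, 7, 16], the first row of D
```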
Now the fundamental inefficiency of this implementation is the nested iteration over the clusters to find the closest pairs of them and so that is what we shall be optimising.
Got Cache?
To remove this inefficiency the key thing to note is that the distances between all of the clusters that weren't merged from one hierarchical clustering to the next won't have changed and so we can save ourselves a lot of unnecessary work by keeping track of them rather than recalculating them. Specifically, for each cluster we can build a cache of the other clusters sorted from the smallest to the largest distance between them. This means that we can identify the closest pairs of clusters with an \(O(n)\) sweep through the distances to the first cluster in each cluster's cache.

To update the unmerged clusters' caches efficiently at each step we'll need to add the merged clusters and their distances to each of them without spending the \(O(n \ln n)\) operations required to fully sort them all over again. Fortunately, we have already implemented a minheap structure^{[4]} with ak.minHeap which enables us to add a new element to a sorted list and put it back into sorted order at a cost of \(O(\ln n)\). Note that we don't need to immediately remove the clusters that were merged from those caches since, provided that we can identify them as no longer existing, we can remove them one at a time whilst we're searching for the closest pairs.

Listing 5 shows the structure of the cache for an ak.clustering object c and a distance function f.

Listing 5: The Cluster Distance Cache
function cacheCompare(q0, q1) {return ak.floatCompare(q0.d, q1.d);}

function cacheElement(c, i, f) {
 var ci = c.at(i);
 var qi = [];
 var n = c.size();
 var j, cj;

 for(j=i+1;j<n;j++) {cj = c.at(j); qi.push({c:cj, d:Number(f(ci, cj))});}
 return {c:ci, q:ak.minHeap(qi, cacheCompare)};
}

function initCache(c, f) {
 var n = c.size();
 var cache = new Array(n);
 var i;

 for(i=0;i<n;++i) cache[i] = cacheElement(c, i, f);
 return cache;
}
Here you can see that we're initialising each cluster's cache with its ak.cluster object and an ak.minHeap of the ak.cluster objects that come after it in the clustering, together with the distances to them. Note that in order to maintain a consistent way of identifying the clusters from one clustering to the next we'll be using the indices of their first members in the data set rather than their indices in the clusterings.

The cacheMappings function given in listing 6 stores the indices of the clusters that have been merged to in the clustering c0 as k0 and of those that have been merged from as k1, updating the ak.cluster reference of the former to the new cluster in c1 and erasing its minheap, and erasing both for the latter.

Listing 6: Clearing Merged Clusters From The Cache
function cacheMappings(cache, mappings, c0, c1) {
 var n = mappings.length;
 var i, k0, k1;

 for(i=0;i<n;++i) {
  k0 = c0.clusters.at(mappings[i][0]).at(0);
  k1 = c0.clusters.at(mappings[i][1]).at(0);

  cache[k0].c = c1.clusters.at(c1.memberships.at(k0));
  cache[k0].q = undefined;

  cache[k1].c = undefined;
  cache[k1].q = undefined;
 }
}
Now to update the caches we simply iterate over them in reverse order, adding the new cluster and its distance to the minheaps of the existing clusters that come before it and creating a new minheap for it containing the clusters that come after it and their distances, as shown by cacheDistances in listing 7.

Listing 7: Updating The Cached Distances
function cacheDistances(cache, f) {
 var n = cache.length;
 var i, j, ci, cj, qi;

 for(i=n-1;i>=0;--i) {
  ci = cache[i];
  if(ak.nativeType(ci.c)!==ak.UNDEFINED_T && ak.nativeType(ci.q)===ak.UNDEFINED_T) {
   qi = [];

   for(j=0;j<i;++j) {
    cj = cache[j];
    if(ak.nativeType(cj.q)!==ak.UNDEFINED_T) {
     cj.q.add({c:ci.c, d:Number(f(ci.c, cj.c))});
    }
   }

   for(j=i+1;j<n;++j) {
    cj = cache[j];
    if(ak.nativeType(cj.q)!==ak.UNDEFINED_T) {
     qi.push({c:cj.c, d:Number(f(ci.c, cj.c))});
    }
   }

   ci.q = ak.minHeap(qi, cacheCompare);
  }
 }
}
Note that we're identifying the caches of the merged clusters by their having a defined ak.cluster and an undefined ak.minHeap, and the caches of the existing clusters that come after it by their having a defined ak.minHeap, which will be true even for merged clusters since we're iterating over the clusters' caches in reverse order.

The updateCache function given in listing 8 calls these two functions in sequence to correctly update the cached distances after each step of the hierarchical clusterings.

Listing 8: Updating The Cache
function updateCache(cache, mappings, c0, c1, f) {
 cacheMappings(cache, mappings, c0, c1);
 cacheDistances(cache, f);
}
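The behaviour that these functions rely upon from ak.minHeap can be sketched with a bare-bones binary heap; MinHeap here is an illustrative stand-in, not the ak implementation.

```javascript
// A bare-bones binary min heap supporting O(ln n) add and shift, ordered by
// a comparator, as a stand-in for ak.minHeap.
function MinHeap(items, compare) {
 this.a = items.slice();
 this.compare = compare;
 // heapify by sinking each internal node in turn
 for (var i = (this.a.length >> 1) - 1; i >= 0; --i) this.down(i);
}
MinHeap.prototype.size = function() { return this.a.length; };
MinHeap.prototype.min = function() { return this.a[0]; };
MinHeap.prototype.add = function(x) {
 var a = this.a, i = a.length, p, t;
 a.push(x);
 // bubble the new element up to restore the heap property
 while (i > 0) {
  p = (i - 1) >> 1;
  if (this.compare(a[i], a[p]) >= 0) break;
  t = a[i]; a[i] = a[p]; a[p] = t; i = p;
 }
};
MinHeap.prototype.shift = function() {
 var a = this.a, m = a[0], last = a.pop();
 // move the last element to the root and sink it back down
 if (a.length > 0) { a[0] = last; this.down(0); }
 return m;
};
MinHeap.prototype.down = function(i) {
 var a = this.a, n = a.length, l, r, s, t;
 for (;;) {
  l = 2 * i + 1; r = l + 1; s = i;
  if (l < n && this.compare(a[l], a[s]) < 0) s = l;
  if (r < n && this.compare(a[r], a[s]) < 0) s = r;
  if (s === i) break;
  t = a[i]; a[i] = a[s]; a[s] = t; i = s;
 }
};

var h = new MinHeap([5, 1, 4], function(x, y) { return x - y; });
h.add(0);
console.log(h.shift(), h.shift()); // 0 1
```

Adding an element bubbles it up through at most \(\log_2 n\) levels and shifting the minimum sinks the last element down through at most as many, which is where the \(O(\ln n)\) cost comes from.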
Finally, to find a cluster's nearest neighbour we simply need to read the first element in its cache that refers to a cluster that still exists, which we can test by simply comparing the cached ak.cluster object to the one associated with the cache at the index of its first member. This is done by the firstElement function in listing 9, which throws away the cached distances to merged clusters until it finds an unmerged cluster or the minheap is empty, returning the cached cluster and distance in the first case and the nullElement object in the second.

Listing 9: Finding The Closest Cluster
var nullElement = {c:undefined, d:ak.NaN};

function firstElement(q, cache) {
 while(q.size()>0 && q.min().c!==cache[q.min().c.at(0)].c) q.shift();
 return q.size()>0 ? q.min() : nullElement;
}
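The lazy deletion scheme that firstElement implements can be seen in isolation with a plain sorted array standing in for the minheap; firstLive and its arguments are hypothetical names for this illustration.

```javascript
// A sketch of lazy deletion: stale entries are discarded only when they
// reach the front of a priority queue, here a sorted array stand-in.
function firstLive(queue, alive) {
 while (queue.length > 0 && !alive[queue[0].c]) queue.shift();
 return queue.length > 0 ? queue[0] : null;
}

var queue = [{c: 1, d: 2}, {c: 3, d: 5}, {c: 0, d: 7}];
var alive = [true, false, true, true];  // cluster 1 has been merged away
console.log(firstLive(queue, alive).c); // 3
```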
Use Cache!
All that we need to do now is rewrite the hierarchical clusterings algorithm to use these caches rather than recalculating the cluster distances at each step.

First of all we scan through the caches to find the smallest distance between any pair of clusters, as illustrated by listing 10.
Listing 10: Finding The Minimum Distance
function minDist(cache) {
 var min = ak.NaN;
 var n = cache.length;
 var i, qi, d;

 for(i=0;i<n;++i) {
  qi = cache[i].q;
  if(ak.nativeType(qi)!==ak.UNDEFINED_T) {
   d = firstElement(qi, cache).d;
   if(isNaN(min) || d<min) min = d;
  }
 }
 return min;
}
Now we can build the list of cluster merge mappings with a second sweep, as shown by listing 11.
Listing 11: Building The Merge Mappings
function minMappings(min, cache, memberships) {
 var mappings = [];
 var n = cache.length;
 var i, qi, qi0, ci0, ci1;

 for(i=0;i<n;++i) {
  qi = cache[i].q;
  if(ak.nativeType(qi)!==ak.UNDEFINED_T && qi.size()>0) {
   ci0 = memberships[cache[i].c.at(0)];
   qi0 = qi.min();

   while(qi0.d===min) {
    ci1 = memberships[qi0.c.at(0)];
    mappings.push([ci0, ci1]);
    qi.shift();
    qi0 = firstElement(qi, cache);
   }
  }
 }
 return mappings;
}
Once again we must take care to ensure that multiple merges are handled correctly, which is done by the compressMappings function in listing 12, with the mergeMappings function calling these functions in order to build the set of mappings.

Listing 12: Updating The Mappings
function compressMappings(mappings) {
 var n = mappings.length;
 var i, j;

 for(i=0;i<n-1;++i) for(j=i+1;j<n;++j) {
  if(mappings[j][0]===mappings[i][1]) mappings[j][0] = mappings[i][0];
  if(mappings[j][1]===mappings[i][1]) mappings[j][1] = mappings[i][0];
 }
}

function mergeMappings(cache, memberships) {
 var dist = minDist(cache);
 var mappings = isNaN(dist) ? [] : minMappings(dist, cache, memberships);
 if(mappings.length>1) compressMappings(mappings);
 return mappings;
}
Note that we're treating a distance of NaN as a special case meaning incomparable, which ranks after an infinite distance, and so the mergeClusters function defined in listing 13 puts every datum into the same cluster if the mappings are empty, indicating that every cluster is incomparable, and otherwise updates their memberships according to the mappings.

Listing 13: Merging The Clusters
function mergeClusters(mappings, memberships) {
 var n = memberships.length;
 var m = mappings.length;
 var i, j;

 if(m===0) {
  for(i=0;i<n;++i) memberships[i] = 0;
 }
 else {
  for(i=0;i<m;++i) {
   for(j=0;j<n;++j) {
    if(memberships[j]===mappings[i][1]) memberships[j] = mappings[i][0];
   }
  }
 }
 return ak.clustering(memberships);
}
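The ranking of NaN after every other distance, including an infinite one, can be sketched with a comparator in the spirit of ak.floatCompare; floatCompareSketch is a hypothetical illustration rather than the ak implementation.

```javascript
// A comparator that ranks NaN after every other value, including infinity,
// so that incomparable distances sort last.
function floatCompareSketch(l, r) {
 if (isNaN(l)) return isNaN(r) ? 0 : 1;
 if (isNaN(r)) return -1;
 return l < r ? -1 : l > r ? 1 : 0;
}

var distances = [NaN, 1, Infinity, 0.5];
distances.sort(floatCompareSketch);
console.log(distances); // [0.5, 1, Infinity, NaN]
```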
The ak.hierarchicalClustering function given in listing 14 and defined in HierarchicalClustering.js puts all of these steps together, first checking that the data argument is either the number of data or an array containing them and that the cluster distance function f is a function, then initialising the memberships, clusters and cache, and finally iteratively building the hierarchical clusterings.

Listing 14: ak.hierarchicalClustering
ak.hierarchicalClustering = function(data, f) {
 var clusters = [];
 var t = ak.nativeType(data);
 var n, memberships, i, c0, c1, cache, mappings;

 if(t===ak.NUMBER_T) {
  n = data;
  if(n<0 || n!==ak.floor(n) || !isFinite(n)) {
   throw new Error('invalid number of data in ak.hierarchicalClustering');
  }
 }
 else if(t===ak.ARRAY_T) {
  n = data.length;
 }
 else throw new Error('invalid data in ak.hierarchicalClustering');

 if(ak.nativeType(f)!==ak.FUNCTION_T) {
  throw new Error('invalid cluster distance function in ak.hierarchicalClustering');
 }

 memberships = new Array(n);
 for(i=0;i<n;++i) memberships[i] = i;
 c0 = ak.clustering(memberships);
 clusters.push(c0);
 cache = initCache(c0.clusters, f);

 while(c0.clusters.size()>2) {
  memberships = c0.memberships.toArray();
  mappings = mergeMappings(cache, memberships);
  c1 = mergeClusters(mappings, memberships);
  clusters.push(c1);
  if(c1.clusters.size()>2) updateCache(cache, mappings, c0, c1, f);
  c0 = c1;
 }

 if(c0.clusters.size()===2) {
  for(i=0;i<n;++i) memberships[i] = 0;
  clusters.push(ak.clustering(memberships));
 }

 return t===ak.NUMBER_T ? ak.clusterings(clusters) : ak.clusterings(clusters, data);
};
As before we can treat a clustering with two clusters as a special case since the only possible next step is to merge them and so we terminate the iteration as soon as there are fewer than three clusters in the current clustering.
Finally, we return the results as an ak.clusterings^{[5]} object.

Program 2 demonstrates the results of ak.hierarchicalClustering for each of our cluster linkage distance functions for the same data as were used with nextHierarchical in program 1 so that you can check that they're the same.
Program 2: The New Hierarchical Clusterings



Hierarchical Clusterings Of Vectors
To get a feel for what hierarchical clusterings of nonscalar data look like we can apply it to the set of two dimensional vectors, represented by an array of ak.vector objects, that we used to demonstrate our ak.kMeansClustering^{[6]} and ak.sharedNeighboursClustering^{[7]} functions. As then, you can add new data to the set by clicking on the chart area, clear the data by clicking on the button to the left of the reset button and restore the original data with the reset button.

Program 3 shows the last few clusterings in the hierarchy generated with the average linkage distance function for a pair of globular clusters.
Program 3: Globular Average Linkage



This looks fairly similar to the results of the \(k\) means algorithm applied to these data with decreasing values of \(k\), as illustrated by program 4.
Program 4: Globular k Means



Unfortunately it also has just as much difficulty identifying nonglobular clusters as the \(k\) means algorithm did, as demonstrated by program 5.
Program 5: Linear Average Linkage



This stands in stark contrast to the single linkage hierarchy for these data, whose clusterings, as program 6 shows, look very much like those generated by the shared neighbours clustering algorithm.
Program 6: Linear Single Linkage



Unfortunately, its results are rather less than satisfying for the globular clusters for which it puts most of their data into two clusters quite quickly whilst leaving a few data in their own clusters until the last few clusterings, as demonstrated by program 7.
Program 7: Globular Single Linkage



Given that the average linkage cluster distances lie between the single and complete linkage cluster distances it should come as no surprise that the last few clusterings in the hierarchies generated using complete linkage distances look much more like those generated using average linkage distances than those generated using single linkage distances, as confirmed for globular clusters by program 8.
Program 8: Globular Complete Linkage



Program 9 shows the results of using the complete linkage distance measure to generate hierarchical clusterings for nonglobular clusters and, as should be expected, they also look much more like those generated using the average linkage distance than those generated using the single linkage distance.
Program 9: Linear Complete Linkage



Now this implementation is pretty much as efficient as it's possible to get for general purpose hierarchical clusterings but, perhaps surprisingly, for single linkage hierarchical clusterings we can do significantly better. So much so, in fact, that the sorting by size of the clusters in each clustering would be the most computationally expensive part of creating them!
References
[1] A Place In The Hierarchy, www.thusspakeak.com, 2019.
[2] In The Neighbourhood, www.thusspakeak.com, 2019.
[3] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.
[4] A Heap Of Stuff, www.thusspakeak.com, 2019.
[5] Making The Clade, www.thusspakeak.com, 2019.
[6] K Means Business, www.thusspakeak.com, 2015.
[7] Nearest And Dearest, www.thusspakeak.com, 2019.