A Little Bit Slinky

For several months we've been taking a look at cluster analysis, which seeks to partition sets of data into subsets of similar data, known as clusters[1]. Most recently we have focused our attention on hierarchical clusterings[2], which are sequences of sets of clusters in which pairs of data that belong to the same cluster at one step belong to the same cluster in the next step.
A simple way of constructing them is to initially place each datum in its own cluster and then iteratively merge the closest pairs of clusters in each clustering to produce the next one in the sequence, stopping when all of the data belong to a single cluster. We have considered three ways of measuring the distance between pairs of clusters: the average distance between their members, the distance between their closest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively. We then implemented a reasonably efficient algorithm for generating hierarchical clusterings defined with them[3], using a min-heap structure[4] to cache the distances between clusters.
Finally, I claimed that there is a more efficient algorithm for generating single linkage hierarchical clusterings that would make the sorting of clusters by size in our ak.clustering type too expensive and so last time[5] we implemented the ak.rawClustering type to represent clusterings without sorting their clusters which we shall now use in the implementation of that algorithm.

Single Linkage Distances

Recall that, for sets \(X_i\) and \(X_j\) of the indices of the \(i^\mathrm{th}\) and \(j^\mathrm{th}\) clusters' members within the data set, we expressed the single linkage cluster distance \(\Delta_\alpha\) as
\[ \Delta_\alpha\left(X_i, X_j\right) = \min_{i^\prime \in X_i \wedge j^\prime \in X_j} d_{i^\prime j^\prime} \]
where \(\in\) means within, \(\wedge\) means and, and \(d_{ij}\) are the elements of the distance matrix \(\mathbf{D}\)[6], given by
\[ d_{ij} = d_{ji} = \delta\left(x_i, x_j\right) \]
where \(\delta\) is a function that measures the distances between pairs of the data, typically satisfying
\[ \begin{align*} \delta\left(y, x\right) &= \delta\left(x, y\right)\\ \delta\left(x, x\right) &= 0\\ \delta\left(x, y\right) &> 0 \end{align*} \]
for distinct \(x\) and \(y\), and
\[ \delta\left(x, z\right) \leqslant \delta\left(x, y\right) + \delta\left(y, z\right) \]
for all \(x\), \(y\) and \(z\).
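
For example, for one dimensional data we might simply use the absolute difference between them, which satisfies all of these conditions. The following sketch, in which the name absDelta is mine rather than the ak library's, spot checks them for a few values.

// An illustrative distance function for one dimensional data; the
// absolute difference satisfies symmetry, positivity and the triangle
// inequality.
function absDelta(x, y) {
 return Math.abs(x-y);
}

console.log(absDelta(17, 2)===absDelta(2, 17));             // true
console.log(absDelta(4, 4)===0);                            // true
console.log(absDelta(1, 5)<=absDelta(1, 3)+absDelta(3, 5)); // true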

Listing 1 shows how we implemented the single linkage cluster distance, representing the distance matrix with an ak.matrix object d and the clusters with ak.cluster objects xi and xj.

Listing 1: Single Linkage Distance
function singleLinkage(d) {
 var n;

 if(ak.type(d)!==ak.MATRIX_T || d.rows()!==d.cols()) {
  throw new Error('invalid distance matrix in singleLinkage');
 }
 n = d.rows();

 return function(xi, xj) {
  var ni = xi.size();
  var nj = xj.size();
  var i, j, s;

  i = 0;
  while(i<ni && xi.at(i)<n) ++i;
  if(i<ni) throw new Error('invalid distance matrix in singleLinkage');

  j = 0;
  while(j<nj && xj.at(j)<n) ++j;
  if(j<nj) throw new Error('invalid distance matrix in singleLinkage');

  s = ak.INFINITY;
  for(i=0;i<ni;++i) for(j=0;j<nj;++j) s = Math.min(s, d.at(xi.at(i), xj.at(j)));
  return s;
 };
}
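
To see the closure in action we can drive it with hypothetical stand-ins for ak.cluster objects that simply wrap arrays of indices, since all that it requires of them are the size and at methods. Note that makeCluster is purely illustrative and not part of the ak library.

// An illustrative stand-in for an ak.cluster, exposing the size and at
// methods that the singleLinkage closure relies upon.
function makeCluster(indices) {
 return {
  size: function() {return indices.length;},
  at:   function(i) {return indices[i];}
 };
}

var x = [17, 2, 8, 4, 5, 14, 10, 1];
var d = ak.distanceMatrix(x, x, function(x0, x1) {return Math.abs(x0-x1);});
var delta = singleLinkage(d);

// The closest members of {17, 14} and {8, 10} are 14 and 10.
console.log(delta(makeCluster([0, 5]), makeCluster([2, 6]))); // 4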

Program 1 prints the cluster memberships of the data
\[ X = \{17, 2, 8, 4, 5, 14, 10, 1\} \]
for the hierarchical clustering that results from using this single linkage cluster distance function with ak.hierarchicalClustering, using ak.distanceMatrix to calculate their distance matrix.

Program 1: Single Linkage Clusterings

A Convenient Truth

The key to generating single linkage hierarchical clusterings efficiently lies in a property of their distance measures that we haven't considered.
Suppose that we have three clusters \(X_i\), \(X_j\) and \(X_k\) having \(n_i\), \(n_j\) and \(n_k\) members respectively, which we might represent as the sets of their members' indices with
\[ \begin{align*} X_i &= \left\{i_0, i_1, \dots, i_{n_i-2}, i_{n_i-1}\right\}\\ X_j &= \left\{j_0, j_1, \dots, j_{n_j-2}, j_{n_j-1}\right\}\\ X_k &= \left\{k_0, k_1, \dots, k_{n_k-2}, k_{n_k-1}\right\} \end{align*} \]
Now, let us label the indices of the closest members of \(X_i\) and \(X_j\) with \(i^\prime\) and \(j^\prime\) and those of \(X_i\) and \(X_k\) with \(i^\ast\) and \(k^\ast\) so that
\[ \begin{align*} \Delta_\alpha\left(X_i, X_j\right) &= d_{i^\prime j^\prime}\\ \Delta_\alpha\left(X_i, X_k\right) &= d_{i^\ast k^\ast} \end{align*} \]
If we merge \(X_j\) and \(X_k\) to yield
\[ X_j^\prime = X_j \cup X_k = \left\{j_0, j_1, \dots, j_{n_j-2}, j_{n_j-1}, k_0, k_1, \dots, k_{n_k-2}, k_{n_k-1}\right\} \]
where \(\cup\) stands for the union of the sets on either side of it, then it must be the case that
\[ \Delta_\alpha\left(X_i, X_j^\prime\right) = \min\left(d_{i^\prime j^\prime}, d_{i^\ast k^\ast}\right) \]
since the members of \(X_j^\prime\) are simply the members of \(X_j\) together with those of \(X_k\), and so the closest pair of members of \(X_i\) and \(X_j^\prime\) must be either the closest pair of members of \(X_i\) and \(X_j\) or the closest pair of members of \(X_i\) and \(X_k\), whichever are nearer to each other.
We can therefore calculate the distance from a merged cluster to any other cluster without having to iterate over their members, saving ourselves a lot of work. All that we need to do is redesign our caching strategy to exploit this fact.
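
We can easily convince ourselves of this numerically with a brute force calculation over plain arrays of indices, as done by the following sketch in which the bruteMin helper and the example matrix are mine rather than the ak library's.

// A brute force single linkage distance between clusters represented as
// arrays of indices into a distance matrix represented as an array of
// arrays.
function bruteMin(d, xi, xj) {
 var s = ak.INFINITY;
 var i, j;
 for(i=0;i<xi.length;++i) {
  for(j=0;j<xj.length;++j) s = Math.min(s, d[xi[i]][xj[j]]);
 }
 return s;
}

var d = [
 [0, 3, 5, 2],
 [3, 0, 5, 4],
 [5, 5, 0, 4],
 [2, 4, 4, 0]
];
var xi = [0], xj = [1], xk = [2, 3];

// The distance from a cluster to a merged cluster equals the lesser of
// its distances to the clusters that were merged.
console.log(bruteMin(d, xi, xj.concat(xk)));                     // 2
console.log(Math.min(bruteMin(d, xi, xj), bruteMin(d, xi, xk))); // 2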

Creating And Using The Cache

The first step is to remove the min-heaps from the cache and replace them with a single entry for each datum containing the indices of the closest of the data that come after it in the data set, together with their distance from it, as shown by listing 2.

Listing 2: Flattening The Cache
function initCache(dist) {
 var n = dist.length;
 var cache = new Array(n);
 var i, j, ci, d, cmp;

 for(i=0;i<n;++i) cache[i] = {c:[], d:ak.NaN};
 for(i=0;i<n-1;++i) {
  ci = cache[i];
  for(j=i+1;j<n;++j) {
   d = dist[i][j];
   cmp = ak.floatCompare(d, ci.d);
   if(cmp<0) {ci.c = [j]; ci.d = d;}
   else if(cmp===0) ci.c.push(j);
  }
 }
 return cache;
}

Note that we shall be identifying clusters by their positions in the cache, which will be equivalent to the indices of their first members that we used in our general purpose algorithm provided that we only merge clusters into ones that come before them. Furthermore, we shall interpret NaN distances as meaning incomparable and so we're using our ak.floatCompare comparison function, under which they compare greater than any other number, so that incomparable clusters will be the last to be merged.
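
For reference, the following is a minimal sketch of the ordering that the cache relies upon, under the assumption that ak.floatCompare does no more than order NaNs after every other number; the ak library's actual implementation may differ in its details.

// A sketch of a float comparison that orders NaNs after all other
// numbers, returning a negative number if l comes before r, zero if
// they're equivalent and a positive number otherwise.
function floatCompare(l, r) {
 var ln = isNaN(l), rn = isNaN(r);
 if(ln || rn) return (rn ? 0 : 1)-(ln ? 0 : 1);
 return l<r ? -1 : l>r ? 1 : 0;
}

console.log(floatCompare(1, ak.NaN)<0);    // true
console.log(floatCompare(ak.NaN, 1)>0);    // true
console.log(floatCompare(ak.NaN, ak.NaN)); // 0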

The minDist function given in listing 3 searches for the smallest distance between the clusters, again treating NaN distances as the largest.

Listing 3: The Distance Between The Closest Clusters
function minDist(cache) {
 var n = cache.length;
 var min = ak.NaN;
 var i, d;

 for(i=0;i<n-1;++i) {
  d = cache[i].d;
  if(isNaN(min) || d<min) min = d;
 }
 return min;
}
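
To see these two functions in action, here's a quick run over the same small distance matrix that we used to illustrate the merged cluster distances above, assuming that the ak library is loaded.

var dist = [
 [0, 3, 5, 2],
 [3, 0, 5, 4],
 [5, 5, 0, 4],
 [2, 4, 4, 0]
];
var cache = initCache(dist);

// The closest of the data after datum 0 is datum 3 at a distance of 2,
// which is also the smallest distance between any pair of clusters.
console.log(JSON.stringify(cache[0])); // {"c":[3],"d":2}
console.log(minDist(cache));           // 2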

Once we have the minimum distance we can build a list of clusters to merge by sweeping through the cache and saving every pair of clusters that are separated by that distance, as shown by listing 4.

Listing 4: The List Of Clusters To Merge
function minMappings(min, cache, memberships) {
 var n = cache.length;
 var mappings = [];
 var i, mi, ci, ni, j;

 for(i=0;i<n-1;++i) {
  if(cache[i].d===min) {
   mi = memberships[i];
   ci = cache[i].c;
   ni = ci.length;
   for(j=0;j<ni;++j) mappings.push([mi, ci[j]]);
  }
 }
 return mappings;
}

Note that this list will be sorted, first by the clusters to merge to and then by the clusters to merge from, represented by the first and second elements of the arrays in mappings.

As before, we need to take care to update these mappings to make sure that multiple merges involving the same clusters are properly managed, which is done by the compressMappings function defined in listing 5.

Listing 5: Updating The Mappings
function compressMappings(mappings) {
 var n = mappings.length;
 var i, j, mi, mj;

 for(i=0;i<n-1;++i) {
  mi = mappings[i];
  for(j=i+1;j<n;++j) {
   mj = mappings[j];
   if(mj[0]===mi[1])  mj[0] = mi[0];
   if(mj[1]===mi[1]) {mj[1] = mj[0]; mj[0] = mi[0];}
  }
 }
}

Once again we're simply replacing any cluster index in mappings that occurs after a given mapping and equals the index that the latter is merging from with the index that it is merging to. This time, however, we're making sure that the first index in each mapping is always smaller than the second by swapping them whenever the second is updated, so that we can guarantee that clusters are always merged into those that come first in the cache, as noted above.
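
A concrete example makes this a little easier to follow. Here cluster 2 is due to be merged into both cluster 0 and cluster 1, and compressing the mappings turns the second merge into a merge of cluster 1 into cluster 0.

var mappings = [[0, 2], [1, 2]];
compressMappings(mappings);

// The second mapping's from index matched the first's and so it was
// replaced by that mapping's to index and swapped into first place.
console.log(JSON.stringify(mappings)); // [[0,2],[0,1]]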

The mergeMappings function given in listing 6 uses these functions to create the list of cluster mappings, returning an empty list if none of them are comparable.

Listing 6: Creating The Mappings
function mergeMappings(cache, memberships) {
 var dist = minDist(cache);
 var mappings = isNaN(dist) ? [] : minMappings(dist, cache, memberships);
 if(mappings.length>1) compressMappings(mappings);
 return mappings;
}

To merge the clusters we simply update the cluster memberships of the data using the mappings, setting them all to the same cluster if the list of mappings is empty, and construct an ak.rawClustering with the result, as demonstrated by listing 7.

Listing 7: Merging The Clusters
function mergeClusters(mappings, memberships) {
 var n = memberships.length;
 var m = mappings.length;
 var i, j;

 if(m===0) {
  for(i=0;i<n;++i) memberships[i] = 0;
 }
 else {
  for(i=0;i<m;++i) {
   for(j=0;j<n;++j) {
    if(memberships[j]===mappings[i][1]) memberships[j] = mappings[i][0];
   }
  }
 }
 return ak.rawClustering(memberships);
}
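
For example, given four data that are initially in their own clusters, merging cluster 2 into cluster 0 and cluster 3 into cluster 1 rewrites the memberships in place, as sketched below on the assumption that the ak library is loaded so that ak.rawClustering is available.

var memberships = [0, 1, 2, 3];
var c = mergeClusters([[0, 2], [1, 3]], memberships);

// The memberships have been rewritten in place to reflect the merges.
console.log(JSON.stringify(memberships)); // [0,1,0,1]
console.log(c.clusters.size());           // 2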

Updating The Cache

To update the cache, we first update the distance matrix to reflect the new closest cluster distances. To do so we simply set each element of the row associated with the cluster being merged into to the lesser of that element and the corresponding element of the row associated with the cluster being merged from, as illustrated by listing 8.

Listing 8: Updating The Distance Matrix
function updateDistances(dist, mappings) {
 var n = dist.length;
 var m = mappings.length;
 var i, j, mi, di0, di1;

 for(i=0;i<m;++i) {
  mi = mappings[i];
  di0 = dist[mi[0]];
  di1 = dist[mi[1]];
  for(j=0;j<n;++j) di0[j] = ak.floatCompare(di0[j], di1[j])<=0 ? di0[j] : di1[j];
 }
}

Now there are a couple of subtleties in this function that require explanation.
Firstly, whilst clusters are guaranteed to be merged into those with a lower index in the cache, we didn't sort the mappings after we updated them to properly handle multiple merges and so we cannot assume that we will merge clusters in the order that they appear in it.
Secondly, by noting that
\[ \min\left(x, \min\left(y, z\right)\right) = \min\left(\min\left(x, y\right), \min\left(x, z\right)\right) \]
we can be certain that we will end up with the correct distances for the clusters that we merge into, provided that we update every column of their associated rows in the distance matrix, since the order in which we apply the merges cannot then affect the final results.

The next step is to clear the cache entries for all of the clusters that were merged, as shown by listing 9.

Listing 9: Clearing The Merged Clusters' Cache Entries
function cacheClear(cache, mappings) {
 var m = mappings.length;
 var i, mi, mi0, mi1, ci0, ci1;

 for(i=0;i<m;++i) {
  mi = mappings[i];
  mi0 = mi[0]; ci0 = cache[mi0];
  mi1 = mi[1]; ci1 = cache[mi1];

  ci0.c.length = 0; ci0.d = ak.NaN;
  ci1.c.length = 0; ci1.d = ak.NaN;
 }
}

After this we update the cache entries of the clusters that were merged into using the updated distance matrix with the cacheDistances function defined in listing 10.

Listing 10: Reinitialising The Merged Clusters' Cache Entries
function cacheDistances(cache, mappings, memberships, dist) {
 var n = cache.length;
 var m = mappings.length;
 var i, j, mi, mi0, di0, ci0, mj, di0j, cmp;

 for(i=0;i<m;++i) {
  mi = mappings[i];
  mi0 = mi[0];
  ci0 = cache[mi0];

  if(ci0.c.length===0) {
   di0 = dist[mi0];
   for(j=mi0+1;j<n;++j) {
    mj = memberships[j];
    if(mj>mi0) {
     di0j = di0[j];
     cmp = ak.floatCompare(di0j, ci0.d);
     if(cmp<0) {ci0.c = [mj]; ci0.d = di0j;}
     else if(cmp===0 && ci0.c.indexOf(mj)<0) ci0.c.push(mj);
    }
   }
  }
 }
}

Note that each cluster's cache entry should only contain clusters that come after it and so we only look at the distances to those data that belong to clusters with indices greater than its own.

Finally, the cacheMappings function defined in listing 11 updates the cluster indices in the caches belonging to each cluster to reflect the merges.

Listing 11: Mapping The Cached Cluster Indices
function cacheMappings(cache, mappings) {
 var n = cache.length;
 var m = mappings.length;
 var i, j, mi, mi0, mi1, cj, k;

 for(i=0;i<m;++i) {
  mi = mappings[i];
  mi0 = mi[0];
  mi1 = mi[1];

  for(j=0;j<mi0;++j) {
   cj = cache[j].c;
   k = cj.indexOf(mi1);
   if(k>=0) {
    if(cj.indexOf(mi0)<0) cj[k] = mi0;
    else {cj[k] = cj[cj.length-1]; cj.pop();}
   }
  }
 }
}
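
In isolation the remapping, together with the removal of a reference that would otherwise be duplicated, looks something like the following sketch; in practice it runs after cacheClear and cacheDistances have dealt with the merged clusters' own entries.

// Cluster 0's nearest clusters are 1 and 2 at equal distance. After
// cluster 2 is merged into cluster 1, remapping its index would
// duplicate the reference to cluster 1 and so it is removed instead.
var cache = [{c:[1,2], d:3}, {c:[], d:ak.NaN}, {c:[], d:ak.NaN}];
cacheMappings(cache, [[1, 2]]);
console.log(JSON.stringify(cache[0].c)); // [1]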

These steps for updating the caches are coordinated by the updateCache function given in listing 12.

Listing 12: Updating The Cache
function updateCache(cache, mappings, memberships, dist) {
 cacheClear(cache, mappings);
 cacheDistances(cache, mappings, memberships, dist);
 cacheMappings(cache, mappings);
}

The ak.slinkClustering function defined in listing 13 puts all of this together to yield an implementation of the SLINK[7] algorithm, which generates single linkage hierarchical clusterings with a mere \(O\left(n^2\right)\) operations.

Listing 13: Our SLINK Implementation
ak.slinkClustering = function(data, dist) {
 var clusters = [];
 var t = ak.type(data);
 var n, memberships, i, c, cache, mappings;

 if(t===ak.MATRIX_T) {
  dist = data;
  n = dist.rows();
  if(dist.cols()!==n) {
   throw new Error('invalid distance matrix in ak.slinkClustering');
  }
 }
 else if(t===ak.ARRAY_T) {
  n = data.length;
  if(ak.type(dist)!==ak.MATRIX_T) dist = ak.distanceMatrix(data, data, dist);
  else if(dist.rows()!==n || dist.cols()!==n) {
   throw new Error('invalid distance matrix in ak.slinkClustering');
  }
 }
 else throw new Error('invalid data in ak.slinkClustering');

 dist = dist.toArray();
 memberships = new Array(n);
 for(i=0;i<n;++i) memberships[i] = i;
 c = ak.rawClustering(memberships);
 clusters.push(c);
 cache = initCache(dist);

 while(c.clusters.size()>2) {
  mappings = mergeMappings(cache, memberships);
  c = mergeClusters(mappings, memberships);
  clusters.push(c);
  if(c.clusters.size()>2) {
   updateDistances(dist, mappings);
   updateCache(cache, mappings, memberships, dist);
  }
 }
 if(c.clusters.size()===2) {
  for(i=0;i<n;++i) memberships[i] = 0;
  clusters.push(ak.rawClustering(memberships));
 }
 return t===ak.MATRIX_T ? ak.rawClusterings(clusters)
                        : ak.rawClusterings(clusters, data);
};
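
If you'd like to experiment with it directly, a minimal sketch of constructing single linkage clusterings with ak.slinkClustering in both its array and matrix forms is given below; how you then read the memberships back is a matter of the ak.rawClusterings interface and so is left out here.

var x = [17, 2, 8, 4, 5, 14, 10, 1];
var delta = function(x0, x1) {return Math.abs(x0-x1);};

// Pass the data together with a distance function...
var h1 = ak.slinkClustering(x, delta);

// ...or a precomputed distance matrix on its own.
var h2 = ak.slinkClustering(ak.distanceMatrix(x, x, delta));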

Program 2 compares the result of ak.slinkClustering, which you can find in SlinkClustering.js, to that of ak.hierarchicalClustering using our single linkage distance function for the data set \(X\) that we used in program 1, finally constructing an ak.clusterings object from the former's ak.rawClusterings result to give its memberships the same indices as the latter's, making it easier to see that they are equivalent.

Program 2: General Versus SLINK

I think that we've spent quite enough time on the subject of clustering for now and so we'll take a look at something else in the next post.

References

[1] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.

[2] Making The Clade, www.thusspakeak.com, 2019.

[3] Racing Up The Hierarchy, www.thusspakeak.com, 2019.

[4] A Heap Of Stuff, www.thusspakeak.com, 2019.

[5] Cut Price Clusterings, www.thusspakeak.com, 2019.

[6] In The Neighbourhood, www.thusspakeak.com, 2019.

[7] Sibson, R. SLINK: an optimally efficient algorithm for the single-link cluster method, The Computer Journal, Vol 16, No. 1, British Computer Society, 1973.
