Cut Price Clusterings

| 0 Comments

Last month[1] we saw how we could efficiently generate hierarchical clusterings[2], which are sequences of sets of clusters[3], which are themselves subsets of a set of data that each contain elements that are similar to each other, such that if a pair of data are in the same clustering at one step then they must be in the same clustering in the next which will always be the case if we move from one step to the next by merging the closest pairs of clusters. Specifically, we used our ak.minHeap implementation of the min-heap structure[4] to cache the distances between clusters, saving us the expense of recalculating them for clusters that don't change from one step in the hierarchy to the next.
Recall that we used three different schemes for calculating the distance between a pair of clusters, the average distance between their members, known as average linkage, the distance between their closest members, known as single linkage, and the distance between their farthest members, known as complete linkage, and that I concluded by noting that our algorithm was about as efficient as possible in general but that there is a much more efficient scheme for single linkage clusterings; efficient enough that sorting the clusters in each clustering by size would be the most costly operation and so in this post we shall implement objects to represent clusterings that don't do that.

Building Unsorted Clustering Objects

In order to do this smoothly we shall need to reproduce the functionality of the ak.clustering, ak.clusterData, ak.clusterMemberships, ak.cluster and ak.clusters objects to represent the clustering, the data, their cluster memberships, the data within each cluster and the set of clusters respectively, whose types are redefined in listing 1.

Listing 1: The Cluster Object Types
ak.CLUSTERING_T = 'ak.clustering';
ak.CLUSTER_DATA_T = 'ak.clusterData';
ak.CLUSTER_MEMBERSHIPS_T = 'ak.clusterMemberships';
ak.CLUSTER_T = 'ak.cluster';
ak.CLUSTERS_T = 'ak.clusters';

Listing 2 shows how we shall do this for the cluster data with the rawClusterData object which is distinguished from the clusterData object by having a SUB property of 'raw'.

Listing 2: The Raw Cluster Data
function RawClusterData(){}
RawClusterData.prototype = {
 TYPE: ak.CLUSTER_DATA_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawClusterData(arg) {
 var data = new RawClusterData();
 var state = [];

 constructors[ak.CLUSTER_DATA_T][ak.nativeType(arg)](state, arg);

 data.size = function()  {return state.length;};
 data.at   = function(i) {return state[i];};

 data.toArray  = function() {return state.slice(0);};
 data.toString = function() {return '{'+state.toString()+'}';};

 data.toExponential = function(d) {
  return '{'+state.map(function(x){return x.toExponential(d);})+'}';
 };

 data.toFixed = function(d) {
  return '{'+state.map(function(x){return x.toFixed(d);})+'}';
 };

 data.toPrecision = function(d) {
  return '{'+state.map(function(x){return x.toPrecision(d);})+'}';};
 }

 return Object.freeze(data);
}

As with the original, this defers to a constructors object to initialise its state and defines the size and at methods to retrieve its size and access its members together with the toArray, toString, toExponential, toFixed and toPrecision methods to convert it to an array and to the various JavaScript string representations of numbers respectively.
Listing 3 gives its constructors which simply copy the elements of their arguments into the state array.

Listing 3: The Cluster Data Constructors
var constructors = {};
constructors[ak.CLUSTER_DATA_T] = {};

constructors[ak.CLUSTER_DATA_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = arr[i];
};

constructors[ak.CLUSTER_DATA_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = obj.at(i);
};

Next we have the raw cluster memberships, defined in listing 4, which also have a SUB property of 'raw' to distinguish them from the original type.

Listing 4: The Raw Cluster Memberships
function RawClusterMemberships(){}
RawClusterMemberships.prototype = {
 TYPE: ak.CLUSTER_MEMBERSHIPS_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawClusterMemberships(arg) {
 var m = new RawClusterMemberships();
 var state = [];

 constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.nativeType(arg)](state, arg);

 m.size = function()  {return state.length;};
 m.at   = function(i) {return state[i];};

 m.toArray  = function() {return state.slice(0);};
 m.toString = function() {
  return '{'+state.map(function(x){return x.toFixed(0);})+'}';
 };

 return Object.freeze(m);
}

Listing 5 gives its constructors which also simply copy the elements of its arguments into its state, coercing them to numbers since they should be the indices of the cluster that each datum is a member of.

Listing 5: The Cluster Memberships Constructors
constructors[ak.CLUSTER_MEMBERSHIPS_T] = {};

constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(arr[i]);
};

constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(obj.at(i));
};

The implementation of cluster objects, containing the indices of the data that are members of them, is almost identical, as shown by listing 6.

Listing 6: The Raw Cluster
function RawCluster(){}
RawCluster.prototype = {
 TYPE: ak.CLUSTER_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawCluster(arg) {
 var c = new RawCluster();
 var state = [];

 constructors[ak.CLUSTER_T][ak.nativeType(arg)](state, arg);

 c.size = function()  {return state.length;};
 c.at   = function(i) {return state[i];};

 c.toArray  = function() {return state.slice(0);};
 c.toString = function() {
  return '{'+state.map(function(x){return x.toFixed(0);})+'}';
 };

 return Object.freeze(c);
}

Its constructors are also virtually the same, as illustrated by listing 7.

Listing 7: The Cluster Constructors
constructors[ak.CLUSTER_T] = {};

constructors[ak.CLUSTER_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(arr[i]);
};

constructors[ak.CLUSTER_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(obj.at(i));
};

The rawClusters type given in listing 8 implements a collection of rawCluster object with the usual size and at methods and toArray and toString conversions.

Listing 8: The Raw Clusters
function RawClusters(){}
RawClusters.prototype = {
 TYPE: ak.CLUSTERS_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawClusters(arg) {
 var c = new RawClusters();
 var state = [];

 constructors[ak.CLUSTERS_T][ak.nativeType(arg)](state, arg);

 c.size = function()  {return state.length;};
 c.at   = function(i) {return state[i];};

 c.toArray  = function() {return state.map(function(x){return x.toArray();});};
 c.toString = function() {
  return '{'+state.map(function(x){return x.toString();})+'}';
 };

 return Object.freeze(c);
}

Its constructors simply iterate over the elements of their argument populating its state with rawCluster objects constructed with them, as shown by listing 9.

Listing 9: The Clusters Constructors
constructors[ak.CLUSTERS_T] = {};

constructors[ak.CLUSTERS_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = rawCluster(arr[i]);
};

constructors[ak.CLUSTERS_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = rawCluster(obj.at(i));
};

Everything so far has been trivial since the type to represent the unsorted clustering will be responsible for ensuring that the clusters' elements and the data's memberships are consistent, although its definition, given in listing 10, is pretty simple.

Listing 10: The Raw Clustering
function RawClustering(){}
RawClustering.prototype = {
 TYPE: ak.CLUSTERING_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

ak.rawClustering = function() {
 var c = new RawClustering();
 var arg0 = arguments[0];

 constructors[ak.nativeType(arg0)](c, arg0, arguments);
 return Object.freeze(c);
};

The real work takes place in the constructors, the first of which is for an array argument and is defined in listing 11.

Listing 11: The Clustering Array Constructor
constructors[ak.ARRAY_T] = function(state, arr, args) {
 var data = args[1];

 arr = arr.slice(0)
 if(ak.nativeType(arr[0])!==ak.ARRAY_T) fromMemberships(state, arr);
 else fromClusters(state, arr);

 if(ak.nativeType(data)!==ak.UNDEFINED_T) {
  state.data = rawClusterData(data);
  if(state.data.size()!==state.memberships.size()) {
   throw new Error('data/memberships size mismatch in ak.rawClustering');
  }
 }
};

Here we're assuming that the elements of arr are the identities of the clusters the data belong to if they're not arrays and that they're array of the indices of the data belonging to each cluster otherwise. If the second argument is defined then we assume that it's the data that's being clustered and, after constructing a rawClusterData object with it, require that it has the same size as the cluster memberships.

The second constructor takes an object as an argument and assumes that is has clusters and/or a memberships properties or functions that define the clusters and the data's memberships respectively. If both are defined then we check that they are consistent with the validClustering helper function, again checking that the data are also consistent with the memberships if they are provided.

Listing 12: The Clustering Object Constructor
constructors[ak.OBJECT_T] = function(state, obj) {
 var c = obj.clusters;
 var m = obj.memberships;
 var d = obj.data;

 if(ak.nativeType(c)===ak.FUNCTION_T) c = c();
 if(ak.nativeType(m)===ak.FUNCTION_T) m = m();
 if(ak.nativeType(d)===ak.FUNCTION_T) d = d();

 if(ak.nativeType(c)!==ak.UNDEFINED_T) c = rawClusters(c);
 if(ak.nativeType(m)!==ak.UNDEFINED_T) m = rawClusterMemberships(m);

 if((!c && !m) || (c && m && !validClustering(c, m))) {
  throw new Error('invalid clustering object in ak.rawClustering');
 }

 if(m) fromMemberships(state, m.toArray());
 else  fromClusters(state, c.toArray());

 if(ak.nativeType(d)!==ak.UNDEFINED_T) {
  state.data = rawClusterData(d);
  if(state.data.size()!==state.memberships.size()) {
   throw new Error('data/memberships size mismatch in ak.rawClustering');
  }
 }
};

When constructing from the cluster memberships of the data we don't want the clusters' identities to have gaps and so we create a mapping from those that they have been given to their first appearance in the memberships array m using the ids object in the fromMemberships function defined in listing 13.

Listing 13: Constructing From Memberships
function fromMemberships(state, m) {
 var n = m.length;
 var ids = {};
 var id = 0;
 var i, mi, c;

 for(i=0;i<n;++i) {
  mi = Number(m[i]);
  if(mi!==ak.floor(mi) || mi<0) {
   throw new Error('invalid membership in ak.rawClustering');
  }
  mi = '#'+mi.toFixed(0);
  if(isNaN(ids[mi])) ids[mi] = id++;
 }
 for(i=0;i<n;++i) m[i] = ids['#'+m[i].toFixed(0)];

 c = new Array(id);
 for(i=0;i<id;++i) c[i] = [];
 for(i=0;i<n;++i) c[m[i]].push(i);

 state.clusters = rawClusters(c);
 state.memberships = rawClusterMemberships(m);
}

This uses JavaScript's array-like object member access using strings to associate each member's cluster identifier with the first position that it occurs in, replacing them with those positions and finally creating an array of arrays containing the indices of the data that belong to each cluster to initialise the state object's clusters.

When constructing from clusters we first check that every element of the c array is an array, adding together their lengths as we do so. Then we iterate over the clusters to build the memberships array, checking that their elements are valid indices of the data and that no datum appears in more than one cluster, as shown by listing 14.

Listing 14: Constructing From Clusters
function fromClusters(state, c) {
 var nc = c.length;
 var nm = 0;
 var i, j, m, n, ci, cij;

 for(i=0;i<nc;++i) {
  ci = c[i];
  if(ak.nativeType(ci)!==ak.ARRAY_T) {
   throw new Error('invalid cluster in ak.rawClustering');
  }
  nm += ci.length;
 }
 m = new Array(nm);

 for(i=0;i<nc;++i) {
  ci = c[i];
  n = ci.length;
  for(j=0;j<n;++j) {
   cij = Number(ci[j]);
   if(cij!==ak.floor(cij) || cij<0 || cij>=nm || !isNaN(m[cij])) {
    throw new Error('invalid member in ak.rawClustering');
   }
   m[cij] = i;
  }
 }

 state.clusters = rawClusters(c);
 state.memberships = rawClusterMemberships(m);
}

Finally, to check that the clusters and memberships in the object constructor are consistent we first check that the latter contain valid cluster indices and populate the assigned array with false values so that we can check that each only appears in a single cluster whilst we iterate over them checking that they each contain valid data indices, as demonstrated by listing 15.

Listing 15: Checking The Clustering
function validClustering(c, m) {
 var nc = c.size();
 var nm = m.size();
 var assigned = new Array(nm);
 var i, j, mi, ci, cin, cij;

 for(i=0;i<nm;++i) {
  mi = m.at(i);
  if(mi!==ak.floor(mi) || mi<0 || mi>=nc) return false;
  assigned[i] = false;
 }

 for(i=0;i<nc;++i) {
  ci = c.at(i);
  cin = ci.size();
  for(j=0;j<cin;++j) {
   cij = ci.at(j);
   if(cij!==ak.floor(cij) || cij<0 || cij>nm || m.at(cij)!==i || assigned[cij]) {
    return false;
   }
   assigned[cij] = true;
  }
 }

 return true;
}

Note that you can find all of these functions in RawClustering.js.

Building Unsorted Clusterings Objects

Now since our ak.clusterings type converts the elements of its constructor array argument to ak.clustering objects we'll need to create a replacement for it that converts them to ak.rawClustering objects instead, starting by redefining its type in listing 16.

Listing 16: The Clusterings Type
ak.CLUSTERINGS_T = 'ak.clusterings';

As before, we shan't want to keep multiple copies of the clustered data and so recreate the rawClusterData type so that we can give every clustering in the clusterings a reference to the same object. This is identical to the one we defined in listing 2, as shown by listing 17.

Listing 17: The Raw Cluster Data
function RawClusterData(){}
RawClusterData.prototype = {
 TYPE: ak.CLUSTER_DATA_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawClusterData(arg) {
 var data = new RawClusterData();
 var state = [];

 constructors[ak.CLUSTER_DATA_T][ak.nativeType(arg)](state, arg);

 data.size = function()  {return state.length;};
 data.at   = function(i) {return state[i];};

 data.toArray  = function() {return state.slice(0);};
 data.toString = function() {return '{'+state.toString()+'}';};

 data.toExponential = function(d) {
  return '{'+state.map(function(x){return x.toExponential(d);})+'}';};
 }

 data.toFixed = function(d) {
  return '{'+state.map(function(x){return x.toFixed(d);})+'}';
 };

 data.toPrecision = function(d) {
  return '{'+state.map(function(x){return x.toPrecision(d);})+'}';
 };

 return Object.freeze(data);
}

Listing 18 shows that its constructors are also the same as those implemented in listing 3.

Listing 18: The Cluster Data Constructors
var constructors = {};
constructors[ak.CLUSTER_DATA_T] = {};

constructors[ak.CLUSTER_DATA_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = arr[i];
};

constructors[ak.CLUSTER_DATA_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = obj.at(i);
};

The ak.rawClusterings type defined in listing 19 is the replacement for ak.clusterings and also has a SUB property of 'raw' to allow them to be distinguished from each other.

Listing 19: The Raw Clusterings
function RawClusterings(){}
RawClusterings.prototype = {
 TYPE: ak.CLUSTERINGS_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

ak.rawClusterings = function() {
 var c = new RawClusterings();
 var state = [];
 var arg0 = arguments[0];
 var i;

 constructors[ak.nativeType(arg0)](state, arg0, arguments);

 c.size = function() {return state.length;};
 c.at = function(i) {return state[i];};
 c.toArray = function() {return state.slice(0);};

 return Object.freeze(c);
};

Its constructors, given in listing 20, allow us to initialise the clusterings with an array or an object containing objects representing each clustering and an optional second argument representing the clustered data.

Listing 20: The Clusterings Constructors
constructors[ak.ARRAY_T] = function(state, arr, args) {
 var arg1 = args[1];
 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arr, arg1);
};

constructors[ak.ARRAY_T][ak.OBJECT_T] = function(state, arr, data) {
 rawClusterings(state, arr, rawClusterData(data));
};

constructors[ak.ARRAY_T][ak.ARRAY_T] = function(state, arr, data) {
 rawClusterings(state, arr, rawClusterData(data));
};

constructors[ak.ARRAY_T][ak.UNDEFINED_T] = function(state, arr) {
 rawClusterings(state, arr, firstData(arr));
};

constructors[ak.OBJECT_T] = function(state, obj, args) {
 var arg1 = args[1];
 var n = obj.size;
 var arr, i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 arr = new Array(n);
 for(i=0;i<n;++i) arr[i] = obj.at(i);
 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arr, arg1);
};

Note that if the first argument is an object then we assume that it has a size property or method giving the number of clusterings and an at method to access them which we use to read them into an array and forward to the array constructors. Furthermore, if the data argument is undefined then we search for the first clustering containing data using the firstData function defined in listing 21.

Listing 21: Finding The Cluster Data
function firstData(arr) {
 var n = arr.length;
 var i = 0;
 var t = ak.UNDEFINED_T;
 var data;

 while(i<n && t===ak.UNDEFINED_T) {
  t = ak.nativeType(data=arr[i++].data);
  if(t===ak.FUNCTION_T) t = ak.nativeType(data=data());
 }
 return t===ak.UNDEFINED_T ? data : rawClusterData(data);
}

Once the data, if defined, have been identified the constructors call the rawClusterings function given in listing 22 to initialise the state.

Listing 22: Initialising The Clusterings
function rawClusterings(state, arr, data) {
 var n = arr.length;
 var i;

 state.length = n;
 if(n>0) state[0] = rawClustering(arr[0], data);
 for(i=1;i<n;++i) {
  state[i] = rawClustering(arr[i], data);
  if(state[i].memberships.size()!==state[0].memberships.size()) {
   throw new Error('inconsistent memberships in ak.rawClusterings');
  }
 }
}

This simply iterates over the array, calling the rawClustering function defined in listing 23 to create the clusterings and checks that their memberships are the same size.

Listing 23: Creating A Clustering
function RawClustering(){}
RawClustering.prototype = {
 TYPE: ak.CLUSTERING_T, SUB: 'raw', valueOf: function(){return ak.NaN;}
};

function rawClustering(c, d) {
 var result;

 if(ak.nativeType(c)===ak.ARRAY_T) {
  c = ak.rawClustering(c);
 }
 else {
  if(!matchesData(c.data, d)) {
   throw new Error('mismatched data in ak.rawClusterings');
  }
  c = ak.rawClustering({memberships: c.memberships, clusters: c.clusters});
 }
 if(ak.nativeType(d)!==ak.UNDEFINED_T && d.size()!==c.memberships.size()) {
  throw new Error('data/memberships size mismatch in ak.rawClusterings');
 }

 result = new RawClustering();
 result.memberships = c.memberships;
 result.clusters = c.clusters;
 result.data = d;
 return Object.freeze(result);
}

Here we're creating an ak.rawClustering object from c and copying its memberships and clusters into a new object with the same type and a SUB property of 'raw', giving it a reference to the common cluster data d and, if c isn't an array, checking that its data property is consistent with that common data using the matchesData function defined in listing 24.

Listing 24: Testing Data Consistency
function matchesData(d1, d0) {
 var t = ak.nativeType(d1);
 if(t===ak.FUNCTION_T) t = ak.nativeType(d1=d1());

 switch(t) {
  case ak.UNDEFINED_T: return true;
  case ak.ARRAY_T:     return matchesArray(d1, d0);
  case ak.OBJECT_T:    return matchesObject(d1, d0);
  default:             throw new Error('invalid data in ak.rawClusterings');
 }
}

This treats undefined data as being consistent with any data and otherwise forwards to the matchesArray and matchesObject functions, given in listing 25, for data represented by arrays and objects respectively, throwing an exception for any other type.

Listing 25: Checking Array And Object Data
function matchesArray(d1, d0) {
 var n0 = d0.size();
 var i = 0;

 if(d1.length!==n0) return false;
 while(i<n0 && ak.eq(d1[i], d0.at(i))) ++i;
 return i===n0;
}

function matchesObject(d1, d0) {
 var n0 = d0.size();
 var n1 = d1.size;
 var i = 0;

 n1 = ak.nativeType(n1)===ak.FUNCTION_T ? Number(n1()) : Number(n1);
 if(n1!==n0) return false;
 while(i<n0 && ak.eq(d1.at(i), d0.at(i))) ++i;
 return i===n0;
}

These both simply check that the data are the same size and then iteratively compare them using our overloaded ak.eq equality comparison.

Now that we've have this less computationally expensive way to represent hierarchical clusterings, which can be found in RawClustering.js, we're ready to take a look at the more efficient algorithm for the generation of single linkage hierarchical clusterings and we shall do so in the next post.

References

[1] Racing Up The Hierarchy, www.thusspakeak.com, 2019.

[2] Making The Clade, www.thusspakeak.com, 2019.

[3] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.

[4] A Heap Of Stuff, www.thusspakeak.com, 2019.

Leave a comment