Making The Clade

| 0 Comments

We have so far seen a couple of schemes for identifying clusters in sets of data which are subsets whose members are in some sense similar to each other. Specifically, we have looked at the \(k\) means[1] and the shared near neighbours[2] algorithms which implicitly define clusters by the closeness of each datum to the average of each cluster and by their closeness to each other respectively.
Note that both of these algorithms use a heuristic, or rule of thumb, to assign data to clusters, but there's another way to construct clusterings; define a heuristic to measure how close to each other a pair of clusters are and then, starting with each datum in a cluster of its own, progressively merge the closest pairs until we end up with a single cluster containing all of the data. This means that we'll end up with a sequence of clusterings and so before we can look at such algorithms we'll need a structure to represent them.

Clusterings

The ak.clusterings function is used to create a sequence of clusterings using our usual technique of dispatching to members of a constructors object to initialise its state and providing the size, at and toArray methods to return the number of clusterings, the \(i^\mathrm{th}\) ak.clustering object[3] in the sequence and convert it into an array, respectively.

Listing 1: ak.clusterings
ak.CLUSTERINGS_T = 'ak.clusterings';

function Clusterings(){}

Clusterings.prototype = {
 TYPE: ak.CLUSTERINGS_T, valueOf: function(){return ak.NaN;}
};

ak.clusterings = function() {
 var c = new Clusterings();
 var state = [];
 var arg0 = arguments[0];
 var i;

 constructors[ak.nativeType(arg0)](state, arg0, arguments);

 c.size = function() {return state.length;};
 c.at = function(i) {return state[i];};
 c.toArray = function() {return state.slice(0);};

 return Object.freeze(c);
};

var constructors = {};

The constructors themselves allow ak.clusterings objects to be initialised with an array of clusterings or with an object having a size property or method and an at method and, in both cases, with an optional set of data, as illustrated by listing 2.

Listing 2: The ak.clusterings Constructors
constructors[ak.ARRAY_T] = function(state, arr, args) {
 var arg1 = args[1];
 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arr, arg1);
};

constructors[ak.ARRAY_T][ak.OBJECT_T] = function(state, arr, data) {
 clusterings(state, arr, clusterData(data));
};

constructors[ak.ARRAY_T][ak.ARRAY_T] = function(state, arr, data) {
 clusterings(state, arr, clusterData(data));
};

constructors[ak.ARRAY_T][ak.UNDEFINED_T] = function(state, arr) {
 clusterings(state, arr, firstData(arr));
};

constructors[ak.OBJECT_T] = function(state, obj, args) {
 var arg1 = args[1];
 var n = obj.size;
 var arr, i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 arr = new Array(n);
 for(i=0;i<n;++i) arr[i] = obj.at(i);
 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arr, arg1);
};

Note that when initialising with an object we simply read its clusterings into an array and initialises it with that instead, passing on the optional data argument. If that argument is an array or an object then we use it to construct a cluster data object using the clusterData function and if it's undefined then we search for the first cluster data object in the clusterings with firstData. Finally, we initialise the sequence of clusterings by passing the array and the data to clusterings.

The clusterData function is simply a copy of the one used by ak.clustering and so creates cluster data objects that are indistinguishable from its. In particular we are setting the type of the object that it returns to ak.CLUSTER_DATA_T, dispatching to the constructors object to initialise it and giving it size, at, toArray, toString, toExponential, toFixed and toPrecision methods that return the number of data, the \(i^\mathrm{th}\) datum, an array of the data and string representations of them respectively, as shown by listing 3.

Listing 3: The clusterData Function
function ClusterData(){}

ClusterData.prototype = {
 TYPE: ak.CLUSTER_DATA_T, valueOf: function(){return ak.NaN;}
};

function clusterData(arg) {
 var data = new ClusterData();
 var state = [];

 constructors[ak.CLUSTER_DATA_T][ak.nativeType(arg)](state, arg);

 data.size = function()  {return state.length;};
 data.at   = function(i) {return state[i];};

 data.toArray  = function() {return state.slice(0);};
 data.toString = function() {return '{'+state.toString()+'}';};

 data.toExponential = function(d) {
  return '{'+state.map(function(x){return x.toExponential(d);})+'}';
 };

 data.toFixed = function(d) {
  return '{'+state.map(function(x){return x.toFixed(d);})+'}';
 };

 data.toPrecision = function(d) {
  return '{'+state.map(function(x){return x.toPrecision(d);})+'}';
 };

 return Object.freeze(data);
}

The constructors, given in listing 4, allow the cluster data to be initialised either from an array or from an object that supports a size property or method and an at method.

Listing 4: The clusterData Constructors
constructors[ak.CLUSTER_DATA_T] = {};

constructors[ak.CLUSTER_DATA_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = arr[i];
};

constructors[ak.CLUSTER_DATA_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = obj.at(i);
};

The firstData function simply iterates over the array of clusterings looking for the first containing data, constructing a clusterData object with it if it finds one and returning undefined if it doesn't, as shown by listing 5.

Listing 5: Finding The First Defined Data
function firstData(arr) {
 var n = arr.length;
 var i = 0;
 var t = ak.UNDEFINED_T;
 var data;

 while(i<n && t===ak.UNDEFINED_T) {
  t = ak.nativeType(data=arr[i++].data);
  if(t===ak.FUNCTION_T) t = ak.nativeType(data=data());
 }
 return t===ak.UNDEFINED_T ? data : clusterData(data);
}

Now that we have an array of clusterings and possibly a cluster data object to go with them we're ready to initialise the ak.clusterings object with the clusterings function defined in listing 6.

Listing 6: Initialising The Clusterings
function clusterings(state, arr, data) {
 var n = arr.length;
 var i;

 state.length = n;
 if(n>0) state[0] = clustering(arr[0], data);
 for(i=1;i<n;++i) {
  state[i] = clustering(arr[i], data);
  if(state[i].memberships.size()!==state[0].memberships.size()) {
   throw new Error('inconsistent memberships in ak.clusterings');
  }
 }
}

This simply constructs each clustering with the clustering function, defined in listing 7, checking that their memberships have the same size.

Listing 7: Constructing A Clustering
function Clustering(){}

Clustering.prototype = {
 TYPE: ak.CLUSTERING_T, valueOf: function(){return ak.NaN;}
};

function clustering(c, d) {
 var result;

 if(ak.nativeType(c)===ak.ARRAY_T) {
  c = ak.clustering(c);
 }
 else {
  if(!matchesData(c.data, d)) {
   throw new Error('mismatched data in ak.clusterings');
  }
  c = ak.clustering({memberships: c.memberships, clusters: c.clusters});
 }
 if(ak.nativeType(d)!==ak.UNDEFINED_T && d.size()!==c.memberships.size()) {
  throw new Error('data/memberships size mismatch in ak.clusterings');
 }

 result = new Clustering();
 result.memberships = c.memberships;
 result.clusters = c.clusters;
 result.data = d;
 return Object.freeze(result);
}

This duplicates the behaviour of ak.clustering by copying its memberships and clusters properties into a new object that has the ak.CLUSTERING_T type but, crucially, ensuring that every instance's data property is a reference to the same clusterData object, saving the expense of keeping multiple copies of it.
Of course, we need to be sure that the clustering c is consistent with that data, if defined, which we do with the matchesData function in listing 8 if it isn't an array, and by checking that the memberships of the ak.clustering that we construct with it is the same size.

Listing 8: Checking That The Data Are Consistent
function matchesData(d1, d0) {
 var t = ak.nativeType(d1);
 if(t===ak.FUNCTION_T) t = ak.nativeType(d1=d1());

 switch(t) {
  case ak.UNDEFINED_T: return true;
  case ak.ARRAY_T:     return matchesArray(d1, d0);
  case ak.OBJECT_T:    return matchesObject(d1, d0);
  default:             throw new Error('invalid data in ak.clusterings');
 }
}

Here we consider undefined data as consistent with any data and otherwise forward to the matchesArray or the matchesObject functions if the data being checked is an array or an object respectively, with any other type being an error. Note that it isn't possible for d0 to be undefined if d1 isn't since it would either have been defined explicitly in the ak.clusterings constructor or be set to the first defined data member of the clusterings that d1 was taken from.
Those functions simply check that the sizes of the data are the same and that they compare equal according to our overloaded ak.eq function, as shown by listing 9.

Listing 9: Matching Arrays And Objects To The Data
function matchesArray(d1, d0) {
 var n0 = d0.size();
 var i = 0;

 if(d1.length!==n0) return false;
 while(i<n0 && ak.eq(d1[i], d0.at(i))) ++i;
 return i===n0;
}

function matchesObject(d1, d0) {
 var n0 = d0.size();
 var n1 = d1.size;
 var i = 0;

 n1 = ak.nativeType(n1)===ak.FUNCTION_T ? Number(n1()) : Number(n1);
 if(n1!==n0) return false;
 while(i<0 && ak.eq(d1.at(i), d0.at(i))) ++i;
 return i===n0;
}

Program 1 demonstrates the how to use the ak.clusterings function to create a sequence of clustering from a set of arrays containing the ids of the clusters that each datum belongs to.

Program 1: Using ak.clusterings

Note that you can find the definition of ak.clusterings in Clusterings.js.

Clades

If the clusterings were constructed by successive merging of clusterings then they will form a hierarchy, in the sense that if two data are in the same cluster at one level then they must also be in the same cluster in the next. We can represent such hierarchies with trees in which the leaf nodes contain single elements and each node's parents contain the union of the elements in their children.
For example, figure 1 shows the tree associated with the clusterings in program 1.

Figure 1: A Hierarchy Of Clusterings
0, 1, 2, 3, 4, 5, 6, 7
0, 2, 3, 6 1, 4, 5, 7
0, 2, 3, 6 1, 4 5, 7
0, 2, 3, 6 1, 4 5 7
0, 3 2, 6 1, 4 5 7
0, 3 2 6 1, 4 5 7
0 3 2 6 1 4 5 7

Trees of this form are frequently used in biology to represent the evolutionary relationships between species and their common ancestors. In this context they are known as clades, for which I have named the ak.clade type defined in listing 10.

Listing 10: ak.clade
ak.CLADE_T = 'ak.clade';

function Clade(){}

Clade.prototype = {
 TYPE: ak.CLADE_T, valueOf: function(){return ak.NaN;}
};

ak.clade = function(clusterings) {
 var n, clustering, clusters, cluster;

 if(ak.type(clusterings)!==ak.CLUSTERINGS_T) {
  throw new Error('invalid argument in ak.clade');
 }

 n = clusterings.size();
 if(!(n>0)) throw new Error('empty clusterings in ak.clade');

 clustering = clusterings.at(n-1);
 clusters = clustering.clusters;
 if(clusters.size()!==1) {
  throw new Error('non-hierarchical clusterings in ak.clade');
 }
 cluster = clusters.at(0);
 return clade(clusterings, n-1, cluster, clustering.data, new Array(cluster.size()));
};

This first checks that its argument is a non-empty ak.clusterings object whose last ak.clustering has a single cluster before forwarding to the clade helper function, given in listing 11, to construct the clade.

Listing 11: The clade Helper Function
function clade(clusterings, level, cluster, data, check, parent) {
 var c = new Clade();
 var clades, n, ni, nj, i, j, k;

 c.cluster = cluster;
 c.data = data;
 c.parent = parent;

 if(level>0) {
  clades = children(clusterings, level-1, cluster, data, check, c);

  n = check.length;
  for(i=0;i<n;++i) check[i] = false;

  ni = cluster.size();
  for(i=0;i<ni;++i) {
   k = cluster.at(i);
   if(k<0 || k>=n) throw new Error('invalid cluster in ak.clade');
   check[k] = true;
  }

  ni = clades.size();
  for(i=0;i<ni;++i) {
   cluster = clades.at(i).cluster;
   nj = cluster.size();
   for(j=0;j<nj;++j) {
    k = cluster.at(j);
    if(k<0 || k>=n) throw new Error('invalid cluster in ak.clade');
    if(!check[k]) throw new Error('non-hierarchical clusterings in ak.clade');
    check[k] = false;
   }
  }

  c.children = clades;
  c.toArray  = clades.toArray;
  c.toString = function() {return toStringArray(c, []).join('');};
 }
 else {
  c.toArray  = cluster.toArray;
  c.toString = cluster.toString;
 }

 return Object.freeze(c);
}

This function first sets the clade's associated cluster, data and parent to those passed to it.
Then, if the clade isn't a leaf node then the level will be positive and we find its child clades with the unimaginatively named children function. To test that the clusterings represent a valid hierarchy we first set each element in check to false, then set those at the cluster member's ids to true, throwing if they are less than zero or not less than the last cluster's size. Then we iterate over the clusters associated with the children, throwing if any of their member's ids are outside of the valid range or if they contain a member whose id wasn't in the check list, setting its values to false as we go to ensure that we can tell if a member turns up twice in the children's clusters. If the cluster passes these tests then we set the clade's children to clades, its toArray method to that of the clades and define its toString method using the helper function toStringArray.
Finally, if the clade is a leaf node then we simply make its toArray and toString methods equal to those of its cluster.

The children function is defined in listing 12 and creates an object of type ak.CLADES_T.

Listing 12: The children Function
ak.CLADES_T = 'ak.clades';

function Clades(){}

Clades.prototype = {
 TYPE: ak.CLADES_T, valueOf: function(){return ak.NaN;}
};

function children(clusterings, level, cluster, data, check, parent) {
 var c = new Clades();
 var a = [];
 var clustering = clusterings.at(level);
 var memberships = clustering.memberships;
 var clusters = clustering.clusters;
 var n = cluster.size();
 var i;

 for(i=0;i<n;++i) a.push(memberships.at(cluster.at(i)));
 a.sort(ak.numberCompare);
 n = ak.unique(a, ak.numberEqual)
 a = a.slice(0, n);
 for(i=0;i<n;++i) {
  a[i] = clade(clusterings, level, clusters.at(a[i]), data, check, parent);
 }

 c.size = function() {return n;};
 c.at = function(i) {return a[i];};

 c.toArray = function() {
  var arr = new Array(n);
  var i;
  for(i=0;i<n;++i) arr[i] = a[i].toArray();
  return arr;
 };

 c.toString = function() {
  var arr = [];
  var i;
  arr.push('{');
  for(i=0;i<n;++i) {
   toStringArray(a[i], arr);
   if(i!==n-1) arr.push(',');
  }
  arr.push('}');
  return arr.join('');
 };

 return Object.freeze(c);
}

This function begins by creating a list of the clusters in the current level that the members of the parent cluster belong to, then sorting them by our ak.numberCompare function[4] and removing any duplicates with ak.unique using the ak.numberEqual equality comparison[5]. It then replaces each with its associated clade by recursively calling the clade function.
Finally, it adds size and at access methods to the ak.clades object and the toArray and toString conversion methods. The former simply recursively creates nested sets of elements by populating an array with the array conversions of the clades. The latter also uses the toStringArray function, given in listing 13, to recursively represent those sets with comma delimited, curly brace enclosed strings, building up its constituent sub-strings in the arr argument and finally joining them when it's complete.

Listing 13: The toStringArray Function
function toStringArray(clade, arr) {
 var children = clade.children;
 var n, i;
 if(ak.type(children)===ak.UNDEFINED_T) {
  arr.push(clade.cluster.toString());
 }
 else {
  n = children.size();
  arr.push('{');
  for(i=0;i<n;++i) {
   toStringArray(children.at(i), arr);
   if(i!==n-1) arr.push(',');
  }
  arr.push('}');
 }
 return arr;
}

Figure 2 shows the result of manually representing a clade with such a string, building it up from the leaves to the root with comma separated lists of each node's children enclosed by curly braces.

Figure 2: A Clade As Nested Sets
{{{{{{{0}, {3}}}, {{{2}}, {{6}}}}}}, {{{{{{1}, {4}}}}}, {{{{{5}}}}, {{{{7}}}}}}}
{{{{{{0}, {3}}}, {{{2}}, {{6}}}}}} {{{{{{1}, {4}}}}}, {{{{{5}}}}, {{{{7}}}}}}
{{{{{0}, {3}}}, {{{2}}, {{6}}}}} {{{{{1}, {4}}}}} {{{{{5}}}}, {{{{7}}}}}
{{{{0}, {3}}}, {{{2}}, {{6}}}} {{{{1}, {4}}}} {{{{5}}}} {{{{7}}}}
{{{0}, {3}}} {{{2}}, {{6}}} {{{1}, {4}}} {{{5}}} {{{7}}}
{{0}, {3}} {{2}} {{6}} {{1}, {4}} {{5}} {{7}}
{0} {3} {2} {6} {1} {4} {5} {7}

Program 2 demonstrates how to construct an ak.clade and the result of converting it and its child clades to strings in a top to bottom then left to right order, which you can compare to the nodes in figure 2 to satisfy yourself that it does so correctly.

Program 2: Using ak.clade

Finally, you can find the definition of ak.clade in Clade.js.

References

[1] K Means Business, www.thusspakeak.com, 2015.

[2] Nearest And Dearest, www.thusspakeak.com, 2019.

[3] Some Of These Things Are Not Like The Others, www.thusspakeak.com, 2015.

[4] We're All Sorted From A To Z, www.thusspakeak.com, 2017.

[5] Let's Talk About Sets, www.thusspakeak.com, 2018.

Leave a comment