Some Of These Things Are Not Like The Others

| 0 Comments

Figure 1: Bleedin' Obvious Innit?

Figure 2: Perhaps Not!
Of all of the problems that we might attack with numerical methods, one of my favourites is cluster analysis; the task of dividing a set of data into subsets of similar data. Now I'm fully aware that my having favourite applications of numerical methods might lead you to suspect that I live at the most cringingly awkward end of nerd street, but please bear with me; the reason that I find cluster analysis so appealing, and that I hope you will too, is that it is a rare example of a mathematical problem that we struggle to even properly define.

I Knows One When I Sees One

The difficult thing about classifying data by similarity is that whilst it is subjectively obvious it's rather hard to pin down mathematically. For example, we might look at the data points in figure 1 and conclude that a good cluster is one in which the distances between its members are smaller than the distances between them and the members of other clusters. If we were to divide the points into two clusters then we should expect such a measure to be minimised by the obvious clustering of the points at the top left versus those at the bottom right.
Unfortunately, this isn't a particularly robust scheme. For example, if we took hold of the bottom left and top right corners of a graph like figure 1 and stretched it we might transform it into one like figure 2 in which the obvious clusters do not minimise the within versus the between cluster distances.
Even if we accept that the clusters identified by this scheme won't necessarily be those that we would choose, finding the best clustering according to it is a rather computationally expensive task since there are a whopping \(k^n\) ways to assign \(n\) points to \(k\) clusters. Now many of these will have identical memberships but with different labellings, so with careful bookkeeping we can reduce this to \(\frac{1}{k!}k^n\), where the exclamation mark stands for the factorial. This is still a huge number though, so doesn't really improve our prospects.

Why? Because I Say So!

Given these problems, clustering algorithms don't typically work by directly minimising some measure of cluster inefficiency but rather update the cluster memberships according to some heuristic scheme based upon some intuitive notion of similarity. So rather than trying to define exactly what constitutes a cluster we simply define them as that which are identified by this algorithm. Circular reasoning like this is extremely unusual in mathematics and the fact that clustering algorithms employ it is precisely the reason why I think that they are so interesting!

Unfortunately, there's some rather mundane preparation that we must perform before we can embark upon the unbridled thrillfest that I'm sure you shall find our first foray into cluster analysis to be. Specifically we shall need to implement some classes to represent the results of clustering algorithms.

Cluster Memberships

The first way that we might want to represent a clustering is with a mapping from the indices of the data to the indices of the clusters that they are members of, as implemented by the clusterMemberships class given in listing 1.

Listing 1: The clusterMemberships Class
var constructors = {};

ak.CLUSTER_MEMBERSHIPS_T = 'ak.clusterMemberships';

function ClusterMemberships(){}
ClusterMemberships.prototype = {
 TYPE: ak.CLUSTER_MEMBERSHIPS_T, valueOf: function(){return ak.NaN;}
};

function clusterMemberships(arg) {
 var m = new ClusterMemberships();
 var state = [];

 constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.nativeType(arg)](state, arg);

 m.size = function()  {return state.length;};
 m.at   = function(i) {return state[i];};

 m.toArray  = function() {return state.slice(0);};
 m.toString = function() {return '{'+state.map(function(x){return x.toFixed(0);})+'}';};

 return Object.freeze(m);
}

constructors[ak.CLUSTER_MEMBERSHIPS_T] = {};

This more or less follows our usual scheme for objects; there's an object prototype that defines the TYPE member used by our dispatch mechanism[1] and disables default conversion to native types, a size method to give read only access to the number of data, an at method to give read only access to the cluster IDs that each of them belong to and explicit type conversion methods toArray and toString.
However, it is not a member of ak and so users cannot directly create instances of this class. Furthermore, it dispatches construction to constructors[ak.CLUSTER_MEMBERSHIPS_T] rather than constructors itself. Both of these departures from our typical approach are a consequence of the fact that this class is intended to represent a component member of the class that we shall ultimately use to represent a clustering.

We shall support construction from arrays of cluster IDs and objects following the naming conventions of this class, as illustrated by listing 2.

Listing 2: clusterMemberships Constructors
constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(arr[i]);
};

constructors[ak.CLUSTER_MEMBERSHIPS_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(obj.at(i));
};

Note that whilst the cluster IDs are coerced to numbers, we do not check that they are integers. This too is a consequence of the fact that this class is intended to be a component of an ak.clustering object which shall ensure that its elements are both valid and consistent with the other components.

Cluster Contents

The second representation of a clustering is a mapping from the indices of the clusters to sets of indices of the data that they contain. These sets of indices are represented by the cluster class given in listing 3.

Listing 3: The cluster Class
ak.CLUSTER_T = 'ak.cluster';

function Cluster(){}
Cluster.prototype = {TYPE: ak.CLUSTER_T, valueOf: function(){return ak.NaN;}};

function cluster(arg) {
 var c = new Cluster();
 var state = [];

 constructors[ak.CLUSTER_T][ak.nativeType(arg)](state, arg);

 c.size = function()  {return state.length;};
 c.at   = function(i) {return state[i];};

 c.toArray  = function() {return state.slice(0);};
 c.toString = function() {return '{'+state.map(function(x){return x.toFixed(0);})+'}';};

 return Object.freeze(c);
}

constructors[ak.CLUSTER_T] = {};

constructors[ak.CLUSTER_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(arr[i]);
};

constructors[ak.CLUSTER_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = Number(obj.at(i));
};

Given that this class is also merely a component of a clustering it has the similar features to clusterMemberships in that users cannot create instances of it directly, its construction is dispatched to constructors[ak.CLUSTER_T] and it performs no error checking on the elements of the arrays or objects used to initialise it.

Next we need a mapping from cluster indices to objects of this type, provided by clusters in listing 4.

Listing 4: The clusters Class
ak.CLUSTERS_T = 'ak.clusters';

function Clusters(){}
Clusters.prototype = {TYPE: ak.CLUSTERS_T, valueOf: function(){return ak.NaN;}};

function clusters(arg) {
 var c = new Clusters();
 var state = [];

 constructors[ak.CLUSTERS_T][ak.nativeType(arg)](state, arg);

 c.size = function()  {return state.length;};
 c.at   = function(i) {return state[i];};

 c.toArray  = function() {return state.map(function(x){return x.toArray();});};
 c.toString = function() {return '{'+state.map(function(x){return x.toString();})+'}';};

 return Object.freeze(c);
}

constructors[ak.CLUSTERS_T] = {};

constructors[ak.CLUSTERS_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = cluster(arr[i]);
};

constructors[ak.CLUSTERS_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = cluster(obj.at(i));
};

The only significant differences between this and the previous two classes are that the elements of its state are cluster objects that are constructed from the elements of its array or object arguments and that the toArray and toString methods recursively apply those of its elements in the creation of their results.

Cluster Data

Finally, it will often be convenient to include the data that were clustered in the results of a clustering and listing 5 shows the clusterData class that we shall use for that purpose.

Listing 5: The clusterData Class
ak.CLUSTER_DATA_T = 'ak.clusterData';

function ClusterData(){}
ClusterData.prototype = {TYPE: ak.CLUSTER_DATA_T, valueOf: function(){return ak.NaN;}};

function clusterData(arg) {
 var data = new ClusterData();
 var state = [];

 constructors[ak.CLUSTER_DATA_T][ak.nativeType(arg)](state, arg);

 data.size = function()  {return state.length;};
 data.at   = function(i) {return state[i];};

 data.toArray  = function() {return state.slice(0);};
 data.toString = function() {return '{'+state.toString()+'}';};

 data.toExponential = function(d) {
  return '{'+state.map(function(x){return x.toExponential(d);})+'}';
 };

 data.toFixed = function(d) {
  return '{'+state.map(function(x){return x.toFixed(d);})+'}';
 };

 data.toPrecision = function(d) {
  return '{'+state.map(function(x){return x.toPrecision(d);})+'}';
 };

 return Object.freeze(data);
}

constructors[ak.CLUSTER_DATA_T] = {};

constructors[ak.CLUSTER_DATA_T][ak.ARRAY_T] = function(state, arr) {
 var n = arr.length;
 var i;

 state.length = n;
 for(i=0;i<n;++i) state[i] = arr[i];
};

constructors[ak.CLUSTER_DATA_T][ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? Number(n()) : Number(n);
 state.length = n;
 for(i=0;i<n;++i) state[i] = obj.at(i);
};

Note that this class provides all of the numeric string conversion methods since the data shall typically be ak.vector objects[2] which likewise support them. We also directly copy the elements of the arguments of the constructors rather than try to construct new objects from them because we don't know what type of data might be clustered.

ak.clustering

The ak.clustering class, given in listing 6, aggregates these component classes into a complete representation of a clustering, although you'll be forgiven if you can't see quite how it does so.

Listing 6: The ak.clustering Class
ak.CLUSTERING_T = 'ak.clustering';

function Clustering(){}
Clustering.prototype = {TYPE: ak.CLUSTERING_T, valueOf: function(){return ak.NaN;}};

ak.clustering = function() {
 var c = new Clustering();
 var arg0 = arguments[0];

 constructors[ak.nativeType(arg0)](c, arg0, arguments);
 return Object.freeze(c);
};

The reason for the complete lack of references to the components of the clustering is that rather than store them in members of some internal state object and provide access to them with methods of the ak.clustering class we shall store them in members of ak.clustering itself named, appropriately enough, memberships, clusters and data and initialised during construction, hence c being passed as the first argument of the dispatch to the constructors object.

Listing 7 implements construction from an array, which is assumed to represent cluster memberships if its first element is not an array and to represent a set of clusters if it is. We defer to the fromMemberships helper function in the first case and to the fromClusters helper function in the second to initialise the aforementioned memberships and clusters. The optional second argument is assumed to represent the clustered data and, if provided, is used to initialise the data member which is then checked for consistency with the cluster memberships.

Listing 7: The Array Constructor
constructors[ak.ARRAY_T] = function(state, arr, args) {
 var data = args[1];

 arr = arr.slice(0)
 if(ak.nativeType(arr[0])!==ak.ARRAY_T) fromMemberships(state, arr);
 else fromClusters(state, arr);

 if(ak.nativeType(data)!==ak.UNDEFINED_T) {
  state.data = clusterData(data);
  if(state.data.size()!==state.memberships.size()) {
   throw new Error('data/memberships size mismatch in ak.clustering');
  }
 }
};

When constructing from an object we can use the presence of memberships and clusters members to decide how to initialise the clustering, as illustrated by listing 8.

Listing 8: The Object Constructor
constructors[ak.OBJECT_T] = function(state, obj) {
 var c = obj.clusters;
 var m = obj.memberships;
 var d = obj.data;   

 if(ak.nativeType(c)===ak.FUNCTION_T) c = c();
 if(ak.nativeType(m)===ak.FUNCTION_T) m = m();
 if(ak.nativeType(d)===ak.FUNCTION_T) d = d();

 if(ak.nativeType(c)!==ak.UNDEFINED_T) c = clusters(c);
 if(ak.nativeType(m)!==ak.UNDEFINED_T) m = clusterMemberships(m);

 if((!c && !m) || (c && m && !validClustering(c, m))) {
  throw new Error('invalid clustering object in ak.clustering');
 }

 if(m) fromMemberships(state, m.toArray());
 else  fromClusters(state, c.toArray());

 if(ak.nativeType(d)!==ak.UNDEFINED_T) {
  state.data = clusterData(d);
  if(state.data.size()!==state.memberships.size()) {
   throw new Error('data/memberships size mismatch in ak.clustering');
  }
 }
};

Here we allow access to the properties through methods which are simply replaced by their results to transform them into values akin to those that we should expect from members. These values, where defined, are used to initialise the relevant component objects so they may consequently be of any type supported by them. Note that if both memberships and clusters are provided we shall require them to be consistent, as determined by the helper function validClustering.
There are three conditions that must be met for a clustering to be valid; firstly the memberships must refer to valid cluster IDs, secondly the elements of each cluster must be identified in the memberships as belonging to that cluster and finally no datum should appear more than once in the clusters. The validClustering helper function simply checks that these conditions are met, as shown by listing 9.

Listing 9: validClustering
function validClustering(c, m) {
 var nc = c.size();
 var nm = m.size();
 var assigned = new Array(nm);
 var i, j, mi, ci, cin, cij;

 for(i=0;i<nm;++i) {
  mi = m.at(i);
  if(mi!==ak.floor(mi) || mi<0 || mi>=nc) return false;
  assigned[i] = false;
 }

 for(i=0;i<nc;++i) {
  ci = c.at(i);
  cin = ci.size();
  for(j=0;j<cin;++j) {
   cij = ci.at(j);
   if(cij!==ak.floor(cij) || cij<0 || cij>=nm || m.at(cij)!==i || assigned[cij]) {
    return false;
   }
   assigned[cij] = true;
  }
 }

 return true;
}

Finally, Something To Get Our Teeth Into!

Now this has all been pretty straightforward and, let's be honest, rather dull, but fortunately the initialisation of a clustering is a rather more challenging proposition. This is because we shall require that the clusters member is sorted from the largest to the smallest clusters with equally sized clusters being sorted from the lowest to the highest of their lowest member IDs. Furthermore, we shall require that the clusters' member IDs are sorted from the lowest to the highest. Together, these requirements guarantee that clusterings will be unaffected by the order in which the clusters or their members are specified during initialisation.

Listing 10 shows how we do this when initialising with memberships.

Listing 10: fromMemberships
function fromMemberships(state, m) {
 var n = m.length;
 var ids = {};
 var id = 0;
 var i, mi, c, count, inv;

 for(i=0;i<n;++i) {
  mi = Number(m[i]);
  if(mi!==ak.floor(mi) || mi<0) {
   throw new Error('invalid membership in ak.clustering');
  }
  mi = '#'+mi.toFixed(0);
  if(isNaN(ids[mi])) ids[mi] = id++;
 }
 for(i=0;i<n;++i) m[i] = ids['#'+m[i].toFixed(0)];

 count = new Array(id);
 for(i=0;i<id;++i) count[i] = [i, 0];
 for(i=0;i<n;++i) count[m[i]][1] += 1;
 count.sort(function(a,b){return a[1]!==b[1] ? b[1]-a[1] : a[0]-b[0];});

 inv = new Array(id);
 for(i=0;i<id;++i) inv[count[i][0]] = i;
 for(i=0;i<n;++i) m[i] = inv[m[i]];

 c = new Array(id);
 for(i=0;i<id;++i) c[i] = [];
 for(i=0;i<n;++i) c[m[i]].push(i);

 state.clusters = clusters(c);
 state.memberships = clusterMemberships(m);
}

The first thing we do is check that the cluster IDs are valid and relabel them so that they run from zero up to the number of referenced clusters minus one, ensuring that there won't be any empty clusters in the final clustering. This is done by exploiting the array-like member access of the ids object to initialise members named with strings based on the cluster IDs with integers representing the order in which the appear in the memberships.
Next, we populate the count array with two element arrays that associate the new IDs with the number of member that they contain. We then sort them in decreasing order their sizes with ties being decided by the lower ID, which must contain members with lower IDs given that the new IDs were chosen by how early in the memberships the original IDs were referenced.
We then construct a mapping from the compressed cluster IDs to their sorted IDs in inv and use this to relabel the memberships, m, before using them to populate the clusters, c.

Initialisation with clusters is no less convoluted, as illustrated by listing 11.

Listing 11: fromClusters
function fromClusters(state, c) {
 var nc = c.length;
 var nm = 0;
 var i, j, m, n, ci, cij;

 for(i=0;i<nc;++i) {
  ci = c[i];
  if(ak.nativeType(ci)!==ak.ARRAY_T) {
   throw new Error('invalid cluster in ak.clustering');
  }
  nm += ci.length;
  c[i] = ci.concat(Math.min.apply(null, ci));
 }
 m = new Array(nm);

 c.sort(function(a,b){
  return a.length!==b.length ? b.length-a.length : a[a.length-1]-b[b.length-1];
 });

 for(i=0;i<nc;++i) {
  ci = c[i];
  ci.pop();
  ci.sort();
  n = ci.length;
  for(j=0;j<n;++j) {
   cij = Number(ci[j]);
   if(cij!==ak.floor(cij) || cij<0 || cij>=nm || !isNaN(m[cij])) {
    throw new Error('invalid member in ak.clustering');
   }
   m[cij] = i;
  }
 }

 state.clusters = clusters(c);
 state.memberships = clusterMemberships(m);
}

This time we replace the arrays containing the members of each cluster with ones that have had the lowest member ID appended to it whilst we count the number of members and check that their IDs are valid. This enables us to sort the clusters by both their size and, in the case of ties, their lowest member IDs.
We then loop through the sorted clusters, removing the appended lowest member IDs and sorting the rest of them before populating the memberships with their elements, checking that they are valid IDs and are unique. Note that we don't need to check that every member is associated with a cluster since if one were not another would have had to have turned up more than once in the clusters.

Hopefully the subtlety of the initialisation of ak.clustering, demonstrated by program 1 to yield a unique, canonical representation of a clustering, has made up for the boring triviality of that representation.

Program 1: Using ak.clustering

In the next post we shall use this class, which can be found in Clustering.js, to represent the results of our first clustering algorithm; a rather more interesting topic!

References

[1] Climbing Mount Impractical, www.thusspakeak.com, 2013.

[2] What's Our Vector, Victor?, www.thusspakeak.com, 2014.

Leave a comment