June 2019 Archives

Making The Clade

We have so far seen a couple of schemes for identifying clusters in sets of data which are subsets whose members are in some sense similar to each other. Specifically, we have looked at the k means and the shared near neighbours algorithms which implicitly define clusters by the closeness of each datum to the average of each cluster and by their closeness to each other respectively.
Note that both of these algorithms use a heuristic, or rule of thumb, to assign data to clusters, but there's another way to construct clusterings; define a heuristic to measure how close to each other a pair of clusters are and then, starting with each datum in a cluster of its own, progressively merge the closest pairs until we end up with a single cluster containing all of the data. This means that we'll end up with a sequence of clusterings and so before we can look at such algorithms we'll need a structure to represent them.

Full text...