Recently in Clustering Category

Nearest And Dearest

Last time we saw how we could use the list of the nearest neighbours of a datum in a set of clustered data to calculate the strengths of its association with those clusters. Specifically, we used the k nearest neighbours algorithm to define those strengths as the proportions of its k nearest neighbours that were members of each cluster, or, in a generalisation of it, as the sums of weights assigned to the neighbours according to their positions in the list.
This time we shall take a look at a clustering algorithm that uses nearest neighbours to identify clusters, contrasting it with the k means clustering algorithm that we covered about four years ago.
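By way of a refresher, the following is a minimal sketch of how those strengths might be computed, with a hypothetical association_strengths function using squared Euclidean distances; leaving out the weights yields the plain proportions, whilst passing a list of, say, decreasing weights yields the position-weighted generalisation.

    from collections import Counter

    def association_strengths(datum, data, labels, k, weights=None):
        # order the indices of the clustered data by squared Euclidean
        # distance from the datum
        by_distance = sorted(
            range(len(data)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(data[i], datum)))
        # uniform weights recover the plain proportions of the k nearest
        # neighbours that fall in each cluster
        if weights is None:
            weights = [1.0] * k
        total = sum(weights)
        strengths = Counter()
        for position, i in enumerate(by_distance[:k]):
            strengths[labels[i]] += weights[position] / total
        return dict(strengths)

    data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 1.1), (1.2, 0.9)]
    labels = [0, 0, 1, 1, 1]
    print(association_strengths((0.2, 0.1), data, labels, 3))
    print(association_strengths((0.2, 0.1), data, labels, 3, [3.0, 2.0, 1.0]))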

Full text...  

In The Neighbourhood

A little under four years ago we saw how we could use the k means algorithm to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as whatever subsets the algorithm itself happens to identify.
The k means algorithm, for example, does so by first picking k different elements of the data as cluster representatives and then placing every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assigned to them and the process is repeated until the clusters stop changing.
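The scheme is simple enough to sketch in a few lines of Python; note that this is a bare-bones illustration rather than the implementation from that post, using randomly chosen initial representatives and plain squared Euclidean distances.

    import random

    def nearest(datum, representatives):
        # index of the representative with the smallest squared Euclidean
        # distance from the datum
        return min(
            range(len(representatives)),
            key=lambda j: sum(
                (x - r) ** 2 for x, r in zip(datum, representatives[j])))

    def k_means(data, k):
        # pick k different elements of the data as the initial representatives
        representatives = random.sample(data, k)
        assignments = None
        while True:
            # place every element in the cluster whose representative is
            # nearest to it
            new_assignments = [nearest(datum, representatives) for datum in data]
            # stop once the clusters no longer change
            if new_assignments == assignments:
                return assignments, representatives
            assignments = new_assignments
            # replace each representative with the mean of the elements
            # assigned to it
            for j in range(k):
                members = [d for d, a in zip(data, assignments) if a == j]
                if members:
                    representatives[j] = tuple(
                        sum(xs) / len(members) for xs in zip(*members))

The returned assignments simply give, for each element of the data, the index of the cluster into which it was finally placed.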
Now I'd like to introduce some more clustering algorithms, but there are a few things that we'll need first.

Full text...  

Try Another K

In our exploration of the curious subject of cluster analysis, in which the goal is to classify a set of data into subsets of similar data without having a rigorous mathematical definition of what that actually means, we have covered the k means algorithm, which implicitly defines a clustering as one that minimises the sum of the squared distances of the members of the clusters from their means. We have also proposed that we might compare clusterings by the amount of the variance in the data that they account for.
Unfortunately, it turned out that trying to identify the actual number of clusters in the data using the accounted-for variance was a rather subjective business and so in this post we shall see if we can do any better.
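One way to make that concrete, and I should stress that this is just a sketch assuming that the measure in question is the between-cluster share of the total sum of squares, is to define the accounted-for variance as one minus the ratio of the within-cluster sum of squared distances from the cluster means to the total sum of squared distances from the overall mean.

    def mean(points):
        # coordinate-wise mean of a list of tuples
        return tuple(sum(xs) / len(points) for xs in zip(*points))

    def sum_of_squares(points, centre):
        # sum of squared Euclidean distances of the points from the centre
        return sum(
            sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)

    def accounted_for_variance(data, assignments):
        # one minus the within-cluster sum of squares as a fraction of the
        # total sum of squares about the overall mean
        total = sum_of_squares(data, mean(data))
        within = 0.0
        for label in set(assignments):
            members = [d for d, a in zip(data, assignments) if a == label]
            within += sum_of_squares(members, mean(members))
        return 1.0 - within / total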

Full text...  

A Good K To Try

We have seen how the k means algorithm can classify a set of data into k subsets of mutually similar data with the simple iterative scheme of placing each datum into the cluster whose representative it is closest to and then replacing those representatives with the means of the data in each cluster. Whilst this has a reasonably intuitive implicit definition of similarity, it also has the unfortunate problem that we need to know how many clusters there are in the data if we are to have any hope of correctly identifying them.
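To see why that matters, the snippet below, which assumes that the k_means sketch from In The Neighbourhood above is in scope, clusters three well separated pairs of points with too few, the right number of, and too many representatives; with k set to two, two of the true clusters must end up sharing a label and with it set to four one of them must be split, although the exact labels will depend upon the randomly chosen initial representatives.

    # three well separated clusters in the plane
    data = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9),
            (10.0, 0.0), (10.1, 0.2)]

    # cluster with too few, just enough and too many representatives
    for k in (2, 3, 4):
        assignments, _ = k_means(data, k)
        print(k, assignments)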

Full text...  

K Means Business

Last time we took a first brief look at cluster analysis, in which we seek to algorithmically separate sets of data into subsets of similar data. The difficulty in doing so stems from the fact that similarity is a rather poorly defined concept and so clustering algorithms typically proceed by updating cluster memberships according to some rule of thumb, or heuristic, that is designed to reflect some intuitive notion of similarity. As a consequence, clusters have the rather unusually circular definition of being that which is identified by a clustering algorithm!

Full text...  

Some Of These Things Are Not Like The Others

Of all of the problems that we might attack with numerical methods, one of my favourites is cluster analysis: the task of dividing a set of data into subsets of similar data. Now I'm fully aware that my having favourite applications of numerical methods might lead you to suspect that I live at the most cringingly awkward end of nerd street, but please bear with me; the reason that I find cluster analysis so appealing, and that I hope you will too, is that it is a rare example of a mathematical problem that we struggle to even properly define.

Full text...