Recently in Clustering Category

Try Another K

In our exploration of the curious subject of cluster analysis, in which the goal is to classify a set of data into subsets of similar data without having a rigorous mathematical definition of what that actually means, we have covered the k means algorithm that implicitly defines a clustering as minimising the sum of squared distances of the members of the clusters from their means and have proposed that we might compare clusterings by the amount of the variance in the data that they account for.
Unfortunately, it turned out that trying to identify the actual number of clusters in the data using the accounted for variance was a rather subjective business and so in this post we shall see if we can do any better.

Full text...

submit to reddit  

A Good K To Try

We have seen how the k means algorithm can classify a set of data into k subsets of mutually similar data with the simple iterative scheme of placing each datum into the cluster whose representative it is closest to and then replacing those representatives with the means of the data in each cluster. Whilst this has a reasonably intuitive implicit definition of similarity it also has the unfortunate problem that we need to know how many clusters there are in the data if we are to have any hope of correctly identifying them.

Full text...

submit to reddit  

K Means Business

Last time we took a first brief look at cluster analysis in which we seek to algorithmically differentiate sets of data into subsets of similar data. The difficulty in doing so stems from the fact that similarity is a rather poorly defined concept and so clustering algorithms typically proceed by updating cluster memberships according to some rule of thumb, or heuristic, that is designed to reflect some intuitive notion of similarity. As a consequence, clusters have the rather unusually circular definition of that which are identified by a clustering algorithm!

Full text...

submit to reddit  

Some Of These Things Are Not Like The Others

Of all of the problems that we might attack with numerical methods, one of my favourites is cluster analysis; the task of dividing a set of data into subsets of similar data. Now I'm fully aware that my having favourite applications of numerical methods might lead you to suspect that I live at the most cringingly awkward end of nerd street, but please bear with me; the reason that I find cluster analysis so appealing, and that I hope you will too, is that it is a rare example of a mathematical problem that we struggle to even properly define.

Full text...

submit to reddit