September 2015 Archives

Try Another K

In our exploration of the curious subject of cluster analysis, in which the goal is to classify a set of data into subsets of similar data without having a rigorous mathematical definition of what that actually means, we have covered the k means algorithm, which implicitly defines a clustering as one that minimises the sum of squared distances of the members of the clusters from their means. We have also proposed that we might compare clusterings by the amount of the variance in the data that they account for.
Unfortunately, it turned out that trying to identify the actual number of clusters in the data using the accounted-for variance was a rather subjective business, and so in this post we shall see if we can do any better.
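To make the notion of accounted-for variance a little more concrete, here is a minimal Python sketch. It assumes that the measure is one minus the ratio of the within-cluster sum of squared distances to the total sum of squared distances from the overall mean. The function names `mean`, `sum_sq_dist` and `accounted_variance` are hypothetical, chosen for illustration, and are not taken from the posts themselves.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sum_sq_dist(points, centre):
    """Sum of squared Euclidean distances of the points from centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)


def accounted_variance(points, labels):
    """Fraction of the total squared deviation explained by a clustering.

    Assumption: defined as 1 - within / total, where within is the quantity
    that k means tries to minimise. The points must not all be identical,
    since total would then be zero.
    """
    total = sum_sq_dist(points, mean(points))

    # Group the points by their cluster label.
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    within = sum(sum_sq_dist(members, mean(members))
                 for members in clusters.values())
    return 1.0 - within / total


# Two tight, well separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = accounted_variance(points, [0, 0, 1, 1])  # pairs kept together
poor = accounted_variance(points, [0, 1, 0, 1])  # pairs split apart
```

Under this definition a single cluster accounts for none of the variance and giving every point its own cluster accounts for all of it, so the measure can only increase as clusters are split further. That goes some way towards explaining why reading the number of clusters from it involves a judgement call.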

Full text...
