We have seen how the

*k*means algorithm can classify a set of data into*k*subsets of mutually similar data with the simple iterative scheme of placing each datum into the cluster whose representative it is closest to and then replacing those representatives with the means of the data in each cluster. Whilst this has a reasonably intuitive implicit definition of similarity it also has the unfortunate problem that we need to know how many clusters there are in the data if we are to have any hope of correctly identifying them.Full text...