A little under four years ago we saw how we could use the \(k\) means algorithm[1] to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that, even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as the subsets that they themselves identify.
The \(k\) means algorithm, for example, does so by first picking \(k\) different elements of the data as cluster representatives and then placing every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assigned to them and the process is repeated until the clusters stop changing.
Now I'd like to introduce some more clustering algorithms but there are a few things that we'll need first.
Distance Relationships
A key component in many clustering algorithms is the distance matrix whose elements are the distances between elements of the data. For example, if the data are a set of \(n\) vectors \(\mathbf{x}_i\) then we could define an \(n\) by \(n\) distance matrix \(\mathbf{M}\) with the elements
\[
m_{ij} = \left|\mathbf{x}_i - \mathbf{x}_j\right|
\]
where the vertical bars stand for the magnitude of the vector between them. Note that we're not required to define distance in this way but we will typically expect its definition to have similar properties, such as
\[
\begin{align*}
m_{ii} &= 0\\
m_{ij} &\geqslant 0\\
m_{ji} &= m_{ij}
\end{align*}
\]
and sometimes even that it satisfies the triangle inequality, which states that for a pair of vectors \(\mathbf{x}\) and \(\mathbf{y}\)
\[
\left|\mathbf{x}+\mathbf{y}\right| \leqslant \left|\mathbf{x}\right|+\left|\mathbf{y}\right|
\]
so that
\[
\begin{align*}
\left|\mathbf{x}_i-\mathbf{x}_j\right|
&= \left|\left(\mathbf{x}_i-\mathbf{x}_k\right)+\left(\mathbf{x}_k-\mathbf{x}_j\right)\right|
\leqslant \left|\mathbf{x}_i-\mathbf{x}_k\right|+\left|\mathbf{x}_k-\mathbf{x}_j\right|\\
m_{ij} &\leqslant m_{ik} + m_{kj}
\end{align*}
\]
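Before generalising, it may help to see these properties in action. The following plain-JavaScript sketch (hypothetical dist and distanceMatrix helpers, not the ak implementation introduced below) computes the Euclidean distance matrix of three collinear vectors:

```javascript
// Euclidean distance between two vectors held in plain arrays.
function dist(x, y) {
  var sum = 0, i;
  for (i = 0; i < x.length; ++i) sum += (x[i] - y[i]) * (x[i] - y[i]);
  return Math.sqrt(sum);
}

// The n by n distance matrix with elements m[i][j] = |x_i - x_j|.
function distanceMatrix(data) {
  return data.map(function(xi) {
    return data.map(function(xj) { return dist(xi, xj); });
  });
}

var m = distanceMatrix([[0, 0], [3, 4], [6, 8]]);
console.log(m[0][1]); // 5
console.log(m[1][2]); // 5
console.log(m[0][2]); // 10
```

Here the diagonal is zero, the matrix is symmetric and, since the three points are collinear, the triangle inequality holds with equality: m[0][2] equals m[0][1] plus m[1][2].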
We can trivially generalise this concept to capture the distances between a set of \(n\) vectors \(\mathbf{x}_i\) and another set of \(m\) vectors \(\mathbf{y}_j\) with an \(n\) by \(m\) matrix having the elements
\[
m_{ij} = \left|\mathbf{x}_i - \mathbf{y}_j\right|
\]
in which case our only expectation is that they are nonnegative.

To define a distance matrix for sets of objects other than vectors we simply need to define a function \(f\) that measures the distance between a pair of them and set the elements to
\[
m_{ij} = f\left(x_i, y_j\right)
\]
as done by the ak.distanceMatrix function defined in listing 1.

Listing 1: ak.distanceMatrix
ak.distanceMatrix = function(lhs, rhs, dist) {
 var lt = ak.type(lhs);
 var rt = ak.type(rhs);
 var lat, rat, rows, cols, a, irc, icr, row, col, d;

 if(lt!==ak.ARRAY_T && lt!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid lhs in ak.distanceMatrix');
 }
 if(rt!==ak.ARRAY_T && rt!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid rhs in ak.distanceMatrix');
 }

 if(ak.nativeType(dist)===ak.UNDEFINED_T) dist = ak.dist;
 else if(ak.nativeType(dist)!==ak.FUNCTION_T) {
  throw new Error('invalid distance measure in ak.distanceMatrix');
 }

 lat = lt===ak.ARRAY_T ? function(i){return lhs[i];} : lhs.at;
 rat = rt===ak.ARRAY_T ? function(i){return rhs[i];} : rhs.at;
 rows = lt===ak.ARRAY_T ? lhs.length : lhs.size();
 cols = rt===ak.ARRAY_T ? rhs.length : rhs.size();
 a = new Array(rows*cols);

 if(lhs===rhs) {
  irc = 0;
  for(row=0;row<rows;++row) {
   icr = irc;
   a[irc] = 0;
   for(col=row+1;col<cols;++col) {
    d = Number(dist(lat(row), rat(col)));
    if(d<0) throw new Error('invalid distance in ak.distanceMatrix');
    a[++irc] = a[icr+=cols] = d;
   }
   irc += row+2;
  }
 }
 else {
  irc = 0;
  for(row=0;row<rows;++row) {
   for(col=0;col<cols;++col) {
    d = Number(dist(lat(row), rat(col)));
    if(d<0) throw new Error('invalid distance in ak.distanceMatrix');
    a[irc++] = d;
   }
  }
 }
 return ak.matrix(rows, cols, a);
};
The first thing to note is that we're allowing the data to be contained in either JavaScript arrays or our ak.clusterData type and that if the distance function dist isn't defined then we default to our overloaded ak.dist function. Furthermore, since accessing elements of arrays and of ak.clusterData objects takes different forms, we're defining the lat and rat functions to extract the elements of lhs and rhs respectively.

If lhs and rhs are the same array or ak.clusterData object then we're cutting down on the number of calls to dist by assuming that it meets our expectations that \(m_{ii}\) equals zero and that \(m_{ji}\) equals \(m_{ij}\). Since we're initialising the ak.matrix object that the function returns with an array a containing its elements in row-wise order, this means that a[c*cols+r] should equal a[r*cols+c]. Note that instead of indexing those elements with c*cols+r and r*cols+c, we're iterating over them with icr and irc, adding cols to the former and incrementing the latter at each step and adding row+2 to irc at the end of each row to jump to the diagonal element of the next. We're also explicitly setting the diagonal elements of the distance matrix to zero and throwing an error if any of the results of dist are negative.

Otherwise, the only expectation is that the elements should be nonnegative and so we must calculate every element directly, again throwing an error if any of them are negative.

Now if you look carefully then you'll notice that we're not treating NaN as an error when we're checking that the results of dist are nonnegative; this is because it allows us to use NaN to mean incomparable.

Program 1 demonstrates how to use ak.distanceMatrix, which can be found in DistanceMatrix.js, to calculate the distance matrix for a set of three vectors, represented as an array of ak.vector objects.
Program 1: A Distance Matrix



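Since Program 1 is interactive and not reproduced here, the following hedged sketch (plain JavaScript, with a hypothetical selfDistanceMatrix helper rather than the ak implementation) illustrates the symmetric shortcut that ak.distanceMatrix takes when lhs and rhs are the same object, filling a flat row-wise array through the irc and icr indices:

```javascript
// Euclidean distance between two points in the plane.
function dist(x, y) {
  var dx = x[0] - y[0], dy = x[1] - y[1];
  return Math.sqrt(dx * dx + dy * dy);
}

// Fill a flat row-wise n by n array with distances, computing each
// off-diagonal distance just once and mirroring it across the diagonal.
function selfDistanceMatrix(data, dist) {
  var n = data.length;
  var a = new Array(n * n);
  var irc = 0, icr, row, col, d;
  for (row = 0; row < n; ++row) {
    icr = irc;
    a[irc] = 0;                    // the diagonal element is zero
    for (col = row + 1; col < n; ++col) {
      d = dist(data[row], data[col]);
      a[++irc] = a[icr += n] = d;  // set m_rc and its mirror m_cr
    }
    irc += row + 2;                // jump to the next row's diagonal
  }
  return a;
}

var a = selfDistanceMatrix([[0, 0], [3, 4], [6, 8]], dist);
console.log(a); // [0, 5, 10, 5, 0, 5, 10, 5, 0]
```

Reading the result three elements at a time as rows recovers the symmetric matrix with a zero diagonal that we expect.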
Everybody Needs Near Neighbours
Another useful concept in clustering is that of the nearest neighbours of a datum; the subset of a set of data that are closest to it, typically arranged in order of increasing distance. To find it we simply associate the indices of the data with their distances to the datum and then sort them by those distances.

The ak.nearestNeighbours function defined in listing 2 finds the \(k\) nearest neighbours of each element from one set of data \(S_0\) in another \(S_1\), using ak.distanceMatrix to represent the distances between them.

Listing 2: ak.nearestNeighbours
function compare(ix0, ix1) {
 return ak.floatCompare(ix0[1], ix1[1]);
}

ak.nearestNeighbours = function(a0, a1, a2, a3) {
 var t0 = ak.type(a0);
 var distances, k, rows, cols, row, col, a;

 if(t0===ak.MATRIX_T) {
  distances = a0;
  k = a1;
 }
 else if(ak.nativeType(a2)===ak.NUMBER_T) {
  distances = ak.distanceMatrix(a0, a1, a3);
  k = a2;
 }
 else {
  distances = ak.distanceMatrix(a0, a1, a2);
 }

 rows = distances.rows();
 cols = distances.cols();

 if(ak.nativeType(k)===ak.UNDEFINED_T) {
  k = cols;
 }
 else if(ak.nativeType(k)!==ak.NUMBER_T || ak.floor(k)!==k || k<0 || k>cols) {
  throw new Error('invalid neighbourhood size in ak.nearestNeighbours');
 }

 distances = distances.toArray();
 for(row=0;row<rows;++row) {
  a = distances[row];
  for(col=0;col<cols;++col) a[col] = [col, a[col]];
  ak.partialSort(a, k, compare);
  a.length = k;
  for(col=0;col<k;++col) a[col] = a[col][0];
 }
 return distances;
};
The least clear thing about this function is the way that it handles its arguments, so it bears some explanation.

If the first argument a0 is an ak.matrix object then we assume that it's the distance matrix from \(S_0\) to \(S_1\) and that a1, if defined, is \(k\).

If a2 is a number then we assume that it's \(k\), that a0 and a1 are arrays or ak.clusterData objects containing the elements of \(S_0\) and \(S_1\) and that a3, if defined, is a function to calculate the distances between the elements of the former and those of the latter. We then use ak.distanceMatrix to calculate the distance matrix between \(S_0\) and \(S_1\), relying upon it to default to the distance measure of ak.dist if the function is undefined.

Otherwise we assume that a0 is an array of the elements of \(S_0\), that a1 is an array of the elements of \(S_1\) and that a2, if defined, is a distance function, again using these to calculate a distance matrix between the former and the latter.

Finally, if \(k\) is undefined then we set it to cols, which is the size of \(S_1\), so that all of its elements are ranked as neighbours of each element of \(S_0\), otherwise throwing an error if it's not a positive integer that's no greater than its size.

We can summarise all of this for a distance matrix m, a neighbourhood size k, data sets s0 and s1 and a distance function f with
ak.nearestNeighbours(m, k);
ak.nearestNeighbours(m); //k = m.cols();
ak.nearestNeighbours(s0, s1, k); //m = distanceMatrix(s0, s1);
ak.nearestNeighbours(s0, s1); //m = distanceMatrix(s0, s1); k = m.cols();
ak.nearestNeighbours(s0, s1, k, f); //m = distanceMatrix(s0, s1, f);
ak.nearestNeighbours(s0, s1, f); //m = distanceMatrix(s0, s1, f); k = m.cols();
Once we have the distance matrix we simply convert it to an array with toArray and associate each distance in each row with its index, then partially sort them with respect to their distances with ak.partialSort using compare, truncate them to a length of \(k\) and finally set their elements to the ranked indices.

Note that the compare function is using ak.floatCompare so that NaNs are treated as greater than any other numbers and consequently incomparable elements are ranked last.

Program 2 uses ak.nearestNeighbours, defined in NearestNeighbours.js, to find the three nearest neighbours of each vector in x among the vectors in y.
Program 2: A Nearest Neighbour Table



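The underlying idea is easy to sketch in stand-alone form. This hypothetical nearestNeighbours helper (plain JavaScript, using a full sort rather than ak.partialSort and a simple numeric comparison rather than ak.floatCompare's NaN handling) pairs each index with its distance, sorts and truncates:

```javascript
// Euclidean distance between two points in the plane.
function dist(x, y) {
  var dx = x[0] - y[0], dy = x[1] - y[1];
  return Math.sqrt(dx * dx + dy * dy);
}

// Return, for each element of s0, the indices of its k nearest
// neighbours in s1, ranked by increasing distance.
function nearestNeighbours(s0, s1, k, dist) {
  return s0.map(function(x) {
    var pairs = s1.map(function(y, j) { return [j, dist(x, y)]; });
    pairs.sort(function(p, q) { return p[1] - q[1]; });
    return pairs.slice(0, k).map(function(p) { return p[0]; });
  });
}

var s0 = [[0, 0]];
var s1 = [[5, 0], [1, 0], [3, 0], [2, 0]];
console.log(nearestNeighbours(s0, s1, 2, dist)); // [[1, 3]]
```

Note that a full sort does more work than necessary when \(k\) is small, which is one reason the ak implementation uses ak.partialSort instead.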
Class Divisions
If we have a clustering of a set of data then we can classify a datum using its nearest neighbours in them. Specifically, if \(k_i\) of its \(k\) nearest neighbours are members of the \(i^\mathrm{th}\) cluster then we can use \(p_i=\frac{k_i}{k}\) as a measure of how strongly it's associated with that cluster.

For example, given the clusters
\[
\begin{align*}
C_0 &= \{1,2,5,6,11,12,13,15\}\\
C_1 &= \{3,7,9,10,14,16,18\}\\
C_2 &= \{0,4,8,17,19\}
\end{align*}
\]
and the five nearest neighbours of a datum
\[
n = [15,0,13,1,18]
\]
then it has cluster association strengths of
\[
\begin{align*}
p_0 &= \tfrac35\\
p_1 &= \tfrac15\\
p_2 &= \tfrac15
\end{align*}
\]
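This counting is easy to check with a few lines of plain JavaScript; the knnStrengths helper here is hypothetical rather than the ak implementation, with the clusters and neighbours taken from the example:

```javascript
// Uniform k nearest neighbours association strengths: the fraction of
// the neighbours that fall within each cluster.
function knnStrengths(neighbours, clusters) {
  return clusters.map(function(c) {
    var ki = neighbours.filter(function(n) { return c.indexOf(n) >= 0; }).length;
    return ki / neighbours.length;
  });
}

var clusters = [
  [1, 2, 5, 6, 11, 12, 13, 15],
  [3, 7, 9, 10, 14, 16, 18],
  [0, 4, 8, 17, 19]
];
var p = knnStrengths([15, 0, 13, 1, 18], clusters);
console.log(p); // [0.6, 0.2, 0.2]
```

Three of the five neighbours lie in the first cluster and one in each of the others, recovering the strengths given above.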
and, by picking the greatest, we can classify it as belonging to the class of data represented by the first cluster.

This is known, somewhat unimaginatively, as the \(k\) nearest neighbours classification algorithm and is one of the earliest examples of machine learning. In fact, to this day machine learning classification algorithms typically output arrays of class association strengths, the greatest of which is used to identify the class of the input.
We can generalise this algorithm to give greater importance to nearer neighbours by defining a set of neighbour weights \(w_j\) and calculating the datum's strength of association with the \(i^\mathrm{th}\) cluster by summing the weights of its nearest neighbours that are members of that cluster and dividing the result by the sum of all of the weights. In mathematical notation we can express this as
\[
p_i = \frac{\sum_{j=0}^{k-1} \mathbf{1}_{C_i}\left(n_j\right) \times w_j}{\sum_{j=0}^{k-1} w_j}
\]
where \(\sum\) is the summation sign and \(\mathbf{1}_{C_i}\) is an indicator function, which is defined for a set \(X\) by
\[
\mathbf{1}_X(x) =
\begin{cases}
1 & x \in X\\
0 & x \notin X
\end{cases}
\]
where \(\in\) means in and \(\notin\) means not in, and so equals one if the \(j^\mathrm{th}\) nearest neighbour is a member of the \(i^\mathrm{th}\) cluster and zero otherwise.

For example, if we were to choose the weights
\[
w = \left[\tfrac12, \tfrac14, \tfrac18, \tfrac1{16}, \tfrac1{32}\right]
\]
then we'd have
\[
\sum_{j=0}^4 w_j = \tfrac{31}{32}
\]
and
\[
\begin{align*}
p_0 &= \frac{1 \times \tfrac12 + 0 \times \tfrac14 + 1 \times \tfrac18 + 1 \times \tfrac1{16} + 0 \times \tfrac1{32}}{\tfrac{31}{32}}
= \frac{1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 0 \times 1}{31}
= \frac{22}{31}\\
p_1 &= \frac{0 \times \tfrac12 + 0 \times \tfrac14 + 0 \times \tfrac18 + 0 \times \tfrac1{16} + 1 \times \tfrac1{32}}{\tfrac{31}{32}}
= \frac{0 \times 16 + 0 \times 8 + 0 \times 4 + 0 \times 2 + 1 \times 1}{31}
= \frac1{31}\\
p_2 &= \frac{0 \times \tfrac12 + 1 \times \tfrac14 + 0 \times \tfrac18 + 0 \times \tfrac1{16} + 0 \times \tfrac1{32}}{\tfrac{31}{32}}
= \frac{0 \times 16 + 1 \times 8 + 0 \times 4 + 0 \times 2 + 0 \times 1}{31}
= \frac8{31}
\end{align*}
\]
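As a check of this arithmetic, here's a short sketch (a hypothetical weightedStrengths helper, not part of the ak library) that sums the weights of the neighbours falling in each cluster and divides by the total weight:

```javascript
// Weighted k nearest neighbours association strengths: each neighbour
// contributes its weight to the cluster that it belongs to.
function weightedStrengths(neighbours, clusters, weights) {
  var total = weights.reduce(function(s, w) { return s + w; }, 0);
  return clusters.map(function(c) {
    var sum = 0;
    neighbours.forEach(function(n, j) {
      if (c.indexOf(n) >= 0) sum += weights[j];
    });
    return sum / total;
  });
}

var clusters = [
  [1, 2, 5, 6, 11, 12, 13, 15],
  [3, 7, 9, 10, 14, 16, 18],
  [0, 4, 8, 17, 19]
];
var w = [1/2, 1/4, 1/8, 1/16, 1/32];
var p = weightedStrengths([15, 0, 13, 1, 18], clusters, w);
console.log(p); // [22/31, 1/31, 8/31], roughly [0.710, 0.032, 0.258]
```

Halving the weight at each step means that the nearest neighbour alone carries more than half of the total, which is why the association with the first cluster is so much stronger than in the uniform case.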
Both variants of the algorithm are implemented by the ak.knnClasses function in listing 3.

Listing 3: ak.knnClasses
ak.knnClasses = function(candidates, arg1, arg2, arg3, arg4) {
 var t0 = ak.type(candidates);
 var sum = 0;
 var t1, data, clusters, memberships, weights, wi, neighbours, classes, i;

 if(t0!==ak.ARRAY_T && t0!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid candidates in ak.knnClasses');
 }

 t1 = ak.type(arg1);
 if(t1===ak.ARRAY_T || t1===ak.CLUSTER_DATA_T) {
  data = arg1;
  if(ak.type(arg2)!==ak.CLUSTERING_T) {
   throw new Error('invalid clustering in ak.knnClasses');
  }
  clusters = arg2.clusters;
  memberships = arg2.memberships;
  arg2 = arg3;
  arg3 = arg4;
 }
 else if(t1===ak.CLUSTERING_T) {
  data = arg1.data;
  clusters = arg1.clusters;
  memberships = arg1.memberships;
 }
 else throw new Error('invalid clustering in ak.knnClasses');

 if(ak.nativeType(arg2)===ak.ARRAY_T) {
  weights = arg2;
  for(i=0;i<weights.length;++i) {
   wi = weights[i];
   if(ak.nativeType(wi)!==ak.NUMBER_T || wi<0 || !isFinite(wi)) {
    throw new Error('invalid weight in ak.knnClasses');
   }
   sum += wi;
  }
  if(!isFinite(sum)) throw new Error('nonfinite weight in ak.knnClasses');
  arg2 = weights.length;
 }

 neighbours = ak.nearestNeighbours(candidates, data, arg2, arg3);
 classes = new Array(candidates.length);

 if(sum>0) {
  for(i=0;i<candidates.length;++i) {
   classes[i] = classifyWeighted(neighbours[i], memberships, clusters.size(), weights, sum);
  }
 }
 else {
  for(i=0;i<candidates.length;++i) {
   classes[i] = classifyUniform(neighbours[i], memberships, clusters.size());
  }
 }
 return classes;
};
Once again, most of the code is dealing with the arguments and needs some explanation.
First of all, the candidates are expected to be an array or an ak.clusterData object containing the data to be classified.

Next, if arg1 is also an array or an ak.clusterData object then we assume that it's the clustered data that we'll use to classify the candidates and require that arg2 is its clustering, referencing its clusters and memberships objects before shifting arg3 and arg4 to the left.

Otherwise, we require it to be an ak.clustering and reference its data, clusters and memberships.

If arg2 is an array then we assume that it contains the weights, saving them as weights, calculating their sum and setting arg2 to their length.

Finally, we calculate the nearest neighbours of the candidates in the data with arg2 representing \(k\) and arg3 the distance function, if defined.

In summary, this means that we can classify a set of candidates s using data d, a clustering c, number of nearest neighbours k, weights w and distance function f with any of
ak.knnClasses(s, d, c, k, f); //n = ak.nearestNeighbours(s, d, k, f);
ak.knnClasses(s, d, c, k); //n = ak.nearestNeighbours(s, d, k);
ak.knnClasses(s, d, c, w, f); //n = ak.nearestNeighbours(s, d, w.length, f);
ak.knnClasses(s, d, c, w); //n = ak.nearestNeighbours(s, d, w.length);
ak.knnClasses(s, d, c, f); //n = ak.nearestNeighbours(s, d, f);
ak.knnClasses(s, d, c); //n = ak.nearestNeighbours(s, d);
ak.knnClasses(s, c, k, f); //n = ak.nearestNeighbours(s, c.data(), k, f);
ak.knnClasses(s, c, k); //n = ak.nearestNeighbours(s, c.data(), k);
ak.knnClasses(s, c, w, f); //n = ak.nearestNeighbours(s, c.data(), w.length, f);
ak.knnClasses(s, c, w); //n = ak.nearestNeighbours(s, c.data(), w.length);
ak.knnClasses(s, c, f); //n = ak.nearestNeighbours(s, c.data(), f);
ak.knnClasses(s, c); //n = ak.nearestNeighbours(s, c.data());
If sum equals zero then either all of the weights are equal to zero or they weren't specified, and we use the basic version of the \(k\) nearest neighbours classification algorithm, implemented by classifyUniform in listing 4.

Listing 4: Uniform Classification
function classifyUniform(neighbours, memberships, n) {
 var k = neighbours.length;
 var classes = new Array(n);
 var scale = 1/k;
 var i;

 for(i=0;i<n;++i) classes[i] = 0;
 for(i=0;i<k;++i) classes[memberships.at(neighbours[i])] += scale;
 return classes;
}
Note that we're using the cluster memberships to identify which cluster each neighbour belongs to.

If sum is greater than zero then we use the generalised version instead, implemented by classifyWeighted in listing 5.

Listing 5: Weighted Classification
function classifyWeighted(neighbours, memberships, n, weights, sum) {
 var k = neighbours.length;
 var classes = new Array(n);
 var scale = 1/sum;
 var i;

 for(i=0;i<n;++i) classes[i] = 0;
 for(i=0;i<k;++i) classes[memberships.at(neighbours[i])] += weights[i]*scale;
 return classes;
}
Program 3 shows how ak.knnClasses classifies the points in the plane without weights with respect to three clusters generated by our ak.kMeansClustering function, drawing the points in the clusters in red, green and blue and giving the classified points a mix of those colours determined by their strengths of association with the clusters, exaggerated by taking their square roots.
Program 3: Uniform Classification Of The Plane



The blended colours of the points that lie between the clusters and the monochrome colours of the points that are close to them clearly show how the \(k\) nearest neighbours algorithm is making associations.
Note that you can add new points to the clusters by clicking on the chart area and clear them by clicking the button to the left of the reset button.
In contrast, program 4 shows how a weighting scheme like the one we used in the example of the generalised algorithm makes the associations with the clusters much stronger.
Program 4: Weighted Classification Of The Plane



You can find the implementation of ak.knnClasses in KNNClasses.js and in the next post we shall take a look at a way that we can use nearest neighbours to generate clusterings.