In The Neighbourhood


A little under four years ago we saw how we could use the \(k\) means algorithm[1] to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that, even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as the subsets that they themselves identify.
The \(k\) means algorithm, for example, does so by first picking \(k\) different elements of the data as cluster representatives and then placing every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assigned to them and the process is repeated until the clusters stop changing.
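For illustration, a minimal sketch of that iteration for one dimensional data, using plain numbers and arrays rather than our ak library types and simply taking the first \(k\) data as the initial representatives, might look something like

// a minimal sketch of the k means iteration for one dimensional data; note that
// this is an illustration rather than the ak.kMeansClustering implementation
function kMeansSketch(data, k) {
 var reps = data.slice(0, k);
 var members = new Array(data.length);
 var changed, sums, counts, mean, i, j, r;

 do {
  // place each datum in the cluster whose representative is nearest to it
  for(i=0;i<data.length;++i) {
   r = 0;
   for(j=1;j<k;++j) if(Math.abs(data[i]-reps[j])<Math.abs(data[i]-reps[r])) r = j;
   members[i] = r;
  }

  // replace each representative with the mean of the elements assigned to it
  sums = new Array(k); counts = new Array(k);
  for(j=0;j<k;++j) {sums[j] = 0; counts[j] = 0;}
  for(i=0;i<data.length;++i) {sums[members[i]] += data[i]; ++counts[members[i]];}

  changed = false;
  for(j=0;j<k;++j) {
   mean = counts[j]>0 ? sums[j]/counts[j] : reps[j];
   if(mean!==reps[j]) {reps[j] = mean; changed = true;}
  }
 } while(changed);
 return members;
}

kMeansSketch([0, 1, 2, 10, 11, 12], 2); // [0, 0, 0, 1, 1, 1]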
Now I'd like to introduce some more clustering algorithms but there are a few things that we'll need first.

Distance Relationships

A key component in many clustering algorithms is the distance matrix whose elements are the distances between elements of the data. For example, if the data are a set of \(n\) vectors \(\mathbf{x}_i\) then we could define an \(n\) by \(n\) distance matrix \(\mathbf{M}\) with the elements
\[ m_{ij} = \left|\mathbf{x}_i - \mathbf{x}_j\right| \]
where the vertical bars stand for the magnitude of the vector between them. Note that we're not required to define distance in this way but we will typically expect its definition to have similar properties, such as
\[ \begin{align*} m_{ii} &= 0\\ m_{ij} &\geqslant 0\\ m_{ji} &= m_{ij} \end{align*} \]
and sometimes even that it satisfies the triangle inequality, which states that for a pair of vectors \(\mathbf{x}\) and \(\mathbf{y}\)
\[ \left|\mathbf{x}+\mathbf{y}\right| \leqslant \left|\mathbf{x}\right|+\left|\mathbf{y}\right| \]
so that
\[ \begin{align*} \left|\mathbf{x}_i-\mathbf{x}_j\right| &= \left|\left(\mathbf{x}_i-\mathbf{x}_k\right)+\left(\mathbf{x}_k-\mathbf{x}_j\right)\right| \leqslant \left|\mathbf{x}_i-\mathbf{x}_k\right|+\left|\mathbf{x}_k-\mathbf{x}_j\right|\\ m_{ij} &\leqslant m_{ik} + m_{kj} \end{align*} \]
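For example, the three two dimensional vectors
\[ \mathbf{x}_0 = (0, 0) \quad \mathbf{x}_1 = (3, 4) \quad \mathbf{x}_2 = (0, 8) \]
have the distance matrix
\[ \mathbf{M} = \begin{pmatrix} 0 & 5 & 8\\ 5 & 0 & 5\\ 8 & 5 & 0 \end{pmatrix} \]
which has zeroes on its diagonal, is symmetric and satisfies the triangle inequality since, for instance, \(m_{02} = 8\) is no greater than \(m_{01} + m_{12} = 10\).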
We can trivially generalise this concept to capture the distances between a set of \(n\) vectors \(\mathbf{x}_i\) and another set of \(m\) vectors \(\mathbf{y}_j\) with an \(n\) by \(m\) matrix having the elements
\[ m_{ij} = \left|\mathbf{x}_i - \mathbf{y}_j\right| \]
in which case our only expectation is that they are non-negative.

To define a distance matrix for sets of objects other than vectors we simply need to define a function \(f\) that measures the distance between a pair of them and set the elements to
\[ m_{ij} = f\left(x_i, y_j\right) \]
as done by the ak.distanceMatrix function defined in listing 1.

Listing 1: ak.distanceMatrix
ak.distanceMatrix = function(lhs, rhs, dist) {
 var lt = ak.type(lhs);
 var rt = ak.type(rhs);
 var lat, rat, rows, cols, a, irc, icr, row, col, d;

 if(lt!==ak.ARRAY_T && lt!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid lhs in ak.distanceMatrix');
 }

 if(rt!==ak.ARRAY_T && rt!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid rhs in ak.distanceMatrix');
 }

 if(ak.nativeType(dist)===ak.UNDEFINED_T) dist = ak.dist;
 else if(ak.nativeType(dist)!==ak.FUNCTION_T) {
  throw new Error('invalid distance measure in ak.distanceMatrix');
 }

 lat = lt===ak.ARRAY_T ? function(i){return lhs[i];} : lhs.at;
 rat = rt===ak.ARRAY_T ? function(i){return rhs[i];} : rhs.at;

 rows = lt===ak.ARRAY_T ? lhs.length : lhs.size();
 cols = rt===ak.ARRAY_T ? rhs.length : rhs.size();
 a = new Array(rows*cols);

 if(lhs===rhs) {
  irc = 0;
  for(row=0;row<rows;++row) {
   icr = irc;
   a[irc] = 0;
   for(col=row+1;col<cols;++col) {
    d = Number(dist(lat(row), rat(col)));
    if(d<0) throw new Error('invalid distance in ak.distanceMatrix');
    a[++irc] = a[icr+=cols] = d;
   }
   irc += row+2;
  }
 }
 else {
  irc = 0;
  for(row=0;row<rows;++row) {
   for(col=0;col<cols;++col) {
    d = Number(dist(lat(row), rat(col)));
    if(d<0) throw new Error('invalid distance in ak.distanceMatrix');
    a[irc++] = d;
   }
  }
 }
 return ak.matrix(rows, cols, a);
}

The first thing to note is that we're allowing the data to be contained in either JavaScript arrays or our ak.clusterData type and, if the distance function dist isn't defined, we default to our overloaded ak.dist function. Furthermore, since accessing elements of arrays and ak.clusterData objects takes different forms, we're defining the lat and rat functions to extract the elements of lhs and rhs respectively.
If lhs and rhs are the same array or ak.clusterData object then we're cutting down on the number of calls to dist by assuming that it meets our expectations that \(m_{ii}\) equals zero and that \(m_{ji}\) equals \(m_{ij}\). Since we're initialising the ak.matrix object that the function returns with an array a containing its elements in row-wise order, this means that a[c*cols+r] should equal a[r*cols+c]. Note that instead of indexing those elements with c*cols+r and r*cols+c, we're iterating over them with icr and irc by adding cols to the former and incrementing the latter at each step, adding row+2 to irc at the end to jump to the diagonal element of the next row. We're also explicitly setting the diagonal elements of the distance matrix to zero and throwing an error if any of the results of dist are negative.
Otherwise, the only expectation is that the elements should be non-negative and so we must calculate every element directly, again throwing an error if any of them are negative.
Now if you look carefully then you'll notice that we're not treating NaN as an error when checking that the results of dist are non-negative; this is deliberate, since it allows us to use NaN to indicate that a pair of elements are incomparable.

Program 1 demonstrates how to use ak.distanceMatrix, which can be found in DistanceMatrix.js, to calculate the distance matrix for a set of three vectors, represented as an array of ak.vector objects.

Program 1: A Distance Matrix
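A rough sketch of the kind of call that it makes, assuming that ak.vector accepts an array of elements and that the returned ak.matrix provides rows, cols and at methods, is

// the vectors from the worked example above
var x = [ak.vector([0, 0]), ak.vector([3, 4]), ak.vector([0, 8])];
var m = ak.distanceMatrix(x, x);    // defaults to the ak.dist distance measure
var row, col, line;

for(row=0;row<m.rows();++row) {
 line = '';
 for(col=0;col<m.cols();++col) line += m.at(row, col) + ' ';
 console.log(line);                 // 0 5 8, then 5 0 5, then 8 5 0
}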

Everybody Needs Near Neighbours

Another useful concept in clustering is that of the nearest neighbours of a datum; the subset of a set of data that are closest to it, typically arranged in order of increasing distance. To find it we simply associate the indices of the data with their distances to the datum and then sort them by those distances.
The ak.nearestNeighbours function defined in listing 2 finds the \(k\) nearest neighbours of each element from one set of data \(S_0\) in another \(S_1\), using ak.distanceMatrix to represent the distances between them.

Listing 2: ak.nearestNeighbours
function compare(ix0, ix1) {return ak.floatCompare(ix0[1], ix1[1]);}

ak.nearestNeighbours = function(a0, a1, a2, a3) {
 var t0 = ak.type(a0);
 var distances, k, rows, cols, row, col, a;

 if(t0===ak.MATRIX_T) {
  distances = a0;
  k = a1;
 }
 else if(ak.nativeType(a2)===ak.NUMBER_T) {
  distances = ak.distanceMatrix(a0, a1, a3);
  k = a2;
 }
 else {
  distances = ak.distanceMatrix(a0, a1, a2);
 }

 rows = distances.rows();
 cols = distances.cols();

 if(ak.nativeType(k)===ak.UNDEFINED_T) {
  k = cols;
 }
 else if(ak.nativeType(k)!==ak.NUMBER_T || ak.floor(k)!==k || k<0 || k>cols) {
  throw new Error('invalid neighbourhood size in ak.nearestNeighbours');
 }

 distances = distances.toArray();
 for(row=0;row<rows;++row) {
  a = distances[row];
  for(col=0;col<cols;++col) a[col] = [col, a[col]];
  ak.partialSort(a, k, compare);
  a.length = k;
  for(col=0;col<k;++col) a[col] = a[col][0];
 }
 return distances;
};

The least clear thing about this function is the way that it handles its arguments, so it bears some explanation.
If the first argument a0 is an ak.matrix object then we assume that it's the distance matrix from \(S_0\) to \(S_1\) and that a1, if defined, is \(k\).
If a2 is a number then we assume that it's \(k\), that a0 and a1 are arrays or ak.clusterData objects containing the elements of \(S_0\) and \(S_1\) and that a3, if defined, is a function to calculate the distances between the elements of the former and those of the latter. We then use ak.distanceMatrix to calculate the distance matrix between \(S_0\) and \(S_1\), relying upon it to default to the distance measure of ak.dist if the function is undefined.
Otherwise we assume that a0 is an array of the elements of \(S_0\), that a1 is an array of the elements of \(S_1\) and that a2, if defined, is a distance function, again using these to calculate a distance matrix between the former and the latter.
Finally, if \(k\) is undefined then we set it to cols, which is the size of \(S_1\), to rank every element of \(S_1\) as a neighbour of each element of \(S_0\), otherwise throwing an error if it's not a non-negative integer no greater than that size.
We can summarise all of this for a distance matrix m, a neighbourhood size k, data sets s0 and s1 and a distance function f with
ak.nearestNeighbours(m, k);
ak.nearestNeighbours(m);            //k = m.cols();
ak.nearestNeighbours(s0, s1, k);    //m = distanceMatrix(s0, s1);
ak.nearestNeighbours(s0, s1);       //m = distanceMatrix(s0, s1); k = m.cols();
ak.nearestNeighbours(s0, s1, k, f); //m = distanceMatrix(s0, s1, f);
ak.nearestNeighbours(s0, s1, f);    //m = distanceMatrix(s0, s1, f); k = m.cols();

Once we have the distance matrix we simply convert it to an array with toArray and associate each distance in each row with its index, then partially sort them with respect to their distances with ak.partialSort using compare, truncate them to a length of \(k\) and finally set their elements to the ranked indices.
Note that the compare function is using ak.floatCompare so that NaNs are treated as greater than any other numbers and consequently incomparable elements are ranked last.

Program 2 uses ak.nearestNeighbours, defined in NearestNeighbours.js, to find the three nearest neighbours of each vector in x in the vectors in y.

Program 2: A Nearest Neighbour Table
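A call of that kind, using a handful of made up vectors, might look something like

// a rough sketch along the lines of Program 2
var x = [ak.vector([0, 0]), ak.vector([2, 1])];
var y = [ak.vector([0, 1]), ak.vector([2, 2]), ak.vector([0, 0]), ak.vector([5, 5])];

// each row holds the indices into y of the three nearest neighbours of the
// corresponding element of x, arranged in order of increasing distance
var nn = ak.nearestNeighbours(x, y, 3);
console.log(nn[0]); // [2, 0, 1]
console.log(nn[1]); // [1, 0, 2]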

Class Divisions

If we have a clustering of a set of data then we can classify a datum using its nearest neighbours in them. Specifically, if \(k_i\) of its \(k\) nearest neighbours are members of the \(i^\mathrm{th}\) cluster then we can use \(p_i=\frac{k_i}{k}\) as a measure of how strongly it's associated with that cluster.
For example, given the clusters
\[ \begin{align*} C_0 &= \{1,2,5,6,11,12,13,15\}\\ C_1 &= \{3,7,9,10,14,16,18\}\\ C_2 &= \{0,4,8,17,19\} \end{align*} \]
and the five nearest neighbours of a datum
\[ n = [15,0,13,1,18] \]
then it has cluster association strengths of
\[ \begin{align*} p_0 &= \tfrac35\\ p_1 &= \tfrac15\\ p_2 &= \tfrac15 \end{align*} \]
and, by picking the greatest, we can classify it as belonging to the class of data represented by the first cluster.
This is known, somewhat unimaginatively, as the \(k\) nearest neighbours classification algorithm and is one of the earliest examples of machine learning. In fact, to this day machine learning classification algorithms typically output arrays of class association strengths, the greatest of which is used to identify the class of the input.
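Given such an array of association strengths we might pick out the class with something along the lines of

// a minimal sketch of identifying the class that has the greatest association
// strength, using the strengths from the example above
function strongestClass(classes) {
 var c = 0, i;
 for(i=1;i<classes.length;++i) if(classes[i]>classes[c]) c = i;
 return c;
}

strongestClass([3/5, 1/5, 1/5]); // 0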

We can generalise this algorithm to give greater importance to nearer neighbours by defining a set of neighbour weights \(w_j\) and calculating the datum's strength of association with the \(i^\mathrm{th}\) cluster by summing the weights of its nearest neighbours that are members of that cluster and dividing the result by the sum of all of the weights. In mathematical notation we can express this as
\[ p_i = \frac{\sum_{j=0}^{k-1} \mathbf{1}_{C_i}\left(n_j\right) \times w_j}{\sum_{j=0}^{k-1} w_j} \]
where \(\sum\) is the summation sign and \(\mathbf{1}_{C_i}\) is an indicator function, which is defined for a set \(X\) by
\[ \mathbf{1}_X(x) = \begin{cases} 1 & x \in X\\ 0 & x \notin X \end{cases} \]
where \(\in\) means in and \(\notin\) means not in, and so equals one if the \(j^\mathrm{th}\) nearest neighbour is a member of the \(i^\mathrm{th}\) cluster and zero otherwise.
For example, if we were to choose the weights
\[ w = \left[\tfrac12, \tfrac14, \tfrac18, \tfrac1{16}, \tfrac1{32}\right] \]
then we'd have
\[ \sum_{j=0}^4 w_j = \tfrac{31}{32} \]
and
\[ \begin{align*} p_0 &= \frac{1 \times \tfrac12 + 0 \times \tfrac14 + 1 \times \tfrac18 + 1 \times \tfrac1{16} + 0 \times \tfrac1{32}}{\tfrac{31}{32}} = \frac{1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 0 \times 1}{31} = \frac{22}{31}\\ p_1 &= \frac{0 \times \tfrac12 + 0 \times \tfrac14 + 0 \times \tfrac18 + 0 \times \tfrac1{16} + 1 \times \tfrac1{32}}{\tfrac{31}{32}} = \frac{0 \times 16 + 0 \times 8 + 0 \times 4 + 0 \times 2 + 1 \times 1}{31} = \frac1{31}\\ p_2 &= \frac{0 \times \tfrac12 + 1 \times \tfrac14 + 0 \times \tfrac18 + 0 \times \tfrac1{16} + 0 \times \tfrac1{32}}{\tfrac{31}{32}} = \frac{0 \times 16 + 1 \times 8 + 0 \times 4 + 0 \times 2 + 0 \times 1}{31} = \frac8{31} \end{align*} \]
Both variants of the algorithm are implemented by the ak.knnClasses function in listing 3.

Listing 3: ak.knnClasses
ak.knnClasses = function(candidates, arg1, arg2, arg3, arg4) {
 var t0 = ak.type(candidates);
 var sum = 0;
 var t1, data, clusters, memberships, weights, wi, neighbours, classes, n, i;

 if(t0!==ak.ARRAY_T && t0!==ak.CLUSTER_DATA_T) {
  throw new Error('invalid candidates in ak.knnClasses');
 }

 t1 = ak.type(arg1);
 if(t1===ak.ARRAY_T || t1===ak.CLUSTER_DATA_T) {
  data = arg1;
  if(ak.type(arg2)!==ak.CLUSTERING_T) {
   throw new Error('invalid clustering in ak.knnClasses');
  }
  clusters = arg2.clusters;
  memberships = arg2.memberships;
  arg2 = arg3; arg3 = arg4;
 }
 else if(t1===ak.CLUSTERING_T) {
  data = arg1.data;
  clusters = arg1.clusters;
  memberships = arg1.memberships;
 }
 else throw new Error('invalid clustering in ak.knnClasses');

 if(ak.nativeType(arg2)===ak.ARRAY_T) {
  weights = arg2;
  for(i=0;i<weights.length;++i) {
   wi = weights[i];
   if(ak.nativeType(wi)!==ak.NUMBER_T || wi<0 || !isFinite(wi)) {
    throw new Error('invalid weight in ak.knnClasses');
   }
   sum += wi;
  }
  if(!isFinite(sum)) throw new Error('non-finite weight in ak.knnClasses');
  arg2 = weights.length;
 }

 neighbours = ak.nearestNeighbours(candidates, data, arg2, arg3);
 n = t0===ak.ARRAY_T ? candidates.length : candidates.size();
 classes = new Array(n);

 if(sum>0) {
  for(i=0;i<n;++i) {
   classes[i]=classifyWeighted(neighbours[i],memberships,clusters.size(),weights,sum);
  }
 }
 else {
  for(i=0;i<n;++i) {
   classes[i]=classifyUniform(neighbours[i],memberships,clusters.size());
  }
 }
 return classes;
};

Once again, most of the code is dealing with the arguments and needs some explanation.
First of all, the candidates are expected to be an array or an ak.clusterData object containing the data to be classified.
Next, if arg1 is also an array or an ak.clusterData object then we assume that it's the clustered data that we'll use to classify the candidates and require that arg2 is its clustering, referencing its clusters and memberships objects before shifting arg3 and arg4 to the left.
Otherwise, we require it to be an ak.clustering and reference its data, clusters and memberships.
If arg2 is an array then we assume that it contains the weights, saving them as weights, calculating their sum and setting arg2 to their length.
Finally, we calculate the nearest neighbours of the candidates in the data with arg2 representing \(k\) and arg3 the distance function, if defined.
In summary, this means that we can classify a set of candidates s using data d, a clustering c, number of nearest neighbours k, weights w and distance function f with any of
ak.knnClasses(s, d, c, k, f); //n = ak.nearestNeighbours(s, d, k, f);
ak.knnClasses(s, d, c, k);    //n = ak.nearestNeighbours(s, d, k);
ak.knnClasses(s, d, c, w, f); //n = ak.nearestNeighbours(s, d, w.length, f);
ak.knnClasses(s, d, c, w);    //n = ak.nearestNeighbours(s, d, w.length);
ak.knnClasses(s, d, c, f);    //n = ak.nearestNeighbours(s, d, f);
ak.knnClasses(s, d, c);       //n = ak.nearestNeighbours(s, d);
ak.knnClasses(s, c, k, f);    //n = ak.nearestNeighbours(s, c.data(), k, f);
ak.knnClasses(s, c, k);       //n = ak.nearestNeighbours(s, c.data(), k);
ak.knnClasses(s, c, w, f);    //n = ak.nearestNeighbours(s, c.data(), w.length, f);
ak.knnClasses(s, c, w);       //n = ak.nearestNeighbours(s, c.data(), w.length);
ak.knnClasses(s, c, f);       //n = ak.nearestNeighbours(s, c.data(), f);
ak.knnClasses(s, c);          //n = ak.nearestNeighbours(s, c.data());

If sum equals zero then either all of the weights are equal to zero or they weren't specified and we use the basic version of the \(k\) nearest neighbours classification algorithm, implemented by classifyUniform in listing 4.

Listing 4: Uniform Classification
function classifyUniform(neighbours, memberships, n) {
 var k = neighbours.length;
 var classes = new Array(n);
 var scale = 1/k;
 var i;

 for(i=0;i<n;++i) classes[i] = 0;
 for(i=0;i<k;++i) classes[memberships.at(neighbours[i])] += scale;
 return classes;
}

Note that we're using the cluster memberships to identify which cluster each neighbour belongs to.
If sum is greater than zero then we use the generalised version instead, implemented by classifyWeighted in listing 5.

Listing 5: Weighted Classification
function classifyWeighted(neighbours, memberships, n, weights, sum) {
 var k = neighbours.length;
 var classes = new Array(n);
 var scale = 1/sum;
 var i;

 for(i=0;i<n;++i) classes[i] = 0;
 for(i=0;i<k;++i) classes[memberships.at(neighbours[i])] += weights[i]*scale;
 return classes;
}
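We can check the worked examples above against these functions with a makeshift stand-in for the memberships object that simply maps each datum's index to the index of its cluster

// the cluster indices of the twenty data in the clusters C0, C1 and C2 given above,
// wrapped in a makeshift substitute for the ak cluster memberships object
var members = [2,0,0,1,2,0,0,1,2,1,1,0,0,0,1,0,1,2,1,2];
var memberships = {at: function(i){return members[i];}};

var neighbours = [15,0,13,1,18];
var weights = [1/2,1/4,1/8,1/16,1/32];

classifyUniform(neighbours, memberships, 3);                  // approximately [3/5, 1/5, 1/5]
classifyWeighted(neighbours, memberships, 3, weights, 31/32); // approximately [22/31, 1/31, 8/31]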

Program 3 shows how ak.knnClasses classifies points in the plane, without weights, with respect to three clusters generated by our ak.kMeansClustering function. The points in the clusters are drawn in red, green and blue and each classified point is given a mix of those colours determined by its strengths of association with the clusters, exaggerated by taking their square roots.

Program 3: Uniform Classification Of The Plane
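A rough sketch of that kind of colour mixing, assuming that each point's association strengths sum to one, is

// an illustrative mixing of red, green and blue according to the square roots of the
// strengths of association with the first, second and third clusters respectively
function mixColour(p) {
 var r = Math.round(255*Math.sqrt(p[0]));
 var g = Math.round(255*Math.sqrt(p[1]));
 var b = Math.round(255*Math.sqrt(p[2]));
 return 'rgb(' + r + ',' + g + ',' + b + ')';
}

mixColour([22/31, 1/31, 8/31]); // 'rgb(215,46,130)'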

The blended colours of the points that lie between the clusters and the monochrome colours of the points that are close to them clearly show how the \(k\) nearest neighbours algorithm is making associations.
Note that you can add new points to the clusters by clicking on the chart area and clear them by clicking the button to the left of the reset button.

In contrast, program 4 shows how a weighting scheme like the one we used in the example of the generalised algorithm makes the associations with the clusters much stronger.

Program 4: Weighted Classification Of The Plane

You can find the implementation of ak.knnClasses in KNNClasses.js and in the next post we shall take a look at a way that we can use nearest neighbours to generate clusterings.

References

[1] K Means Business, www.thusspakeak.com, 2015.
