Let's Talk About Sets

| 0 Comments

In the last couple of posts we have seen various ways to partially or fully sort data[1] and the kinds of queries that we can run against them once they have been[2]. Such query operations make fully sorted arrays a convenient way to represent sets, or more accurately multisets which treat repeated elements as distinct from each other, and in this post we shall exploit this fact to implement some operations that we might wish to perform upon them.

A Few Become One

The first thing that we can do is transform a sorted range representing a multiset into one representing a set by removing any elements that compare equal to any others. Since the range is sorted we can do this by first copying the first element after the one at the start of the range that compares unequal to it into the position immediately after it, then repeating this operation for the, possibly new, second element, then the third and so on until there are no more elements in the range to test. Finally, the position following the last into which an element was copied marks the end of the transformed range.

Now we shan't actually need to perform full comparisons between elements since we're only interested in equality and it will therefore be convenient to have at our disposal boolean equality operators that are consistent with the comparison operators that we implemented in Compare.js, as given in listing 1 and defined in Equal.js.

Listing 1: Our Equality Operators
ak.alphaEqual = function(l, r) {
 return String(l)===String(r);
};

ak.numberEqual = function(l, r) {
 return Number(l)===Number(r);
};

ak.floatEqual = function(l, r) {
 l = Number(l);
 r = Number(r);
 return l===r || (isNaN(l) && isNaN(r));
};

Specifically, we have ak.alphaEqual that treats two elements as equal if their string representations are, in keeping with the default behaviour of Array.sort, ak.numberEqual that does so if they compare equal after being coerced to numbers and ak.floatEqual that does the same as the latter except that it treats NaNs as being equal to each other.

The ak.unique function implements the transformation of a sorted range from a multiset to a set, as shown in listing 2.

Listing 2: ak.unique
ak.unique = function(a, pred, start, end) {
 var pos;

 if(ak.nativeType(pred)===ak.UNDEFINED_T) {
  pred = ak.alphaEqual;
 }
 else if(ak.nativeType(pred)!==ak.FUNCTION_T) {
  throw new Error('invalid predicate in ak.unique');
 }

 start = ak.arrayIndex(a, start, 'ak.unique');
 end = ak.arrayIndex(a, end, 'ak.unique');

 if(ak.nativeType(start)===ak.UNDEFINED_T) start = 0;
 if(ak.nativeType(end)===ak.UNDEFINED_T) end = a.length;

 if(start>=end) return start===end ? start : -1;

 pos = start;
 while(++start<end) if(!pred(a[start],a[pos])) a[++pos] = a[start];
 return ++pos;
};

As usual, we begin by checking that the arguments are valid and providing defaults for those that the user hasn't provided. Specifically, we use ak.alphaEqual as the default equality operator so that we're consistent with the default comparison operator when sorting, and we set set the start and end of the range, being the position immediately after its last element, to the start and end of the array if they aren't specified and, if they are, using our ak.arrayIndex function to follow the Array.slice conventions of reading negative indices backwards from the end of the array and truncating them to the start and end of the array should they come before or after the latter respectively.
Note that we return minus one if the range is invalid, having its start after its end, rather than throw an exception since Array.slice doesn't throw one when passed such bounds and Array.indexOf uses minus one to mean not found, and so this struck me as the most consistent way to handle them.
Program 1 demonstrates this transformation by applying it to sorted ranges of integers, marking the end of the set with a vertical bar rather than a comma after it's done so.

Program 1: Transforming Multisets Into Sets

Note that if we apply ak.unique to unsorted ranges it will remove all but the first of any runs of equal elements within them.

A Merge Or Three

One of the most basic operations that we can perform on a pair of sorted ranges is to merge them into a single sorted array containing every element from both of them. Since the ranges are sorted, we can do this in a single pass by keeping indices into both of them, appending the element indexed in the first to the, initially empty, result if it is not greater than that in the second and then the element indexed in the second if it is not greater than that in the first, incrementing the indices from which we appended elements to the result. Finally, if one of the ranges still has any elements that have not been appended to the result once all in the other have been, they must be appended to it.
For example, marking indexed positions with a bar, using a plus sign to denote the merge and an arrow to point to the state of the result, the steps taken to merge the multisets \(\{1,2,2,3\}\) and \(\{2,3,3,4\}\) are
\[ \begin{align*} \{\bar{1},2,2,3\} &+ \{\bar{2},3,3,4\} \to \{1\}\\ \{1,\bar{2},2,3\} &+ \{\bar{2},3,3,4\} \to \{1,2,2\}\\ \{1,2,\bar{2},3\} &+ \{2,\bar{3},3,4\} \to \{1,2,2,2\}\\ \{1,2,2,\bar{3}\} &+ \{2,\bar{3},3,4\} \to \{1,2,2,2,3,3\}\\ \{1,2,2,3\bar{\}} &+ \{2,3,\bar{3},4\} \to \{1,2,2,2,3,3,3,4\} \end{align*} \]
This algorithm is implemented by the ak.merge function given in listing 3.

Listing 3: ak.merge
ak.merge = function(a1, a2, compare, start1, end1, start2, end2) {
 var n, out, i, c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.merge');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.merge');
 end1 = ak.arrayIndex(a1, end1, 'ak.merge');

 start2 = ak.arrayIndex(a2, start2, 'ak.merge');
 end2 = ak.arrayIndex(a2, end2, 'ak.merge');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 n = Math.max(end1-start1,0) + Math.max(end2-start2,0);
 out = new Array(n);

 i = 0;
 while(start1<end1 && start2<end2) {
  c = compare(a1[start1], a2[start2]);
  if(!(c>0)) out[i++] = a1[start1++];
  if(!(c<0)) out[i++] = a2[start2++];
 }

 if(start1<end1) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a1.slice(start1, end1)));
 }
 else if(start2<end2) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a2.slice(start2, end2)));
 }
 return out;
};

After the usual boilerplate code to check the validity of the arguments and provide defaults for those that aren't provided we create an array that is equal in size to the total number of elements in both ranges, treating invalid ranges as empty. Once the iterative part of the algorithm is complete we splice in any remaining elements from the one or the other of the ranges should there be any, replacing the elements that had not been copied to.
Program 2 demonstrates this function, defined in Merge.js, by using it to merge pairs of sorted arrays of random integers.

Program 2: Merging Multisets

We can use merging to recursively sort an array by splitting it in two, sorting the parts and merging them back into a single sorted array. Since arrays having zero or one elements are always sorted, we can simply make copies of them to yield new sorted arrays, terminating the recursion. Program 3 demonstrates this algorithm with the sort function.

Program 3: Sorting By Merging

Note that whilst this has the same \(O(n \ln n)\) average case computational complexity as the quicksort algorithm, it doesn't sort the array in-place and so requires an additional \(O(n \ln n)\) storage space, although this can be reduced to \(O(n)\) with the careful use of a pair of temporary buffers. It does have a couple of things in its favour, however; firstly, its best and worst case complexities are both equal to its average case complexity, unlike quicksort which has a worst case complexity of \(O\left(n^2\right)\), and secondly, it doesn't require random access to the elements. The latter property means that it can be modified to efficiently sort extremely large datasets using streaming disc storage rather than random access memory, albeit terminating the recursion once the number of elements is small enough to copy into an array which can then be sorted in-place.

Operations? High Time!

We can easily generalise set operations to multisets by augmenting each element with the number of times that it has appeared to its left in a list of its elements, treating differently augmented elements as distinct and finally removing those augmentations from the result.
For example, the union of the multisets \(\{2, 1, 3, 2\}\) and \(\{4, 2, 3, 3\}\) is
\[ \{2, 1, 3, 2\} \cup \{4, 2, 3, 3\} = \{2_0, 1_0, 3_0, 2_1\} \cup \{4_0, 2_0, 3_0, 3_1\} = \{2_0, 1_0, 3_0, 2_1, 4_0, 3_1\} = \{2, 1, 3, 2, 4, 3\} \]
and their intersection is
\[ \{2, 1, 3, 2\} \cap \{4, 2, 3, 3\} = \{2_0, 1_0, 3_0, 2_1\} \cap \{4_0, 2_0, 3_0, 3_1\} = \{2_0, 3_0\} = \{2, 3\} \]
Now we don't actually need to augment the elements of our multisets when implementing set operations since we can exploit the fact that we represent them with sorted ranges to make sure that their results will be equivalent to those that we'd have had if we did use augmentation.
For example, we can adapt our merge algorithm to generate the union by only appending one of the multisets' elements to the result when they compare equal, but incrementing both of their indices nevertheless
\[ \begin{align*} \{\bar{1},2,2,3\} &\cup \{\bar{2},3,3,4\} \to \{1\}\\ \{1,\bar{2},2,3\} &\cup \{\bar{2},3,3,4\} \to \{1,2\}\\ \{1,2,\bar{2},3\} &\cup \{2,\bar{3},3,4\} \to \{1,2,2\}\\ \{1,2,2,\bar{3}\} &\cup \{2,\bar{3},3,4\} \to \{1,2,2,3\}\\ \{1,2,2,3\bar{\}} &\cup \{2,3,\bar{3},4\} \to \{1,2,2,3,3,4\} \end{align*} \]
The ak.setUnion function given in listing 4 and implemented in SetUnion.js makes this change to the main loop from ak.merge.

Listing 4: ak.setUnion
ak.setUnion = function(a1, a2, compare, start1, end1, start2, end2) {
 var n, out, i, c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.setUnion');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.setUnion');
 end1 = ak.arrayIndex(a1, end1, 'ak.setUnion');

 start2 = ak.arrayIndex(a2, start2, 'ak.setUnion');
 end2 = ak.arrayIndex(a2, end2, 'ak.setUnion');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 n = Math.max(end1-start1, 0) + Math.max(end2-start2, 0);
 out = new Array(n);

 i = 0;
 while(start1<end1 && start2<end2) {
  c = compare(a1[start1], a2[start2]);
  if(c<0)      out[i++] = a1[start1++];
  else if(c>0) out[i++] = a2[start2++];
  else        {out[i++] = a1[start1++]; ++start2;}
 }

 if(start1<end1) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a1.slice(start1, end1)));
 }
 else if(start2<end2) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a2.slice(start2, end2)));
 }
 else out.length = i;
 return out;
};

Note that it is possible, even likely, that the result will have some elements that have not been copied to even if we have iterated over every element in both ranges and so we must be sure to truncate it to the length of the union, given by i, should that be the case.
Program 4 shows the results of using this function to calculate the unions of pairs of sorted arrays of random integers.

Program 4: Unions Of Multisets

To generate the intersection of two sorted ranges, we again iterate over indices into both of them, incrementing the first if the element it points to compares less than that the second points to, incrementing the second if it compares greater and only appending the first to the result if they compare equal, after which we should increment both indices, as done by ak.setIntersection in listing 5.

Listing 5: ak.setIntersection
ak.setIntersection = function(a1, a2, compare, start1, end1, start2, end2) {
 var n, out, i, c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.setIntersection');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.setIntersection');
 end1 = ak.arrayIndex(a1, end1, 'ak.setIntersection');

 start2 = ak.arrayIndex(a2, start2, 'ak.setIntersection');
 end2 = ak.arrayIndex(a2, end2, 'ak.setIntersection');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 n = Math.min(end1-start1, end2-start2);
 out = new Array(Math.max(n, 0));

 i = 0;
 while(start1<end1 && start2<end2) {
  c = compare(a1[start1], a2[start2]);
  if(c<0)      ++start1;
  else if(c>0) ++start2;
  else        {out[i++] = a1[start1++]; ++start2;}
 }
 out.length = i;
 return out;
};

Note that this time we create an array that is the same length as the shorter of the two ranges to store the result since the intersection cannot be larger than that, finally truncating it to the actual length of the intersection. Program 5 demonstrates this function, implemented in SetIntersection.js, by generating the intersections of pairs of sorted randomly generated arrays.

Program 5: Intersections Of Multisets

Another fundamental operation is the difference between two sets or multisets \(A\) and \(B\) which is the subset of \(A\) that does not include elements from \(B\) and is usually written as \(A \backslash B\), although I prefer \(A - B\). For example
\[ \{2, 1, 3, 2\} - \{4, 2, 3, 3\} = \{1, 2\} \]
Note that, since these are multisets, the result contains as many more of each element that are in the first set than are in the second. Related to this is the symmetric difference which gives the elements that are in one or another set or multiset but are not in both and may be written as \(A \ominus B\). We can define this in terms of the difference with
\[ A \ominus B = (A - B) \cup (B - A) \]
For example
\[ \begin{align*} \{2, 1, 3, 2\} \ominus \{4, 2, 3, 3\} &= \left(\{2, 1, 3, 2\} - \{4, 2, 3, 3\}\right) \cup \left(\{4, 2, 3, 3\} - \{2, 1, 3, 2\}\right)\\ &= \{1, 2\} \cup \{4, 3\}\\ &= \{1, 2, 4, 3\} \end{align*} \]
Once again, we can exploit the fact that we represent multisets with sorted ranges to efficiently implement this operation. This time, if the indexed element on the left hand side compares less than that on the right hand side then we append it to the result and increment its index. Next, if they compare equal then we leave the result as it is and increment both indices, effectively subtracting the element on the right hand side from the multiset on the left. Then, if it compares greater then we increment the right hand index since its element is therefore not a member of the left hand multiset. Finally, if we reach the end of the right hand multiset before that of the left then we must append all of the remaining elements in the latter to the result.
For example
\[ \begin{align*} \{\bar{1},2,2,3\} &- \{\bar{2},3,3,4\} \to \{1\}\\ \{1,\bar{2},2,3\} &- \{\bar{2},3,3,4\} \to \{1\}\\ \{1,2,\bar{2},3\} &- \{2,\bar{3},3,4\} \to \{1,2\}\\ \{1,2,2,\bar{3}\} &- \{2,\bar{3},3,4\} \to \{1,2\}\\ \{1,2,2,3\bar{\}} &- \{2,3,\bar{3},4\} \to \{1,2\} \end{align*} \]
and
\[ \begin{align*} \{\bar{2},3,3,4\} &- \{\bar{1},2,2,3\} \to \{\}\\ \{\bar{2},3,3,4\} &- \{1,\bar{2},2,3\} \to \{\}\\ \{2,\bar{3},3,4\} &- \{1,2,\bar{2},3\} \to \{\}\\ \{2,\bar{3},3,4\} &- \{1,2,2,\bar{3}\} \to \{\}\\ \{2,3,\bar{3},4\} &- \{1,2,2,3\bar{\}} \to \{3,4\} \end{align*} \]
The ak.setDifference function given in listing 6 and defined in SetDifference.js implements this algorithm, initialising the resulting array to the length of the left hand range since it clearly cannot be any larger than it.

Listing 6: ak.setDifference
ak.setDifference = function(a1, a2, compare, start1, end1, start2, end2) {
 var n, out, i, c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.setDifference');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.setDifference');
 end1 = ak.arrayIndex(a1, end1, 'ak.setDifference');

 start2 = ak.arrayIndex(a2, start2, 'ak.setDifference');
 end2 = ak.arrayIndex(a2, end2, 'ak.setDifference');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 n = Math.max(end1-start1, 0)
 out = new Array(n);

 i = 0;
 while(start1<end1 && start2<end2) {
  c = compare(a1[start1], a2[start2]);
  if(c<0)      out[i++] = a1[start1++];
  else if(c>0) ++start2;
  else        {++start1; ++start2;}
 }

 if(start1<end1) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a1.slice(start1, end1)));
 }
 else {
  out.length = i;
 }
 return out;
};

As usual, program 6 demonstrates this function by applying it to sorted arrays of randomly generated integers.

Program 6: Difference Between Multisets

To adapt this algorithm to the symmetric difference we simply append the element indexed on the right hand side to the result before incrementing its index when that on the left hand side compares greater than it, making sure that we also append every remaining element on the right hand side to the result if we reach the end of the left hand side first, as done by ak.symmetricDifference in listing 7.

Listing 7: ak.setSymmetricDifference
ak.setSymmetricDifference = function(a1, a2, compare, start1, end1, start2, end2) {
 var n, out, i, c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.setSymmetricDifference');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.setSymmetricDifference');
 end1 = ak.arrayIndex(a1, end1, 'ak.setSymmetricDifference');

 start2 = ak.arrayIndex(a2, start2, 'ak.setSymmetricDifference');
 end2 = ak.arrayIndex(a2, end2, 'ak.setSymmetricDifference');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 n = Math.max(end1-start1,0) + Math.max(end2-start2,0);
 out = new Array(n);

 i = 0;
 while(start1<end1 && start2<end2) {
  c = compare(a1[start1], a2[start2]);
  if(c<0)      out[i++] = a1[start1++];
  else if(c>0) out[i++] = a2[start2++];
  else        {++start1; ++start2;}
 }

 if(start1<end1) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a1.slice(start1, end1)));
 }
 else if(start2<end2) {
  Array.prototype.splice.apply(out, [i, n-i].concat(a2.slice(start2, end2)));
 }
 else {
  out.length = i;
 }
 return out;
};

Note that this time we initialise the length of the result to the sum of the lengths of the two ranges since it will contain that many elements if the two ranges have none in common.
Program 7 demonstrates this function, implemented in SetSymmetricDifference.js, in the same way that we have for the previous operations.

Program 7: Symmetric Difference Between Multisets

The last set operation that we shall implement is determining whether one set or multiset contains every element within another, or in other words whether the latter is a subset of the former. We can do this by again adapting the difference operation; if we reach indices such that the left hand element compares greater than the right hand element, or if we reach the end of the left hand range before we reach the end of the right hand one, then the former does not contain the latter, otherwise it does.
This is precisely the approach taken by ak.includes in listing 8, as defined in Includes.js.

Listing 8: ak.includes
ak.includes = function(a1, a2, compare, start1, end1, start2, end2) {
 var c;

 if(ak.nativeType(compare)===ak.UNDEFINED_T) {
  compare = ak.alphaCompare;
 }
 else if(ak.nativeType(compare)!==ak.FUNCTION_T) {
  throw new Error('invalid comparator in ak.includes');
 }

 start1 = ak.arrayIndex(a1, start1, 'ak.includes');
 end1 = ak.arrayIndex(a1, end1, 'ak.includes');

 start2 = ak.arrayIndex(a2, start2, 'ak.includes');
 end2 = ak.arrayIndex(a2, end2, 'ak.includes');

 if(ak.nativeType(start1)===ak.UNDEFINED_T) start1 = 0;
 if(ak.nativeType(end1)===ak.UNDEFINED_T) end1 = a1.length;

 if(ak.nativeType(start2)===ak.UNDEFINED_T) start2 = 0;
 if(ak.nativeType(end2)===ak.UNDEFINED_T) end2 = a2.length;

 while(start1<end1 && start2<end2 && !((c=compare(a1[start1],a2[start2]))>0)) {
  ++start1;
  if(!c) ++start2;
 }
 return start1<=end1 && start2===end2;
};

Now I must admit that this doesn't look much like the original operation and so I think that it could bear some explanation. Let's begin with the implementation of the set difference algorithm
i = 0; while(start1<end1 && start2<end2) { c = compare(a1[start1], a2[start2]); if(c<0) out[i++] = a1[start1++]; else if(c>0) ++start2; else {++start1; ++start2;} } if(start1<end1) Array.prototype.splice.apply(out, [i, n-i].concat(a1.slice(start1, end1))); else out.length = i; return out;

Since we're only interested in whether or not we reach the end of the second range without finding an element in it that compares less than the currently indexed element in the first, we can remove all of the references to the out array to yield
while(start1<end1 && start2<end2) { c = compare(a1[start1], a2[start2]); if(c<0) ++start1; else if(c>0) return false; else {++start1; ++start2;} } return start1<=end1 && start2===end2;

Here we immediately return false if we find such an element and ensure that the result will be false if either range is invalid with the final return statement.
Now we always increment the first index if we don't return from within the loop and so we can rearrange this into
while(start1<end1 && start2<end2) { c = compare(a1[start1], a2[start2]); if(c>0) return false; ++start1; if(!c) ++start2; } return start1<=end1 && start2===end2;

Next, we can factor out the early return by terminating the loop if we hit its condition with
while(start1<end1 && start2<end2 && !(compare(a1[start1],a2[start2])>0)) { ++start1; if(!compare(a1[start1],a2[start2])) ++start2; } return start1<=end1 && start2===end2;

so that the second index will not point to the end of the second range if it does.
Finally, we can avoid repeating the comparison in the body of the loop by saving its result from the condition with
while(start1<end1 && start2<end2 && !((c=compare(a1[start1],a2[start2]))>0)) { ++start1; if(!c) ++start2; } return start1<=end1 && start2===end2;

Program 8 demonstrates this function by using it to check whether or not a sorted array of random integers contains a shorter one.

Program 8: Multisets Including Multisets

And with that we are done, at least for now, with translating functions from the C++ standard algorithm library into JavaScript.

References

[1] We're All Sorted From A To Z, www.thusspakeak.com, 2017.

[2] I Still Haven't Found What I'm Looking For, www.thusspakeak.com, 2017.

Leave a comment