You’re Going To Have To Think! Why Arbitrary Precision Arithmetic Won’t Cure Your Floating Point Blues

| 0 Comments

Fixed point arithmetic's susceptibilities to rounding error and overflow both stem from the limited number of decimal digits available to represent them, as discussed in the previous post[1]. A common solution to this problem is to use arbitrary precision integers, also known as bignums, to represent the digits.
These are typically implemented using an array of integers which can grow as needed, which is more or less the approach I've taken in ak.bignum, given in listing 1.

Listing 1: ak.bignum
ak.BIGNUM_T = 'ak.bignum';

var BASE = ak.DEC_MAX+1;

function Bignum(){}
Bignum.prototype = {TYPE: ak.BIGNUM_T, valueOf: function(){return ak.NaN;}};

ak.bignum = function() {
 var b     = new Bignum();
 var arg0  = arguments[0];
 var state = [];

 constructors[ak.nativeType(arg0)](state, arg0, arguments);
 normalise(state);

 b.size = function()  {return state.length;};
 b.at   = function(i) {return state[i];};

 b.toArray  = function() {return state.slice(0);};
 b.toNumber = function() {return toNumber(state);};
 b.toString = function() {return toString(state);};

 return Object.freeze(b);
};

var constructors = {};

Here the state is that array of integers or, more accurately, the closest thing to it that JavaScript has to offer.
Each integer element of the array represents a single digit of the bignum, albeit expressed in a base much much larger than those we are accustomed to. In the context of decimal fixed point numbers it's convenient to make that base a power of 10, so I have chosen ak.DEC_MAX+1 or 1015, as stored in the BASE variable.

As with ak.fixed, object creation involves initialisation via dynamic dispatch to a constructors object, followed by validation of state with normalise, dynamic addition of type conversion and member access methods and finally freezing it so that it can't be changed.

ak.bignum Constructors

The ak.bignum constructors are a little more complex than those of ak.fixed so it's worth dealing with them one at a time.
First of all we deal with initialisation with arrays, as shown in listing 2.

Listing 2: ak.bignum Array Constructors
constructors[ak.ARRAY_T] = function(state, arr, args) {
 var n = arr.length;
 var arg1 = args[1];
 var i;

 for(i=0;i<n;++i) state[i] = ak.trunc(arr[i]);

 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arg1);
};

constructors[ak.ARRAY_T][ak.NUMBER_T] = function(state, s) {
 var n, i;

 if(s<0) {
  n = state.length;
  for(i=0;i<n && state[i]===0;++i);
  if(i<n) state[i] *= -1;
 }
};

constructors[ak.ARRAY_T][ak.UNDEFINED_T] = function() {
};

So the array constructor simply fills the state with truncated copies of the elements in the argument. If any of those elements are not numbers, the state will contain NaNs which we will deal with when we validate it.
If we pass an optional second numeric argument to the constructor and if it is negative, then the first non-zero element of the state, if any, will be negated.

Next up is initialisation with numbers, as illustrated by listing 3.

Listing 3: ak.bignum Number Constructor
constructors[ak.NUMBER_T] = function(state, x) {
 var s = x<0 ? -1 : 1;
 x = ak.trunc(s*x);

 if(x<BASE) {
  state[0] = s*x;
 }
 else if(x<=ak.INT_MAX) {
  state[1] = x % BASE;
  state[0] = s * (x-state[1]) / BASE;
 }
 else {
  state[0] = s*x;
 }
};

We store the sign of the argument, which is then replaced it with its truncated absolute value. If the result can be represented by a single digit then that's what we do, after putting the sign back.
If it can't, but is still within the range of exact integers, we split the value into two digits and put the sign back into the first, most significant, digit.
Finally, if it is bigger than the maximum integer value we put the original value in the state to be dealt with during validation.

Construction from objects is provided by listing 4.

Listing 4: ak.bignum Object Constructor
constructors[ak.OBJECT_T] = function(state, obj) {
 var n = obj.size;
 var i;

 n = (ak.nativeType(n)===ak.FUNCTION_T) ? n = Number(n()) : Number(n);
 state.length = n;

 for(i=0;i<n;++i) state[i] = ak.trunc(obj.at(i));
};

Here, we assume that the object supports an at element access method, as will ak.bignum. We truncate the elements to ensure that the state only contains integers, but defer handling oversized values to the validation process.

Finally, since representing bignums with arrays is a little cumbersome, ak.bignum supports construction from strings, as shown in listing 5.

Listing 5: ak.bignum String Constructor
constructors[ak.STRING_T] = function(state, str) {
 var s = str.match(/^\s*([+-]?)\s*(\d{1,3}(?:,?\d{3})*)\s*$/);
 var n, c0, c1, i;

 if(s) {
  str = s[2].replace(/,/g, '');
  n = str.length;

  c0 = 0;
  c1 = n % ak.DEC_DIG;
  if(c1===0) c1 = ak.DEC_DIG;

  for(i=0;c0<n;++i) {
   state[i] = parseInt(str.substring(c0, c1), 10);

   c0  = c1;
   c1 += ak.DEC_DIG;
  }
  if(s[1]==='-') state[0] *= -1;
 }
 else {
  state[0] = ak.NaN; 
 }
};

This is most complex of the constructors, relying as it does upon Linear A regular expressions.
The regular expression in question strips leading whitespace, captures an optional plus or minus symbol, strips subsequent whitespace, captures one to three decimal digits followed by groups of three decimal digits optionally preceded by a comma and finally strips trailing whitespace, failing to match anything that doesn't have this number-like syntax. Or at least I very much hope it does.
If a match is made the commas are stripped from the digits string and it is chopped up into chunks that are converted to ak.bignum digits, the leading of which is negated if required. Otherwise the state is assigned a single NaN.

After construction the state is validated with the normalise function, given in listing 6.

Listing 6: ak.bignum normalise
function normalise(state) {
 var n = state.length;
 var i;

 if(n===1) {
  if(isFinite(state[0]) && Math.abs(state[0])>=BASE) state[0] = ak.NaN;
 }
 else if(n>0) {
  for(i=0;i<n && state[i]===0;++i);
  if(i>0) {
   state.splice(0, i);
   n -= i;
  }

  for(i=1;i<n && state[i]>=0 && state[i]<BASE;++i);
  if(i<n || (n>0 && !(Math.abs(state[0])<BASE))) {
   n = 1;
   state.length = n;
   state[0] = ak.NaN;
  }
 }

 if(n===0) state[0] = 0;
}

If the state has a single digit which is finite but too large to be represented by an ak.bignum digit, it is replaced with a NaN.
Otherwise, if there are multiple digits then leading zeros, if any, are stripped and if any remaining non-leading digits are negative or are not less than the base, or if the absolute value of the leading digit, if there is one, is not less than the base the state is again replaced with a NaN.
Finally, if there are no digits left, the state is assigned a single zero.

Note that comparisons of the form !(x<y), as used here, are typical for ensuring correct behaviour in the face of NaNs; if either x or y are NaNs, !(x<y) will return true whilst x>=y will return false. By carefully choosing which form of comparison we use, we can control whether NaNs pass or fail.

ak.bignum Methods

The size and at property accessors and the toArray type conversion are straightforward. The toNumber conversion is given in listing 7.

Listing 7: ak.bignum toNumber
function toNumber(state) {
 var n = state.length;
 var x = state[0];
 var s = x<0 ? -1 : 1;
 var i;

 x *= s;
 for(i=1;i<n;++i) {
  x *= BASE;
  x += state[i];
 }
 return s*x;
}

Once again we work with the absolute value for simplicity, saving the sign so we can put it back at the end of the calculation. The result, x, is initially set to the absolute value of the leading digit and, as we iterate over the trailing digits, multiplied by the bignum base before each is added, after which the sign is replaced. This is pretty much exactly how we read decimal numbers, just on a much bigger scale.
It should be noted that, since the conversion begins with the leading digit, it is possible that the result will be incorrectly rounded. This happens since as we add each digit we are implicitly treating it as if it were followed by zeros rather than the remaining digits. It would be more accurate to work from the smallest to the largest digit instead, but I don't think that this is a significant enough problem to warrant the additional complexity.

The toString function is provided by listing 8.

Listing 8: ak.bignum toString
function toString(state) {
 var n = state.length;
 var s = [];
 var i, t;

 s.push(state[0].toString());

 for(i=1;i<n;++i) {
  t = state[i]+BASE;
  s.push(t.toFixed(0).slice(1));
 }
 return s.join('');
}

Here we use the trick of adding the bignum base to each trailing digit to ensure that their leading zeros are captured in the string representation by removing the leading character that it introduces. We put these strings into an array which we finally concatenate to yield the result. The simplicity of this function was my main motivation for using a power of ten for the base.

ak.bignum Overloads

The simplest overloads are again ak.abs and ak.neg, given in listing 9.

Listing 9: ak.bignum abs And neg
function abs(b) {
 return b.at(0)<0 ? neg(b) : b;
}
ak.overload(ak.abs, ak.BIGNUM_T, abs);

function neg(b) {
 var a = b.toArray();
 a[0] *= -1;
 return ak.bignum(a);
}
ak.overload(ak.neg, ak.BIGNUM_T, neg);

Next up are the comparison operators, with eq and ne shown in listing 10.

Listing 10: ak.bignum eq And ne
function eq(b0, b1) {
 var n = b0.size();
 var i;

 if(b1.size()!==n) return false;
 for(i=0;i<n && b0.at(i)===b1.at(i);++i);
 return i===n;
}
ak.overload(ak.eq, [ak.BIGNUM_T, ak.BIGNUM_T], eq);

function ne(b0, b1) {
 return !eq(b0, b1);
}
ak.overload(ak.ne, [ak.BIGNUM_T, ak.BIGNUM_T], ne);

Trivially, if the digits are the same then the bignums are the equal, otherwise they are not equal. As ne is implemented with eq, so ordering comparisons, such as le, are implemented with cmp, given in listing 11.

Listing 11: ak.bignum cmp
function cmp(b0, b1) {
 var d0 = b0.at(0);
 var d1 = b1.at(0);
 var dd = d0-d1;
 var n0, n1, s, i;

 if(!isFinite(dd)) return d0!==d1 ? dd : 0;

 n0 = b0.size();
 n1 = b1.size();

 if(n0!==n1) return n0>n1 ? d0 : -d1;
 if(dd!==0)  return dd;

 s = d0<0 ? -1 : 1;

 for(i=1;i<n0 && b0.at(i)===b1.at(i);++i);
 return i<n0 ? s*(b0.at(i)-b1.at(i)) : 0;
}
ak.overload(ak.cmp, [ak.BIGNUM_T, ak.BIGNUM_T], cmp);

This is where things start to get a little bit more complicated.
We deal with corner cases with early returns, the first being a test for infinities and NaNs. If the difference between the leading digits isn't finite then one or both of them is an infinity or NaN. If the leading digits are unequal we return the difference, otherwise we return zero. This is necessary because the difference between two equal infinities is NaN rather than zero.
Next, if the bignums have different numbers of digits we return the result of comparing the leading digit of the larger with zero in the position of the smaller; i.e. either d0-0 or 0-d1.
Now that we have two bignums of the same number of digits, all we need to do is find the first digit at which they differ, if they differ at all that is. We've already calculated the difference between the leading digits and if it's not zero we simply return it. If it is, we iterate over all of the remaining equal digits to get the index of the first digit at which the bignums differ. If we reach the end of the digits without finding a difference, the bignums are trivially equal and we return zero. Otherwise we return the difference between the digits we found. Crucially we must multiply them by the signs of the leading digits, which must be the same at this point, since trailing digits do not have signs in our implementation.

Continuing in order of complexity we come to addition and subtraction, with ak.add shown in listing 12.

Listing 12: ak.bignum add
function add(b0, b1) {
 var n0 = b0.size();
 var n1 = b1.size();
 var d0 = b0.at(0);
 var d1 = b1.at(0);
 var s0 = d0<0 ? -1 : 1;
 var s1 = d1<0 ? -1 : 1;
 var sum;

 if(n0===1) {
  if(n1===1) return ak.bignum(d0+d1);

  if(d0===0) return b1;
  if(!isFinite(d0)) return b0;
 }

 if(n1===1) {
  if(d1===0) return b0;
  if(!isFinite(d1)) return b1;
 }

 if(s0===s1) {
  sum = n0>n1 ? addAbs(b0.toArray(), b1) : addAbs(b1.toArray(), b0);
 }
 else {
  if(n0===n1) n0 += cmpAbs(b0, b1);
  if(n0!==n1) sum = n0>n1 ? subAbs(b0.toArray(), b1) : subAbs(b1.toArray(), b0);
  else        sum = 0;
 }

 return ak.bignum(sum, n0>n1 ? s0 : s1);
}
ak.overload(ak.add, [ak.BIGNUM_T, ak.BIGNUM_T], add);

We deal with the corner cases first again. First of all, if the both arguments have one digit we construct a bignum with their sum (recall that the sum of two decimal integers is exactly representable so we don't need to worry about rounding here). If just the first argument has one digit then if it is zero we return the second argument and if it isn't finite we return the first. We then perform these steps for the second argument.
If both arguments have the same sign then we compute the sum of their absolute values (in-place on an array of the digits of the argument with more of them) with a call to addAbs, which I'll describe later. If not we determine which argument has the greater absolute value. This is trivially done by comparing the number of digits if they are not equal. If they have the same number of digits we adjust the first count with an absolute version of the comparison function, cmpAbs. We then in-place subtract the absolute value of the smaller from that of the larger with subAbs.
Finally, we use the sign-switching array constructor to adjust the sign of the result. Note that the second argument is ignored if the result is the number zero rather than an array.

The absolute comparison function is more or less the same as cmp as shown in listing 13.

Listing 13: ak.bignum cmpAbs
function cmpAbs(b0, b1) {
 var d = Math.abs(b0.at(0)) - Math.abs(b1.at(0));
 var n, i;

 if(d!==0) return d;

 n = b0.size();
 for(i=1;i<n && b0.at(i)===b1.at(i);++i);
 return i<n ? b0.at(i)-b1.at(i) : 0;
}

Absolute addition is a little more complex, as shown in listing 14.

Listing 14: ak.bignum addAbs
function addAbs(a, b) {
 var na = a.length;
 var nb = b.size();
 var i  = 1;
 var c  = 0;
 var x;

 a[0] = Math.abs(a[0]);

 while(i<nb) {
  x = a[na-i] + b.at(nb-i) + c;
  c = x<BASE ? 0 : 1;
  a[na-i++] = x - c*BASE;
 }

 x = a[na-i] + Math.abs(b.at(0)) + c;
 c = x<BASE ? 0 : 1;
 a[na-i++] = x - c*BASE;

 while(i<=na && c===1) {
  x = a[na-i] + c;
  c = x<BASE ? 0 : 1;
  a[na-i++] = x - c*BASE;
 }

 if(c===1) a.unshift(1);
 return a;
}

This is essentially how we would perform this calculation on paper. Note that the first argument, the array of digits with which we in-place perform the addition, must have at least as many elements as the second has digits. This means that we can be sure that iterating over the second argument's digits cannot lead us to run out of elements in the first.
We set the first element of the array, the most significant digit, to its absolute value so that the array contains the digits of the absolute value of its originating bignum. We then iterate backwards over all but the most significant of the digits of the second argument, adding the sum of equally significant digits of both arguments to the carry, c. We compute the next carry by comparing the result to the bignum base and then replace the relevant element of the array with the result modulo the base, itself computed by subtracting the carry times the base. The same calculation is then performed for the most significant digit of the second argument with the exception that its absolute value is used. Finally, we iterate over the remaining elements of the array adding the carry, so long as there is one. If we have a carry left over at the end we push it onto the front of the array with unshift before returning it.

Absolute subtraction is similar, as demonstrated in listing 15.

Listing 15: ak.bignum subAbs
function subAbs(a, b) {
 var na = a.length;
 var nb = b.size();
 var i  = 1;
 var c  = 0;
 var x;

 a[0] = Math.abs(a[0]);

 while(i<nb) {
  x = a[na-i] - b.at(nb-i) - c;
  c = x>=0 ? 0 : 1;
  a[na-i++] = x + c*BASE;
 }

 x = a[na-i] - Math.abs(b.at(0)) - c;
 c = x>=0 ? 0 : 1;
 a[na-i++] = x + c*BASE;

 while(c===1) {
  x = a[na-i] - c;
  c = x>=0 ? 0 : 1;
  a[na-i++] = x + c*BASE;
 }

 return a;
}

Again, this is essentially how we would do the calculation by hand. This time it is important to note that not only must the first argument have no fewer elements than the second has digits but also that absolute value of its originating bignum can be no smaller than the absolute value of the second and that the result of the subtraction cannot therefore be negative. As a result, we don't need to handle a final carry. Aside from this the operation is essentially equivalent to that of addAbs.
The implementation of the sub overload is equally essentially equivalent to that of add so I don't think it's worth delving into. Program 1 compares bignum addition and subtraction to that of floating point.

Program 1: Bignum Addition And Subtraction

Clearly bignum addition and subtraction have greater precision than floating point, although you'll have to resort to pencil and paper to confirm that the results are correct...

Multiplication is trickier still. For starters JavaScript's Number type doesn't have enough precision to exactly represent all possible products of two decimal integers. To exactly multiply decimal integers we'll need to split them up into smaller pieces whose products can be exactly represented. For example
\[ \begin{align*} &123,456,789,876,543 \times 234,567,890,987,654 \\ &\quad = (12,345 \times 10^{10} + 67,898 \times 10^5 + 76,543) \times (23,456 \times 10^{10} + 78,909 \times 10^5 + 87,654) \\ &\quad = (12,345 \times 23,456) \times 10^{20} \\ &\quad \quad + (12,345 \times 78,909 + 67,898 \times 23,456) \times 10^{15} \\ &\quad \quad + (12,345 \times 87,654 + 67,898 \times 78,909 + 76,543 \times 23,456) \times 10^{10} \\ &\quad \quad + (67,898 \times 87,654 + 76,543 \times 78,909) \times 10^5 \\ &\quad \quad + (76,543 \times 87,654) \end{align*} \]
The result of this sum won't fit into a Number, but it will fit into two if we interpret them as
\[ x_\mathrm{hi} \times 10^{15} + x_\mathrm{lo} \]
By carefully keeping track of the terms and the powers of ten by which they should be multiplied and with judicious use of remainder and division operations, ak.mulDec returns just such a pair, as shown in listing 16.

Listing 16: ak.mulDec
ak.mulDec = function(lhs, rhs) {
 var ls, l1, l2, l3;
 var rs, r1, r2, r3;
 var h, m, m1, m2, l, c, s;

 ls = lhs<0 ? -1 : 1;
 rs = rhs<0 ? -1 : 1;

 lhs = Math.floor(ls*lhs);
 rhs = Math.floor(rs*rhs);

 if(lhs>=1E15) lhs = ak.INFINITY;
 if(rhs>=1E15) rhs = ak.INFINITY;

 l = lhs*rhs;
 if(!isFinite(l)) {
  if(isNaN(l)) return {sign: ak.NaN, hi: ak.NaN, lo: ak.NaN};
  return {sign: ls*rs, hi: ak.INFINITY, lo: ak.INFINITY};
 }

 if(l<1E15) {
  h = 0;
 }
 else {
  l3 = lhs%1E5; l2 = ((lhs-l3)/1E5)%1E5; l1 = (lhs-l2*1E5-l3)/1E10;
  r3 = rhs%1E5; r2 = ((rhs-r3)/1E5)%1E5; r1 = (rhs-r2*1E5-r3)/1E10;

  h = l1*r1*1E5 + l1*r2 + l2*r1;
  m = l1*r3 + l2*r2 + l3*r1;
  l = (l2*r3+l3*r2)*1E5 + l3*r3;

  m2 = m%1E5;
  m1 = (m-m2)/1E5;

  h += m1;
  l += m2*1E10;

  if(l>=1E15) {
   c = l%1E15;
   h += (l-c)/1E15;
   l = c;
  }
 }
 s = (h!==0 || l!==0) ? ls*rs : 0;
 return {sign: s, hi: h, lo: l};
};

In ak.mulDec we work with truncated absolute values of the arguments for simplicity, storing the signs for later reintroduction. As usual, we deal with corner cases first, returning early for NaNs and infinities. Hereafter, we'll put the lower order term in the variable l and the higher in h. If the product of the arguments doesn't involve rounding, if it is less than \(10^{15}\), then we are done and we leave it in l and set h to zero. If not we have to slice up the arguments and carefully assemble the result.
The important thing to note about this code is that if we subtract the remainder modulo some divisor from a value then subsequent division by it is exact. Having so noted, the only tricky bits are making sure that we correctly split the term of order \(10^{10}\), here stored in m, between h and l and correctly handle the carry, if any, in the lower order term.

Armed with exact decimal multiplication, it's a relatively simple job to implement long multiplication for bignums, as illustrated in listing 17.

Listing 17: ak.bignum mul
function mul(b0, b1) {
 var n0 = b0.size();
 var n1 = b1.size();
 var n  = n0+n1;
 var d0 = b0.at(0);
 var d1 = b1.at(0);
 var s0 = d0<0 ? -1 : 1;
 var s1 = d1<0 ? -1 : 1;
 var prod = [];
 var i, j, k, c, x;

 if(n0===1) {
  if(d0=== 0 || !isFinite(d0)) return ak.bignum(d0*d1);
  if(d0=== 1) return b1;
  if(d0===-1) return neg(b1);
 }

 if(n1===1) {
  if(d1=== 0 || !isFinite(d1)) return ak.bignum(d0*d1);
  if(d1=== 1) return b0;
  if(d1===-1) return neg(b0);
 }

 for(i=0;i<n;++i) prod[i] = 0;

 for(i=0;i<n0;++i) {
  for(j=0;j<n1;++j) {
   k = i+j+1;

   x = ak.mulDec(b0.at(i), b1.at(j));
   x.lo += prod[k];
   x.hi += prod[k-1];

   if(x.lo>=BASE) {
    x.lo -= BASE;
    x.hi += 1;
   }

   c = x.hi<BASE ? 0 : 1;
   prod[k--] = x.lo;
   prod[k--] = x.hi - c*BASE;

   while(c===1) {
    x = prod[k] + c;
    c = x<BASE ? 0 : 1;
    prod[k--] = x - c*BASE;
   }
  }
 }
 return ak.bignum(prod, s0*s1);
}
ak.overload(ak.mul, [ak.BIGNUM_T, ak.BIGNUM_T], mul);

This is more or less the long multiplication technique that we were taught in school, the essential difference being that the second number is handled digit by digit too rather than as a whole.
We have corner cases for infinities, NaNs, zeros and units, all but the last of which we handle by multiplying the first digits to be sure that NaNs and signs are correctly propagated. After we've dealt with them we get to the meat of the calculation. The key observation is that the \(i^{th}\) least significant digit has a multiplicative factor of \(10^i\) (counting from 0, that is), so the \(i^{th}\) of the first times the \(j^{th}\) of the second has one of \(10^{i+j}\). A little care must be taken since the digits of our bignums are ordered from most to least significant rather than the reverse, but apart from this all that's required is to correctly handle the carries when adding the results of ak.mulDec to the appropriate elements of the result.

Program 2 compares bignum and floating point multiplication, although you'll have your work cut out for you validating the correctness of the former by hand.

Program 2: Bignum Multiplication

Note that this is an extremely naive approach to multiplication and is vastly less efficient than the algorithms that are routinely used in practice. It does have the advantage of being a good deal simpler than those algorithms but no one in their right mind would use it for real work. That I have chosen it is principally because the purpose of the ak library is exposition, not practical application, and I don't think that that would be helped by getting into the more complex aspects of numerical computing at this point. Besides, I'm implementing a numerical library in JavaScript of all things, so perhaps holding me to the standards of the right-minded isn't entirely appropriate.

Division is the most difficult of the basic arithmetic operations and, in keeping with my implementation of multiplication, I'll be using the schooldays method of long division.
For example, consider dividing 4,753 by 68.
\[ 68 \, \overline{\big) \, 4,735} \]
At first glance it looks like the could be a 7 in the tens column since 47 ÷ 6 ≈ 7.8, but 68 × 7 = 4,760 so it must be a 6.
\[ \phantom{68 \, \big) \, 4,7}6\\ 68 \, \overline{\big) \, 4,735}\\ \phantom{68 \, \big) \, }\underline{4,080}\\ \phantom{68 \, \big) \, 4 ,}655 \]
Now it's pretty obvious that there should be a 9 in the units column, giving
\[ \phantom{68 \, \big) \, 4,7}69\\ 68 \, \overline{\big) \, 4,735}\\ \phantom{68 \, \big) \, }\underline{4,080}\\ \phantom{68 \, \big) \, 4,}655\\ \phantom{68 \, \big) \, 4,}\underline{612}\\ \phantom{68 \, \big) \, 4,6}43 \]
The correction we had to make to from our initial estimate of 7 for the tens column is an irritating feature of this algorithm. Fortunately it is trivially fixed if we simply allow negative remainders.
\[ \phantom{68 \, \big) \, }0,070\\ 68 \, \overline{\big) \, 4,735}\\ \phantom{68 \, \big) \, }\underline{4,760}\\ \phantom{68 \, \big) \, \,}-25 \]
Note that we're no longer setting the digits of the ratio as such, but rather accumulating corrections to it.
With this version of the technique we have a remainder that has a smaller absolute value after just one step, but unfortunately it's negative. To get a positive remainder we must add the divisor to it or, in other words, subtract one from the ratio.
This is exactly the approach used by ak.div, as shown in listing 18.

Listing 18: ak.bignum div
function div(b0, b1) {
 var n0 = b0.size();
 var n1 = b1.size();
 var n  = n0-n1+1;
 var d0 = b0.at(0);
 var d1 = b1.at(0);
 var s0 = d0<0 ? -1 : 1;
 var s1 = d1<0 ? -1 : 1;
 var rat, i, c;

 if(n0===1) {
  if(n1===1) return ak.bignum(ak.trunc(d0/d1));
  if(d0===0 || !isFinite(d0)) return ak.bignum(d0/d1);
 }

 if(n1===1) {
  if(d1=== 0 || !isFinite(d1)) return ak.bignum(d0/d1);
  if(d1=== 1) return b0;
  if(d1===-1) return neg(b0);
 }

 b0  = abs(b0);
 b1  = abs(b1);
 d1 *= s1;

 rat = ak.bignum(0);

 while(n0>n1 || (n0===n1 && cmpAbs(b0, b1)>=0)) {
  d0 = b0.at(0);

  if(Math.abs(d0)>=d1) {
   c = [ak.trunc(d0/d1)];
   for(i=1;i<=n0-n1;++i) c[i] = 0;
  }
  else {
   c = [ak.trunc(BASE*d0/d1)];
   for(i=1;i<n0-n1;++i) c[i] = 0;
  }
  c   = ak.bignum(c);
  rat = add(rat, c);
  b0  = sub(b0, ak.mul(b1, c));
  n0  = b0.size();
 }
 if(b0.at(0)<0) rat = sub(rat, ak.bignum(1));
 return s0*s1<0 ? neg(rat) : rat;
}
ak.overload(ak.div, [ak.BIGNUM_T, ak.BIGNUM_T], div);

As usual we handle the corner cases first, in much the same way as we have for other operations. Having done so, we again subsequently work with the absolute values of the arguments, replacing the sign at the end.
During the body of the calculation we have two cases. If the leading digit of the divisor is smaller than that of the absolute value of the number it divides, the dividend, we add the truncated ratio of them to the result at the appropriate position and subtract the product from it. If not, we do the same, but with the dividend multiplied by the bignum base.
Now, the second case may involve some loss of precision but it doesn't matter. The subsequent multiplication and subtraction are exact, so even if the correction to the ratio doesn't go quite far enough it'll be pretty close and the difference will be made up during the next iteration.
Program 3 demonstrates bignum division.

Program 3: Bignum Division

Now if my implementation of bignum multiplication is considered naive, then my implementation of bignum division is positively gullible. Not only does it rely upon said naive multiplication operations, but it requires more of them than are required by typical division algorithms. Once again I must appeal to simplicity for justification of my approach. Rudimentary, yadda yadda yadda. Exposition, blah blah blah.

All of the ak.bignum overloads are given in listing 19.

Listing 19: ak.bignum Overloads
ak.overload(ak.abs,  ak.BIGNUM_T, abs);
ak.overload(ak.neg,  ak.BIGNUM_T, neg);

ak.overload(ak.add, [ak.BIGNUM_T, ak.BIGNUM_T], add);
ak.overload(ak.cmp, [ak.BIGNUM_T, ak.BIGNUM_T], cmp);
ak.overload(ak.div, [ak.BIGNUM_T, ak.BIGNUM_T], div);
ak.overload(ak.eq,  [ak.BIGNUM_T, ak.BIGNUM_T], eq);
ak.overload(ak.ge,  [ak.BIGNUM_T, ak.BIGNUM_T], ge);
ak.overload(ak.gt,  [ak.BIGNUM_T, ak.BIGNUM_T], gt);
ak.overload(ak.le,  [ak.BIGNUM_T, ak.BIGNUM_T], le);
ak.overload(ak.lt,  [ak.BIGNUM_T, ak.BIGNUM_T], lt);
ak.overload(ak.mod, [ak.BIGNUM_T, ak.BIGNUM_T], mod);
ak.overload(ak.mul, [ak.BIGNUM_T, ak.BIGNUM_T], mul);
ak.overload(ak.ne,  [ak.BIGNUM_T, ak.BIGNUM_T], ne);
ak.overload(ak.sub, [ak.BIGNUM_T, ak.BIGNUM_T], sub);

Bignum.js contains the complete implementation of ak.bignum.

The Strengths Of Arbitrary Precision

As integers, bignums are extremely capable, able to represent any value no matter how large, memory permitting. This makes them ideal for fixing fixed point's susceptibility to overflow. Furthermore, they can be used to perform calculations with much greater precision than IEEE double precision, as illustrated in program 4.

Program 4: Fun With Bignums

Recall that ak.calc is an RPN calculator that forwards its operation to our overloaded functions.
Now this program may look rather like magic, but in fact it's a fairly straightforward Taylor series approximation of Machin's formula for \(\pi\)
\[ \frac{\pi}{4} = 4 \arctan \frac{1}{5} - \arctan \frac{1}{239} \]
and is one of the oldest schemes for calculating large numbers of its digits.

The Weaknesses Of Arbitrary Precision

Unfortunately bignums are by no means a panacea for the problems of fixed point arithmetic.
Firstly, and least importantly, they are not very space efficient. Normal double precision floating point numbers (i.e. those that aren't suffering from underflow) can cram values from the order of 10-308 to 10+308 into a mere eight bytes. In contrast, values of such magnitudes would require over 150 bytes with my bignums.
Secondly, they're not very computationally efficient either. Even if you were to replace my ham-fisted implementations of multiplication and division with something decent, they couldn't remotely compete with floating point arithmetic which is generally implemented in hardware.
Finally, and most importantly, whilst arbitrary precision fixed point numbers can have as many decimal places as time and space allow, they still suffer from rounding errors. The simplest example of the problem is that the expression
\[ \frac{1}{3} \times 3 \]
will never compare equal to one, no matter how many decimal places we use, since it'll have a value something like
\[ 0.99999999999999999999999999999999999\dots \]
So not only is arbitrary precision fixed point less efficient than floating point, it does not solve the problem of rounding. As such, it really isn't much better for general purpose arithmetic than ordinary fixed point and is consequently no better a candidate for replacing floating point.

References

[1] You're Going To Have To Think! Why Fixed Point Arithmetic Won’t Cure Your Floating Point Blues, www.thusspakeak.com, 2013.

Based upon an article I wrote for ACCU's Overload magazine.

Leave a comment