The After Strife


As well as required arithmetic operations, such as addition, subtraction, multiplication and division, the IEEE 754 floating point standard[1] has a number of recommended functions. For example, finite determines whether its argument is neither infinite nor NaN and isnan determines whether its argument is NaN, behaviours that shouldn't be particularly surprising since they're more or less equivalent to JavaScript's isFinite and isNaN functions respectively.
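Indeed, we can see the correspondence directly at the JavaScript console

console.log(isFinite(1));     // true
console.log(isFinite(1/0));   // false, since 1/0 yields Infinity
console.log(isNaN(0/0));      // true, since 0/0 yields NaN
console.log(isNaN(1/0));      // false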
One recommended function that JavaScript does not provide, and which I should like to add to the ak library, is nextafter, which returns the first representable floating point number after its first argument in the direction of its second.

A Bit Fiddly

Figure 1: IEEE 754 Single Precision Floating Point
\( \pm \, a_1 a_2 a_3 \dots a_8 \, b_1 b_2 b_3 \dots b_{23} \)

  Binary Exponent \((a_1 \dots a_8)\)     Decimal Exponent     Binary Value
  00000000     0     \(\pm 0.b_1 b_2 b_3 \dots b_{23} \times 2^{-126}\)
  00000001     1     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{-126}\)
  00000010     2     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{-125}\)
  00000011     3     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{-124}\)
  ...     ...     ...
  01111111     127     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{0}\)
  10000000     128     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{1}\)
  ...     ...     ...
  11111100     252     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{125}\)
  11111101     253     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{126}\)
  11111110     254     \(\pm 1.b_1 b_2 b_3 \dots b_{23} \times 2^{127}\)
  11111111     255     \(\pm\infty\) if all \(b_i\) are zero, NaN otherwise
If we were working with a relatively low level language, such as C, then my first instinct when implementing this function would be to reinterpret the memory containing its floating point argument as an array of bytes and use bitwise operations to extract the sign bit as a boolean and the bits of the exponent and significand as integers. Recall that IEEE floating point numbers take the form
\[ \pm b \times 2^a \]
where \(b\) is the significand and \(a\) is the exponent, and are arranged in memory with the sign bit first, followed by the bits of the exponent and then those of the significand, as summarised in figure 1 for 32 bit single precision numbers. Note that finite numbers with non-zero exponent bits have an implied leading digit of one and are referred to as normal, whereas those whose exponents' bits are all zero have an implied leading digit of zero and are referred to as subnormal.
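For example, a zero sign bit, exponent bits of 10000000 and significand bits of 1000...0 represent
\[ +1.1000\dots0_2 \times 2^{128-127} = 1\tfrac12 \times 2 = 3 \]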
Now we could use integer operations to calculate the bits of the next floating point number, put them back into an array of bytes and finally reinterpret its contents as the floating point result.

Unfortunately, JavaScript isn't a relatively low level language and won't let us do this sort of thing, so we shall have to come up with an alternative approach.

Go Forth And Multiply

If you have a particularly good memory you will recall that we did something similar when widening the bounds of our interval arithmetic numerical type[2], interval. This sought to keep track of numerical errors by defining the results of arithmetic operations as the widest possible intervals given the ranges of values in their interval arguments, finally widening their bounds to reflect rounding error.
For example, for positive normal values \(l_0\), \(u_0\), \(l_1\) and \(u_1\) we defined addition as
\[ \left[l_0, u_0\right] + \left[l_1, u_1\right] = \left[\left(l_0+l_1\right) \times \left(1-\epsilon\right), \left(u_0+u_1\right) \times \left(1+\epsilon\right)\right] \]
where \(\epsilon\) is the difference between one and the smallest floating point number greater than one, known as the floating point epsilon.
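As an aside, if we didn't have ak.EPSILON to hand we could recover the floating point epsilon at run time with a sketch along the lines of

var eps = 1.0;
while(1+0.5*eps>1) eps *= 0.5;  // halve until adding it to one makes no difference
console.log(eps);               // 2.220446049250313e-16, which is 2^-52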
Derivation 1: Next Towards Zero
Given a positive normal floating point number
\[ x = 1 \, . \, b_1 \, b_2 \, \dots \, b_{n-1} \, b_n \, \times \, 2^a \]
then, using exact arithmetic, we have
\[ \begin{align*} &x \times \left(1-\tfrac12\epsilon\right) = x - x \times \tfrac12\epsilon\\ &\quad= 1 \, . \, b_1 \, b_2 \, \dots \, b_{n-1} \, b_n \, 0 \, 0_\phantom{1} \, 0_\phantom{2} \, \dots \, 0_\phantom{n-1} \, 0_\phantom{n} \, \times \, 2^a\\ &\quad- 0 \, . \, 0_\phantom{1} \, 0_\phantom{2} \, \dots \, 0_\phantom{n-1} \, 0_\phantom{n} \, 1 \, b_1 \, b_2 \, \dots \, b_{n-1} \, b_n \, \times \, 2^a\\ \end{align*} \]
Firstly, if any of the bits \(b_i\) are non-zero then the \((n+1)^\mathrm{th}\) bit of the subtraction must absorb a borrow from the subtractions of the following bits and so it will result in zero with a further borrow.
Now if the last non-zero bit of \(x\) is \(b_k\) this yields
\[ 1 \, . \, b_1 \, b_2 \, \dots \, b_{k-1} \, 0 \, 1 \, \dots \, 1 | 0 \, c_1 \, c_2 \, \dots \, c_{n-1} \, c_n \, \times \, 2^a \]
for some bits \(c_i\) and where the vertical bar marks the end of the bits of the significand, which will round down to
\[ 1 \, . \, b_1 \, b_2 \, \dots \, b_{k-1} \, 0 \, 1 \, \dots \, 1 \, \times \, 2^a \]
as required.
Secondly, if all of the bits \(b_i\) are zero then, by the same derivation of the value of \(1-\frac12\epsilon\), we have
\[ \begin{align*} x \times \left(1-\tfrac12\epsilon\right) &= 0 \, . \, 1 \, 1 \, \dots \, 1 | 1 \, 0 \, \dots 0 \, \times \, 2^a\\ &= 1 \, . \, 1 \, \dots \, 1 \, 1 \, \times \, 2^{a-1} \end{align*} \]
which is exactly representable as a floating point number if \(a\) is greater than the most negative normal exponent.
Finally, for negative floating point numbers we can simply negate them before the multiplication and rounding and negate the result afterwards which, since rounding to nearest is symmetric in sign, proves that the multiplication also yields the next closest floating point number to zero for them.
The final multiplication of the lower and upper bounds by \(1-\epsilon\) and \(1+\epsilon\) has the advantage of being simple to implement but unfortunately frequently overestimates the effect of rounding. Nevertheless, the idea of using multiplication to find the next floating point number is an appealing one.

The first thing to note is that the difference between one and the greatest floating point number that is less than one is, in fact, \(\frac12\epsilon\) since the results of arithmetic operations are normalised to have a leading digit of one whenever possible and consequently
\[ \begin{align*} 1 - \epsilon = &1.0 \dots 00 \times 2^0\\ -&0.0 \dots 01 \times 2^0\\ = &0.1 \dots 11 \times 2^0\\ = &1.1 \dots 10 \times 2^{-1}\\ \\ 1 - \tfrac12\epsilon = &1 - \epsilon + \tfrac12\epsilon\\ = &1.1 \dots 10 \times 2^{-1}\\ + &0.0 \dots 01 \times 2^{-1}\\ = &1.1 \dots 11 \times 2^{-1} \end{align*} \]
This means that for any finite floating point number that's greater than the smallest normal number, or less than its negation, we can find the next closest one to zero by multiplying it by \(1-\frac12\epsilon\), as proven in derivation 1.
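For example, in double precision, where \(\epsilon\) equals \(2^{-52}\), we can check both the asymmetry about one and the effect of the multiplication

var eps = Math.pow(2, -52);             // the double precision epsilon
console.log(1+0.5*eps === 1);           // true: a tie, rounded to the even candidate
console.log(1-0.5*eps < 1);             // true: half an epsilon makes a difference below one

var x = 1.75;
var y = x*(1-0.5*eps);                  // the next number toward zero
console.log(x-y === Math.pow(2, -52));  // true: exactly one unit in the last place of x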

The next thing to note is that, even though \(1+\epsilon\) is the first floating point number greater than one, \(x \times (1+\epsilon)\) is not necessarily the first floating point number away from zero after \(x\), as demonstrated by program 1 which calculates the difference between \(x\) and the first floating point number that's less than \(x \times (1+\epsilon)\) for some randomly chosen values of \(x\).

Program 1: Not The Next After
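In the spirit of program 1, a minimal sketch of that calculation is

var eps = Math.pow(2, -52);      // the double precision epsilon
var i, x, y;

for(i=0;i<10;++i) {
 x = Math.random()+0.5;          // a positive normal number
 y = x*(1+eps)*(1-0.5*eps);      // the first float less than x*(1+eps)
 console.log(y-x);               // zero only when x*(1+eps) is the very next float
}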

A simple way to see why this fails is to multiply \(1\frac34\) by \(1+\epsilon\) using exact arithmetic to yield
\[ \begin{align*} 1\tfrac34 + 1\tfrac34 \times \epsilon = &1 \, . \, 1 \, 1 \, 0 \, \dots \, 0 \, 0 | 0 \, 0 \, 0 \, \dots \, 0 \, 0 \, \times 2^0\\ + &0 \, . \, 0 \, 0 \, 0 \, \dots \, 0 \, 1 | 1 \, 1 \, 0 \, \dots \, 0 \, 0 \, \times 2^0\\ = &1 \, . \, 1 \, 1 \, 0 \, \dots \, 0 \, 1 | 1 \, 1 \, 0 \, \dots \, 0 \, 0 \, \times 2^0 \end{align*} \]
where, as in derivation 1, the vertical bar marks the end of the significand. When this is rounded to the nearest floating point number we get
\[ 1\tfrac34 \times (1+\epsilon) = 1 \, . \, 1 \, 1 \, 0 \, \dots \, 1 \, 0 \, \times 2^0 \]
which is clearly a step too far.

Now we could implement nextafter away from zero by multiplying its argument \(x\) by \(1 + \epsilon\), then multiplying the result by \(1 -\tfrac12\epsilon\) and returning the latter if it's not equal to \(x\) and the former if it is, as sketched below for positive normal arguments.
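var eps = Math.pow(2, -52);   // the double precision epsilon

function nextUp(x) {          // a sketch for positive normal x only
 var y = x*(1+eps);           // either one or two steps above x
 var z = y*(1-0.5*eps);       // the first float below y
 return z!==x ? z : y;        // if z isn't x then z is x's successor, otherwise y is
}

console.log(nextUp(1.75) === 1.75+Math.pow(2, -52));  // true

We can, however, do better than that.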

The Next Division

The trick is to divide \(x\) by \(1-\frac12\epsilon\) instead, as demonstrated by program 2.

Program 2: The Next After All
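A sketch in the same spirit, assuming double precision, confirms that the quotient is one step from \(x\) by stepping it back again with the multiplication that we have already proven correct

var eps = Math.pow(2, -52);               // the double precision epsilon
var i, x, y;

for(i=0;i<10;++i) {
 x = Math.random()+0.5;                   // a positive normal number
 y = x/(1-0.5*eps);                       // the candidate for the next float after x
 console.log(y>x && y*(1-0.5*eps)===x);   // true: one step up and one step back down
}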

Derivation 2: Bounds On The Division's Result
First, recall that the sum of a geometric series is given by
\[ S_n = \sum_{i=1}^n a r^{i-1} = a \frac{1-r^n}{1-r} \]
where \(\sum\) is the summation sign. Setting \(r\) to \(\frac12\epsilon\) and \(a\) to one and letting \(n\) tend to infinity yields
\[ \sum_{i=1}^\infty \left(\tfrac12\epsilon\right)^{i-1} = \frac{1}{1-\tfrac12\epsilon} \]
and so
\[ \frac{x}{1-\tfrac12\epsilon} = x \times \left(1 + \tfrac12\epsilon + \sum_{i=3}^\infty \left(\tfrac12\epsilon\right)^{i-1}\right) \]
Next, we have
\[ \begin{align*} \sum_{i=3}^\infty \left(\tfrac12\epsilon\right)^{i-1} &= \sum_{i=1}^\infty \tfrac14\epsilon^2 \left(\tfrac12\epsilon\right)^{i-1} = \frac{\tfrac14\epsilon^2}{1-\tfrac12\epsilon} \end{align*} \]
Finally, we can derive bounds for this with
\[ \begin{array}{rcccl} \tfrac12\epsilon^2 - \tfrac18\epsilon^3 &<& \tfrac12\epsilon^2 &<& \tfrac12\epsilon\\ \tfrac14\epsilon^2 - \tfrac18\epsilon^3 &<& \tfrac14\epsilon^2 &<& \tfrac12\epsilon - \tfrac14\epsilon^2\\ \tfrac14\epsilon^2 \left(1 - \tfrac12\epsilon\right) &<& \tfrac14\epsilon^2 &<& \tfrac12\epsilon \left(1 - \tfrac12\epsilon\right)\\ \tfrac14\epsilon^2 &<& \frac{\tfrac14\epsilon^2}{1 - \tfrac12\epsilon} &<& \tfrac12\epsilon \end{array} \]
and consequently
\[ x \times \left(1+\tfrac12\epsilon\right) < \frac{x}{1-\tfrac12\epsilon} < x \times \left(1+\epsilon\right) \]
Now it's not entirely obvious why this works and so you may be forgiven if you're a little sceptical that it will do so for all normal floating point numbers.
To see why it does, first note that
\[ \begin{align*} x \div \left(1-\tfrac12\epsilon\right) &< x \times \left(1+\epsilon\right)\\ x \div \left(1-\tfrac12\epsilon\right) &> x \times \left(1+\tfrac12\epsilon\right) \end{align*} \]
as proven by derivation 2.
This means that the division adds to \(x\) only bits that fall after the last digit of its significand, amounting to more than half a unit in the last place but no more than a whole one, and so it must round to exactly the next floating point number when it is represented as one.

Nifty, eh?

Those Guys Ain't Normal

Now everything that we've done so far depends upon the fact that normal floating point numbers have an implied leading digit of one and since subnormal numbers don't, we shall have to handle them differently.
Fortunately, they're much easier to deal with than normal numbers since the least significant digit of every subnormal number corresponds to the same power of two. In fact, this is true of all floating point numbers with a magnitude that is smaller than twice the smallest normal number.
For any such number we need simply add or subtract the smallest positive subnormal number to find the next floating point number after or before it. JavaScript conveniently defines Number.MIN_VALUE to be precisely that number, which we've also copied into ak.MIN_VALUE.
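For example, in double precision we have

var tiny = Number.MIN_VALUE;                    // 2^-1074, the smallest positive subnormal
var x = 3*tiny;                                 // a subnormal number
console.log(x+tiny === 4*tiny);                 // true: the addition steps up by exactly one number
console.log(x*(1-0.5*Math.pow(2, -52)) === x);  // true: the multiplicative trick stalls here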

We must also take care to treat non-finite numbers as special cases. Specifically, the next numbers towards zero after plus and minus infinity are plus and minus the largest finite floating point number, provided by JavaScript's Number.MAX_VALUE and also copied into ak.MAX_VALUE. Furthermore, the next number after NaN towards any number should also be NaN, as should be the next number towards NaN after any number.

Finally, we shall define the next number after any number towards itself to be itself.

Implementation Details

Now that we've worked out how to find the next floating point number after any given floating point number in any given direction using nothing but arithmetic operations, all that's left to do is implement it. This is done by the ak.nextAfter function given in listing 1.

Listing 1: ak.nextAfter
var M = 2*ak.MIN_NORMAL;     // below this magnitude the spacing is ak.MIN_VALUE
var R = 1.0-0.5*ak.EPSILON;  // multiply to step toward zero, divide to step away

ak.nextAfter = function(x, y) {
 x = Number(x);
 y = Number(y);

 if(y>x) {
  if(x>= M) return x/R;                                     // step away from zero
  if(x<=-M) return x!==-ak.INFINITY ? x*R : -ak.MAX_VALUE;  // step toward zero
  return x+ak.MIN_VALUE;                                    // |x|<M, exact addition
 }
 else if(y<x) {
  if(x<=-M) return x/R;                                     // step away from zero
  if(x>= M) return x!==ak.INFINITY ? x*R : ak.MAX_VALUE;    // step toward zero
  return x-ak.MIN_VALUE;                                    // |x|<M, exact subtraction
 }
 return isNaN(y) ? y : x;                                   // x===y or one of them is NaN
};

I don't think that there's anything particularly surprising about this function, given that it simply branches to the appropriate arithmetic operation for any given number. There is perhaps a slight subtlety in the final return statement, which will be reached if x equals y or if either is NaN; if y is NaN and x isn't then we should return NaN rather than x, which is why we need the final test.
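For example, granted that NextAfter.js has been loaded, we should find that

console.log(ak.nextAfter(ak.INFINITY, 0) === ak.MAX_VALUE);  // true
console.log(ak.nextAfter(0, 1) === ak.MIN_VALUE);            // true
console.log(ak.nextAfter(2, 2) === 2);                       // true
console.log(isNaN(ak.nextAfter(0/0, 1)));                    // true
console.log(isNaN(ak.nextAfter(1, 0/0)));                    // true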

Program 3 demonstrates how to use ak.nextAfter to find the next floating point numbers toward plus and minus infinity for numbers of various magnitudes, using our ak.normalInvCDF implementation of the inverse of the normal cumulative distribution function to generate them.

Program 3: Using ak.nextAfter
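Something much like it can be sketched without assuming ak.normalInvCDF's interface by substituting powers of ten for the normally distributed magnitudes

var i, x;
for(i=-4;i<=4;i+=2) {
 x = Math.pow(10, i);                            // a number of magnitude 10^i
 console.log(ak.nextAfter(x, ak.INFINITY)-x);    // the gap up to the next number
 console.log(x-ak.nextAfter(x, -ak.INFINITY));   // the gap down to the previous number
}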

All that's left to say is that the implementation of ak.nextAfter can be found in NextAfter.js.

References

[1] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985, 1985.

[2] You're Going To Have To Think! Why Interval Arithmetic Won't Cure Your Floating Point Blues, www.thusspakeak.com, 2013.
