Floating point arithmetic is a much maligned beast. Too many times have I heard programmers express the opinion that any program using floating point numbers is broken by design. Even Douglas Crockford, whose advice I have largely followed on matters of JavaScript, is guilty of it, placing them in the Awful Parts appendix of his JavaScript: The Good Parts^{[1]}.
The IEEE 754 floating point standard^{[2]} is, in fact, incredibly well designed and anyone who considers it broken because of the behaviour he describes has thoroughly misplaced their blame.
However, I suspect that we are somewhat less aware of how they fit into the general mathematical classification of number types. Therefore, before we start looking at the various techniques for representing numbers with computers I should like to explore what exactly it is we mean by number.
The concept of number has been refined throughout history as generations of mathematicians have time and again stumbled across inconsistencies in their understanding.
A hierarchy of number types as we currently understand them is provided in figure 1.
Traversing this tree from left to right, we more or less recover the sequence of development of our concept of number from prehistory to the modern day.
The story of number begins with the integers, or more accurately the natural numbers; whole numbers that are greater than zero. Animal studies have shown that primates, rats and even some birds have a rudimentary ability to count; presumably using neural circuitry similar to that we use to distinguish at a glance between three and four objects, but that cannot distinguish between nineteen and twenty. It is not unreasonable, therefore, to suppose that a basic awareness of the natural numbers predates man.
Negative numbers are, comparatively, an example of striking modernity having been discovered in India just a few millennia ago.
The integers, as important as they are, are not particularly useful for measurement; the distance between the ziggurat and the brothel is never quite a whole number of cubits, for example. For this task we instead employed the fractions, or rationals; those numbers equal to the ratio of two integers. Note that the rationals are a superset of the integers; every integer is trivially the ratio of itself and 1. For many years it was thought that the rationals comprised the totality of number. Legend has it that a member of the school of Pythagoras discovered that the square root of 2 could not be expressed as a fraction and that his compatriots were so put out by this fact that they drowned him (I shall revisit this in a later post).
The algebraic irrationals are those numbers which are roots of polynomial equations with rational, or equivalently integer, coefficients. By roots we mean those real values, if any, at which the polynomial equates to zero; \(\sqrt{2}\) is a root of the polynomial \(x^22\), for example. Technically, the algebraic numbers are a superset of the rationals since the latter are solutions to linear equations with integer coefficients.
The final breed of numbers, the transcendentals, are the most elusive. These are the numbers which are not solutions of polynomial equations with rational coefficients and include such notable numbers as \(\pi\) and \(e\). They are so difficult to identify that it is still not known whether \(\pi + e\) is transcendental. Despite this, it is known that the transcendentals form the vast majority of numbers; if you were to throw a dart at a line representing the numbers between 0 and 1, you would almost certainly hit a transcendental. To understand why, we need to discuss the mathematics of infinite sets.
Cantor named the smallest of the transfinite cardinals, the size of the integers, \(\aleph_0\). This is known as the countable infinity since we can imagine an algorithm that, given infinite time, would step through them sequentially, counting them off one at a time.
He then asked the question of whether or not the rationals were larger than the integers, of whether or not they were uncountable. His proof that they were not is one of the most elegant in all of mathematics.
When we say a set is countable, we strictly mean that it can be put into a one to one correspondence with the nonnegative integers. For example, the integers are countable since we can map from the nonnegative integers to them with the rules
which clearly counts through the integers, one at a time.
Cantor laid out the rationals such that the numerator (the number on top of the fraction) was indicated by the column and the denominator (the number on the bottom of the fraction) was indicated by the row, as shown in figure 2.
What Cantor realised was that, whilst each row and column stretched on forever and so couldn’t be counted one after the other, the diagonals between the first column of a given row and the first row of the corresponding column were all finite and hence countable. For example, we could iterate over the first row, counting diagonally backwards through the table until we hit the first column yielding the sequence
If we skip any number we have seen before, we have the sequence
which counts through all of the rationals, one at a time.
So, rather surprisingly, despite there being an infinite number of fractions between any two different integers, the sizes of the set of fractions and the set of integers are in fact equal.
Cantor proceeded to demonstrate that the set of polynomial equations with integer coefficients is also countable and, since each has a finite number of roots, so are the algebraic numbers.
He did this by defining a function, we shall call it \(c\), that takes a polynomial with integer coefficients and returns a positive integer. It operates by adding together the absolute values of the coefficients and the order of the polynomial, the highest power to which the variable is raised, minus one.
For example
Cantor realised that every possible value of this function is shared by a finite number of such polynomials. For example, there are just 4 such polynomials for which this function yields 2.
We are left with the question of whether or not the transcendental numbers are of the same size.
If the transcendental numbers are countable then the real numbers, being the union of both they and the algebraic numbers, must be countable too since we could simply alternate between the sequences of each of them.
Cantor noted that if the reals were countable we could construct a list of them as they are generated by the mapping from the integers. Figure 3 illustrates what this list might look like for just the numbers between 0 and 1.
Now starting after the decimal point in the first row and moving diagonally down and to the right we can construct a new number
This number is clearly between 0 and 1, but must differ from every number in the list at no less than one digit. We have thus found a number between 0 and 1 that was not in our list and hence the list is incomplete. It is not, therefore, possible to construct such a list and hence the reals, and consequently the transcendentals, are uncountably infinite. Being more sizable than the other numbers, their cardinal number is denoted by \(\aleph_1\).
Neat, huh?
The IEEE standard defines floating point numbers to have a format similar to the scientific notation many of us will recognise from our calculators and spreadsheets. In the familiar decimal base 10 this means a number between 1 and 10 multiplied by 10 raised to the power of another number.
For example, the number of days in a year is approximately 365. Dividing by this 100 gives us a number between 1 and 10, namely 3.65. Since 100 is 10 raised to the power of 2, the number of days in the year can be written as3.65 × 10^{2} , or commonly 3.65E2.
The number that we multiply by the power of 10 is known as the significand and the power of 10 by which we multiply it is known as the exponent.
Since base 10 is rather inconvenient from a computing perspective, IEEE floating point numbers are defined in the binary base 2. Specifically, numbers are defined as±b × 2^{a} with, in the single precision format, the sign taking one bit, the exponent a taking 8 bits and the significand b taking 24. As in decimal the significand must lie between 1 and 10, so in binary it must lie between 1 and 2. The leading digit must therefore be 1, and we can represent b with 23 bits rather than the full 24.
There is, in fact, a special case when we assume the leading digit is 0 rather than 1. This occurs when the exponent takes on its most negative value, yielding the very smallest floating point numbers. Since the leading digits of these numbers, known as subnormal or denormalised numbers, are 0 they have fewer significant digits of accuracy, or equivalently have lower precision. In contrast recall that normal numbers have an implied leading digit of 1 and consequently have the full 24 bits with which to represent the significand.
In addition to the normal and subnormal numbers, IEEE 754 defines bit patterns to represent the positive and negative infinities and a set of error values for invalid calculations known as the NaNs, for Not a Number.
Figure 4 enumerates the full set of IEEE 754 single precision floating point numbers.
Note that since the significand is finite, floating point numbers are actually a finite subset of the rational numbers and it is vitally important not to confuse them with real numbers.
From this point on I shall assume that we're using double precision numbers. Given that JavaScript numbers are of this type I couldn't give examples in single point anyway.
Now that we have covered the mundane implementation details of floating point numbers it is time to start looking at the rather more important topic of their precise behaviour.
Furthermore, NaNs always compare unequal, even to themselves. Every comparison operator other than not equal are similarly always false if they involve NaNs. If you keep this in mind when designing loops and branches, you can ensure that your algorithms will behave predictably in the face of invalid arithmetic operations.
Note that whilst there are many NaNs, JavaScript only supports one of them.
Dividing any finite number by an infinity will yield zero and dividing any nonzero number by zero will yield an infinity of the same sign as that number. Adding or subtracting any finite number to or from an infinity will result in that infinity.
These properties mean that many numerical algorithms can implicitly cope with numerical overflow since arithmetic operations and comparisons are internally consistent and, accompanied by some vigorous hand waving, mathematically sound.
Note that dividing 0 by 0, dividing an infinity by an infinity, multiplying an infinity by 0 and subtracting an infinity from itself all yield NaNs.
We can illustrate the problem by considering decimal notation. Say we restrict ourselves to 4 figures after the decimal point. Assuming that we have chosen the closest number in this representation, x, to a given number we can only say that its true value lies somewhere within x±5E5. For example, given π to 4 decimal places, 3.1416, we can only state with certainty that it lies between 3.14155 and 3.14165.
Similarly, for an IEEE double precision floating point number x with an exponent of a, we can only be sure that the true value lies between x±2^{a53}. Conveniently, since normalised floating point numbers have a implicit leading digit of 1, these bounds can be written as x×(1±2^{53}) or, conventionally, as x×(1±½ε). The value of ε is provided in the
Of course, this means that operations on denormalised numbers will introduce proportionally even greater errors but we shall ignore this fact in our analyses and effectively treat them as if they behave in the same fashion as zero.
If an algorithm really must treat denormalised numbers with the same respect as normalised numbers, it will require much more careful analysis.
The mathematical operations of addition, subtraction, multiplication, division, remainder and square root are required by the IEEE standard to be accurate to within rounding error. Specifically, they must return the correctly rounded representation of the result of performing the actual calculation with real numbers. This means that, if using round to nearest, they will introduce a proportional error no larger than (1±½ε).
Note that because of these accumulated rounding errors, equality comparisons between floating point numbers often behave counterintuitively; values of unlike expressions that should mathematically be equal may have accumulated slightly different rounding errors.
In general, we should prefer to test whether two floating point numbers are similar to each other rather than the same.
We can capture the sensitivity of a function \(f\) to rounding errors in its argument \(x\) with the condition number, given by
This value is approximately equal to the absolute value of the ratio between the relative error of \(f\), \(\epsilon_f\), and that of its argument \(x\), \(\epsilon_x\), as shown in derivation 1. Note that \(f\) is assumed to be exact; the condition number captures the sensitivity of mathematical functions to errors in their arguments, not numerical errors in the implementations of such functions.
As an example, consider the exponential function \(e^x\) whose derivative is equal to \(e^x\) for all \(x\). Its condition number is therefore \(x\) and the absolute value of its relative error is approximately equal to \(x \times \epsilon_x\).
When the condition number is large, a function is said to be poorly conditioned and we cannot trust that its results are accurate to many digits of precision.
We can think of the base \(b\) logarithm of a number \(x\) as the (possibly fractional) position of its most significant digit in that base since \(x = b^{\log_b(x)}\). Consequently, a relative error of \(\epsilon_x\) implies that a value is accurate to \(\log_b(\epsilon_x)\) digits as this is the difference between the positions of the most significant digits of the true value and of the error
Far more worrying is cancellation error which can yield catastrophic loss of precision. When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits.
For example, suppose that we have two values close to \(\pi\), 3.1415 and 3.1416. These values are both accurate to 5 significant figures, but their difference is equal to 0.0001, or 1.0E4, and has just 1 significant figure of accuracy.
Whilst rounding error might sneak up upon us in the end, cancellation error is liable to beat us about the head with a length of two by four.
Note that since cancellation error results from the dramatic sensitivity of the subtraction of nearly equal numbers to the rounding errors in those numbers, it can be captured by the condition number.
For example, the expression \(x1\) has a condition number of \(x/(x1)\). As \(x\) tends to 1, the condition number tends to infinity, reflecting the growing effect of cancellation error on the result.
The poster child of cancellation error is the numerical approximation of differentiation with the forward finite difference. The derivative of a function \(f\) at a point \(x\) can be defined (albeit with more handwaving than a royal procession) as the limit, if one exists, of
The forward finite difference replaces the limit with a very small, but nonzero, \(\delta\) and is a reasonably obvious way to approximate the derivative. It is equally obvious that we should choose \(\delta\) to be as small as possible, right?
Wrong!
To demonstrate why not, consider the function \(e^x\) whose derivative at 1 is trivially equal to \(e\). Program 1 plots the number of accurate binary digits in the finite difference approximation against the position of the most significant binary digit in \(\delta\), with a delay added to the loop for effect.
Clearly, decreasing \(\delta\) works up to a point as indicated by an initial linear relationship between the number of leading zeros and the number of accurate bits. However, this relationship seems to break down beyond \(\delta\) equal to 2^{25} and the best accuracy occurs when \(\delta\) is equal to 2^{26}.
This is suspiciously close to half of the number of bits in the significand of a double precision floating point number. In fact it can be proven, under some simplifying assumptions, that the optimal choice for \(\delta\) is the square root of \(\epsilon\), which has roughly that many leading zeros.
However, to do so requires Taylor's theorem, which we must therefore cover first.
It demonstrates how a sufficiently differentiable function can be approximated within some region to within some error by a polynomial. Specifically, it shows that
Proof of this theorem is beyond the scope of this site so, as averse as I am to unfounded assertions, you shall simply have to take my word for it this time. I will, however, demonstrate it by applying it to the exponential function, whose derivative is equal to itself, with \(x=0\) and \(\delta=1\)
In fact, the error term has exact bounds given by a more accurate statement of Taylor’s theorem
for some \(y\) between \(x\) and \(x + \delta\).
Derivation 2 uses Taylor's theorem to prove that the finite difference has an approximation error of
With some vigorous hand waving, we can ignore the constant factors in these terms, and conclude that since a choice of \(\delta = \sqrt{\epsilon}\) results in them both having the same order of magnitude it is, in some sense, optimal.
Since floating point numbers have a fixed precision, the relative difference would seem to be the preferable measure of similarity. Unfortunately it amplifies the difference between numbers that are close to zero; exactly those numbers that might be suffering from cancellation error.
If the inputs to our calculations are not too many of orders of magnitude away from unity, as will generally be the case given our predisposition for units that ensure measurements are neither too large nor too small, then we should prefer relative differences for results much greater than one and absolute differences for results much smaller than one.
The function
For example, suppose that we wish to calculate the cube of the square root of a number, or equivalently the square root of the cube. Starting out with a value \(x\) accurate to within rounding, we have
Whilst this is a relatively simple example, the fundamental lesson is sound; in order to control errors when using floating point numbers we must plan our calculations with care.
Such unfair criticism of floating point is not particularly unusual, is often unduly concerned with rounding error and hardly ever mentions the vastly more important topic of cancellation error. One can only assume that many computer science graduates have forgotten their numerical computing lectures and have generalised the very specific and predictable failure modes of floating point arithmetic to the rule of thumb that it should be avoided entirely.
These criticisms are generally accompanied by suggestions of alternative arithmetic types that fix the perceived problems with floating point. We shall investigate these in coming posts and we shall learn that if we wish to use computers for arithmetic calculation we shall have to accept the fact that we are going to have to think.
[2] IEEE standard for binary floatingpoint arithmetic. ANSI/IEEE std 7541985, 1985.
Binary floatingpoint numbers are inept at handling decimal fractions, so 0.1 + 0.2 is not equal to 0.3. This is the most frequently reported bug in JavaScript, and it is an intentional consequence of having adopted the IEEE Standard for Binary Floating Point Arithmetic (IEEE 754). This standard is wellsuited for many applications, but it violates most of the things you learned about numbers in middle school.
 Douglas Crockford, JavaScript: The Good Parts
The IEEE 754 floating point standard^{[2]} is, in fact, incredibly well designed and anyone who considers it broken because of the behaviour he describes has thoroughly misplaced their blame.
On The Classification Of Numbers
As programmers we are probably quite aware that integers and floating point numbers have different properties, even if we haven’t spent a great deal of time thinking about their precise nature.However, I suspect that we are somewhat less aware of how they fit into the general mathematical classification of number types. Therefore, before we start looking at the various techniques for representing numbers with computers I should like to explore what exactly it is we mean by number.
The concept of number has been refined throughout history as generations of mathematicians have time and again stumbled across inconsistencies in their understanding.
Figure 1: A Hierarchy Of Numbers
Real  
Rational  Irrational  
Integer  Algebraic  Transcendental 
Traversing this tree from left to right, we more or less recover the sequence of development of our concept of number from prehistory to the modern day.
The story of number begins with the integers, or more accurately the natural numbers; whole numbers that are greater than zero. Animal studies have shown that primates, rats and even some birds have a rudimentary ability to count; presumably using neural circuitry similar to that we use to distinguish at a glance between three and four objects, but that cannot distinguish between nineteen and twenty. It is not unreasonable, therefore, to suppose that a basic awareness of the natural numbers predates man.
Negative numbers are, comparatively, an example of striking modernity having been discovered in India just a few millennia ago.
The integers, as important as they are, are not particularly useful for measurement; the distance between the ziggurat and the brothel is never quite a whole number of cubits, for example. For this task we instead employed the fractions, or rationals; those numbers equal to the ratio of two integers. Note that the rationals are a superset of the integers; every integer is trivially the ratio of itself and 1. For many years it was thought that the rationals comprised the totality of number. Legend has it that a member of the school of Pythagoras discovered that the square root of 2 could not be expressed as a fraction and that his compatriots were so put out by this fact that they drowned him (I shall revisit this in a later post).
The algebraic irrationals are those numbers which are roots of polynomial equations with rational, or equivalently integer, coefficients. By roots we mean those real values, if any, at which the polynomial equates to zero; \(\sqrt{2}\) is a root of the polynomial \(x^22\), for example. Technically, the algebraic numbers are a superset of the rationals since the latter are solutions to linear equations with integer coefficients.
The final breed of numbers, the transcendentals, are the most elusive. These are the numbers which are not solutions of polynomial equations with rational coefficients and include such notable numbers as \(\pi\) and \(e\). They are so difficult to identify that it is still not known whether \(\pi + e\) is transcendental. Despite this, it is known that the transcendentals form the vast majority of numbers; if you were to throw a dart at a line representing the numbers between 0 and 1, you would almost certainly hit a transcendental. To understand why, we need to discuss the mathematics of infinite sets.
Transfinite Cardinals
In the late 19^{th} century Georg Cantor developed the theory of infinite sets. The transfinite cardinals are not, as their name suggests, characters in a Catholic science fiction blockbuster, but are in fact those infinite numbers that denote the size of infinite sets.Cantor named the smallest of the transfinite cardinals, the size of the integers, \(\aleph_0\). This is known as the countable infinity since we can imagine an algorithm that, given infinite time, would step through them sequentially, counting them off one at a time.
He then asked the question of whether or not the rationals were larger than the integers, of whether or not they were uncountable. His proof that they were not is one of the most elegant in all of mathematics.
When we say a set is countable, we strictly mean that it can be put into a one to one correspondence with the nonnegative integers. For example, the integers are countable since we can map from the nonnegative integers to them with the rules
 if n is even, n maps to ½n
 if n is odd, n maps to ½(n+1)
0, 1, 1, 2, 2, 3, 3, ...
which clearly counts through the integers, one at a time.
Cantor laid out the rationals such that the numerator (the number on top of the fraction) was indicated by the column and the denominator (the number on the bottom of the fraction) was indicated by the row, as shown in figure 2.
Figure 2: Cantor's Table Of Rationals
1  2  3  4  5  ...  
1  1/1  2/1  3/1  4/1  5/1  ... 
2  1/2  2/2  3/2  4/2  5/2  ... 
3  1/3  2/3  3/3  4/3  5/3  ... 
4  1/4  2/4  3/4  4/4  5/4  ... 
5  1/5  2/5  3/5  4/5  5/5  ... 
...  ...  ...  ...  ...  ...  ... 
What Cantor realised was that, whilst each row and column stretched on forever and so couldn’t be counted one after the other, the diagonals between the first column of a given row and the first row of the corresponding column were all finite and hence countable. For example, we could iterate over the first row, counting diagonally backwards through the table until we hit the first column yielding the sequence
1/1, 2/1, 1/2, 3/1, 2/2, 1/3, 4/1, 3/2, 2/3, 1/4, ...
If we skip any number we have seen before, we have the sequence
1, 2, 1/2, 3, 1/3, 4, 3/2, 2/3, 1/4, ...
which counts through all of the rationals, one at a time.
So, rather surprisingly, despite there being an infinite number of fractions between any two different integers, the sizes of the set of fractions and the set of integers are in fact equal.
Cantor proceeded to demonstrate that the set of polynomial equations with integer coefficients is also countable and, since each has a finite number of roots, so are the algebraic numbers.
He did this by defining a function, we shall call it \(c\), that takes a polynomial with integer coefficients and returns a positive integer. It operates by adding together the absolute values of the coefficients and the order of the polynomial, the highest power to which the variable is raised, minus one.
For example
\[
c(2x^2  3x + 4) = 2 + 3 + 4 + (21) = 10
\]
Note that we can insist that the term with the highest order is positive, since multiplying a polynomial by minus one doesn’t affect its roots.Cantor realised that every possible value of this function is shared by a finite number of such polynomials. For example, there are just 4 such polynomials for which this function yields 2.
\[
2x\\
x+1\\
x1\\
x^2
\]
So we can count off these polynomials by counting through the positive integers and, for each of them in turn, enumerating the members of the finite set of polynomials for which Cantor’s function returns that value.We are left with the question of whether or not the transcendental numbers are of the same size.
If the transcendental numbers are countable then the real numbers, being the union of both they and the algebraic numbers, must be countable too since we could simply alternate between the sequences of each of them.
Figure 3: Cantor's List Of Reals
\[
0\,.\,x_{00}\,x_{01}\,x_{02}\,x_{03}\,x_{04}\,x_{05}\,...\\
0\,.\,x_{10}\,x_{11}\,x_{12}\,x_{13}\,x_{14}\,x_{15}\,...\\
0\,.\,x_{20}\,x_{21}\,x_{22}\,x_{23}\,x_{24}\,x_{25}\,...\\
0\,.\,x_{30}\,x_{31}\,x_{32}\,x_{33}\,x_{34}\,x_{35}\,...\\
0\,.\,x_{40}\,x_{41}\,x_{42}\,x_{43}\,x_{44}\,x_{45}\,...\\
0\,.\,x_{50}\,x_{51}\,x_{52}\,x_{53}\,x_{54}\,x_{55}\,...
\]
Cantor noted that if the reals were countable we could construct a list of them as they are generated by the mapping from the integers. Figure 3 illustrates what this list might look like for just the numbers between 0 and 1.
Now starting after the decimal point in the first row and moving diagonally down and to the right we can construct a new number
\[
0\,.\,\hat{x}_{00}\,\hat{x}_{11}\,\hat{x}_{22}\,\hat{x}_{33}\,\hat{x}_{44}\,\hat{x}_{55}\,...
\]
where each \(\hat{x}_{ii}\) is specifically chosen to be different to \(x_{ii}\).This number is clearly between 0 and 1, but must differ from every number in the list at no less than one digit. We have thus found a number between 0 and 1 that was not in our list and hence the list is incomplete. It is not, therefore, possible to construct such a list and hence the reals, and consequently the transcendentals, are uncountably infinite. Being more sizable than the other numbers, their cardinal number is denoted by \(\aleph_1\).
Neat, huh?
IEEE 7541985
So now we know the mathematical classification of numbers we are ready to start looking at how we might implement numeric types with computers.The IEEE standard defines floating point numbers to have a format similar to the scientific notation many of us will recognise from our calculators and spreadsheets. In the familiar decimal base 10 this means a number between 1 and 10 multiplied by 10 raised to the power of another number.
For example, the number of days in a year is approximately 365. Dividing by this 100 gives us a number between 1 and 10, namely 3.65. Since 100 is 10 raised to the power of 2, the number of days in the year can be written as
The number that we multiply by the power of 10 is known as the significand and the power of 10 by which we multiply it is known as the exponent.
Since base 10 is rather inconvenient from a computing perspective, IEEE floating point numbers are defined in the binary base 2. Specifically, numbers are defined as
There is, in fact, a special case when we assume the leading digit is 0 rather than 1. This occurs when the exponent takes on its most negative value, yielding the very smallest floating point numbers. Since the leading digits of these numbers, known as subnormal or denormalised numbers, are 0 they have fewer significant digits of accuracy, or equivalently have lower precision. In contrast recall that normal numbers have an implied leading digit of 1 and consequently have the full 24 bits with which to represent the significand.
Figure 4: IEEE 754 Single Precision Floating Point
±a_{1}a_{2}a_{3}...a_{8}b_{1}b_{2}b_{3}...b_{23}  
Binary Exponent (a_{1}...a_{8})  Decimal Exponent  Binary Value 
00000000  0  ±0.b_{1}b_{2}b_{3}...b_{23}×2^{126} 
00000001  1  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{126} 
00000010  2  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{125} 
00000011  3  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{124} 
...  ...  ... 
01111111  127  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{0} 
10000000  128  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{1} 
...  ...  ... 
11111100  252  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{125} 
11111101  253  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{126} 
11111110  254  ±1.b_{1}b_{2}b_{3}...b_{23}×2^{127} 
11111111  255 
±∞ if b=0 NaN otherwise 
In addition to the normal and subnormal numbers, IEEE 754 defines bit patterns to represent the positive and negative infinities and a set of error values for invalid calculations known as the NaNs, for Not a Number.
Figure 4 enumerates the full set of IEEE 754 single precision floating point numbers.
Note that since the significand is finite, floating point numbers are actually a finite subset of the rational numbers and it is vitally important not to confuse them with real numbers.
Double Precision
Double precision floating point numbers have precisely the same layout as single precision floating point numbers, differing only in that they have an 11 bit exponent and a 53 bit significand. Recall that one of the bits in the significand is implied, so that these and the sign bit fill 64 bits.From this point on I shall assume that we're using double precision numbers. Given that JavaScript numbers are of this type I couldn't give examples in single point anyway.
Now that we have covered the mundane implementation details of floating point numbers it is time to start looking at the rather more important topic of their precise behaviour.
Not a Number
NaNs infect any calculation they come into contact with since the result of any operation upon a NaN yields a NaN.Furthermore, NaNs always compare unequal, even to themselves. Every comparison operator other than not equal are similarly always false if they involve NaNs. If you keep this in mind when designing loops and branches, you can ensure that your algorithms will behave predictably in the face of invalid arithmetic operations.
Note that whilst there are many NaNs, JavaScript only supports one of them.
Overflow
Floating point numbers overflow in a satisfyingly predictable way, namely to plus or minus infinity.Dividing any finite number by an infinity will yield zero and dividing any nonzero number by zero will yield an infinity of the same sign as that number. Adding or subtracting any finite number to or from an infinity will result in that infinity.
These properties mean that many numerical algorithms can implicitly cope with numerical overflow since arithmetic operations and comparisons are internally consistent and, accompanied by some vigorous hand waving, mathematically sound.
Note that dividing 0 by 0, dividing an infinity by an infinity, multiplying an infinity by 0 and subtracting an infinity from itself all yield NaNs.
Rounding Error
One of the most common surprises facing the programmer using floating point arithmetic stems from the fact that there are a fixed number of bits with which to represent the significand.We can illustrate the problem by considering decimal notation. Say we restrict ourselves to 4 figures after the decimal point. Assuming that we have chosen the closest number in this representation, x, to a given number we can only say that its true value lies somewhere within x±5E5. For example, given π to 4 decimal places, 3.1416, we can only state with certainty that it lies between 3.14155 and 3.14165.
Similarly, for an IEEE double precision floating point number x with an exponent of a, we can only be sure that the true value lies between x±2^{a53}. Conveniently, since normalised floating point numbers have a implicit leading digit of 1, these bounds can be written as x×(1±2^{53}) or, conventionally, as x×(1±½ε). The value of ε is provided in the
ak
library with ak.EPSILON
.Of course, this means that operations on denormalised numbers will introduce proportionally even greater errors but we shall ignore this fact in our analyses and effectively treat them as if they behave in the same fashion as zero.
If an algorithm really must treat denormalised numbers with the same respect as normalised numbers, it will require much more careful analysis.
The mathematical operations of addition, subtraction, multiplication, division, remainder and square root are required by the IEEE standard to be accurate to within rounding error. Specifically, they must return the correctly rounded representation of the result of performing the actual calculation with real numbers. This means that, if using round to nearest, they will introduce a proportional error no larger than (1±½ε).
Note that because of these accumulated rounding errors, equality comparisons between floating point numbers often behave counterintuitively; values of unlike expressions that should mathematically be equal may have accumulated slightly different rounding errors.
In general, we should prefer to test whether two floating point numbers are similar to each other rather than the same.
Condition Number
It is important to note that the rounding guarantees of the IEEE arithmetic operations do not take into account any rounding error in their arguments.
Derivation 1: The Condition Number
Given a real number \(x\), a floating point approximation of it \(\hat{x}\) has a relative error of
\[
\epsilon_x = \left\frac{x  \hat{x}}{x}\right
\]
and \(f(\hat x)\) has a relative error of
\[
\epsilon_f = \left\frac{f(x)  f(\hat{x})}{f(x)}\right
\]
Dividing by the latter by the former we have
\[
\begin{align*}
\left\frac{f(x)  f(\hat{x})}{f(x)}\right \Big/ \left\frac{x\hat{x}}{x}\right
&= \left\frac{f(x)  f(\hat{x})}{f(x)} \Big/ \frac{x\hat{x}}{x}\right\\
&= \left\frac{f(x)  f(\hat{x})}{f(x)} \times \frac{x}{x\hat{x}}\right\\
&= \left\frac{f(x)  f(\hat{x})}{x\hat{x}} \times \frac{x}{f(x)}\right\\
&\approx \leftf'(x) \times \frac{x}{f(x)}\right\\
&= \left\frac{f'(x)}{f(x)} \times x\right
\end{align*}
\]
\[
\kappa_f(x) = \left \frac{f'(x)}{f(x)} \times x \right
\]
where \(f'\) is the derivative of \(f\) and the vertical bars mean the absolute value of the expression between them.This value is approximately equal to the absolute value of the ratio between the relative error of \(f\), \(\epsilon_f\), and that of its argument \(x\), \(\epsilon_x\), as shown in derivation 1. Note that \(f\) is assumed to be exact; the condition number captures the sensitivity of mathematical functions to errors in their arguments, not numerical errors in the implementations of such functions.
As an example, consider the exponential function \(e^x\) whose derivative is equal to \(e^x\) for all \(x\). Its condition number is therefore \(x\) and the absolute value of its relative error is approximately equal to \(x \times \epsilon_x\).
When the condition number is large, a function is said to be poorly conditioned and we cannot trust that its results are accurate to many digits of precision.
We can think of the base \(b\) logarithm of a number \(x\) as the (possibly fractional) position of its most significant digit in that base since \(x = b^{\log_b(x)}\). Consequently, a relative error of \(\epsilon_x\) implies that a value is accurate to \(\log_b(\epsilon_x)\) digits as this is the difference between the positions of the most significant digits of the true value and of the error
\[
\begin{align*}
\log_b(x)  \log_b(\epsilon_x \times x) &= \log_b(x)  \log_b(\epsilon_x)  \log_b(x)\\
&= \log_b(\epsilon_x)
\end{align*}
\]
Noting that
\[
\epsilon_f = \epsilon_x \times \kappa_f(x)
\]
and taking logs and negating
\[
\log_b(\epsilon_f) = \log_b(\epsilon_x)  \log_b(\kappa_f(x))
\]
we can see that the logarithm of the condition number represents how many fewer digits of precision the result of the function has than its argument.
Cancellation Error
Given enough operations or poorly conditioned functions, rounding error can significantly affect the result of a calculation, but it is by no means the worst of our troubles.Far more worrying is cancellation error which can yield catastrophic loss of precision. When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits.
For example, suppose that we have two values close to \(\pi\), 3.1415 and 3.1416. These values are both accurate to 5 significant figures, but their difference is equal to 0.0001, or 1.0E4, and has just 1 significant figure of accuracy.
Whilst rounding error might sneak up upon us in the end, cancellation error is liable to beat us about the head with a length of two by four.
Note that since cancellation error results from the dramatic sensitivity of the subtraction of nearly equal numbers to the rounding errors in those numbers, it can be captured by the condition number.
For example, the expression \(x1\) has a condition number of \(x/(x1)\). As \(x\) tends to 1, the condition number tends to infinity, reflecting the growing effect of cancellation error on the result.
The poster child of cancellation error is the numerical approximation of differentiation with the forward finite difference. The derivative of a function \(f\) at a point \(x\) can be defined (albeit with more handwaving than a royal procession) as the limit, if one exists, of
\[
\frac {f(x + \delta)  f(x)} \delta
\]
as \(\delta\) tends to zero.The forward finite difference replaces the limit with a very small, but nonzero, \(\delta\) and is a reasonably obvious way to approximate the derivative. It is equally obvious that we should choose \(\delta\) to be as small as possible, right?
Wrong!
To demonstrate why not, consider the function \(e^x\) whose derivative at 1 is trivially equal to \(e\). Program 1 plots the number of accurate binary digits in the finite difference approximation against the position of the most significant binary digit in \(\delta\), with a delay added to the loop for effect.
Program 1: Finite Difference Errors



Clearly, decreasing \(\delta\) works up to a point as indicated by an initial linear relationship between the number of leading zeros and the number of accurate bits. However, this relationship seems to break down beyond \(\delta\) equal to 2^{25} and the best accuracy occurs when \(\delta\) is equal to 2^{26}.
This is suspiciously close to half of the number of bits in the significand of a double precision floating point number. In fact it can be proven, under some simplifying assumptions, that the optimal choice for \(\delta\) is the square root of \(\epsilon\), which has roughly that many leading zeros.
However, to do so requires Taylor's theorem, which we must therefore cover first.
Taylor's Theorem
For those who seek to develop numerical algorithms, Taylor’s theorem is perhaps the single most useful result in mathematics.It demonstrates how a sufficiently differentiable function can be approximated within some region to within some error by a polynomial. Specifically, it shows that
\[
f(x+\delta) = f(x) + \delta \, f'(x) + \frac 1 2 \delta^2 \, f''(x) + ... + \frac 1 {n!} \delta^n \, f^{(n)}(x) + O\left(\delta^{n+1}\right)
\]
where we are using the notational convention of \(f'(x)\) for the derivative of \(f\) evaluated at \(x\), \(f''(x)\) for the second derivative and \(f^{(n)}(x)\) for the \(n^{th}\) derivative. The exclamation mark is the factorial of, or the product of every integer from 1 up to and including, the value preceding it. \(O(\delta^{n+1})\) is an error term of order \(\delta^{n+1}\) or, in other words, is for any given \(f\) and \(x\) is equal to some constant multiple of \(\delta^{n+1}\). By sufficiently differentiable we mean that all of the derivatives of \(f\) up to the \((n+1)^{th}\) must exist.Proof of this theorem is beyond the scope of this site so, as averse as I am to unfounded assertions, you shall simply have to take my word for it this time. I will, however, demonstrate it by applying it to the exponential function, whose derivative is equal to itself, with \(x=0\) and \(\delta=1\)
\[
e^1 \approx e^0 + 1 \times e^0 + \frac 1 2 \times 1^2 \times e^0 + ... + \frac 1 {n!} \times 1^n \times e^0
\]
or
\[
e \approx 1 + 1 + \frac 1 2 + ... + \frac 1 {n!}
\]
The difference between \(e\) and such approximations to it for various \(n\) are illustrated by program 2.
Program 2: Approximation Error



In fact, the error term has exact bounds given by a more accurate statement of Taylor’s theorem
\[
f(x+\delta) = f(x) + \delta \, f'(x) + \frac 1 2 \delta^2 \, f''(x) + ... + \frac 1 {n!} \delta^n \, f^{(n)}(x) + \frac 1 {(n+1)!} \delta^{n+1} \, f^{(n+1)}(y)
\]
Derivation 2: Finite Difference Error
By Taylor's theorem, we have
Rearranging yields
\[
f(x+\delta) = f(x) + \delta \, f'(x) + O(\delta^2)
\]
Assuming that we can exactly represent both \(x\) and \(x+\delta\) and that \(f\) is accurate to machine precision, the floating point result of the finite difference will be
\[
(1 \pm \tfrac 1 2 \epsilon)\frac{(1 \pm \tfrac 1 2 \epsilon)\left[(1 \pm \tfrac 1 2 \epsilon) f(x+\delta)  (1 \pm \tfrac 1 2 \epsilon) f(x)\right]} {\delta}
\]
since each arithmetic operation introduces a rounding error.Rearranging yields
\[
(1 \pm \tfrac 1 2 \epsilon)^2 \frac{(1 \pm \tfrac 1 2 \epsilon) f(x+\delta)  (1 \pm \tfrac 1 2 \epsilon) f(x)} {\delta}
\\
\; = (1 \pm \tfrac 1 2 \epsilon)^2 \frac{(1 \pm \tfrac 1 2 \epsilon) (f(x)+\delta\,f'(x)+O(\delta^2))  (1 \pm \tfrac 1 2 \epsilon) f(x)} {\delta}
\\
\; = (1 \pm \tfrac 1 2 \epsilon)^2 \frac{\pm \epsilon \, f(x) + \delta \, f'(x) \pm \tfrac 1 2 \epsilon \, \delta \, f'(x) + O(\delta^2)} {\delta}
\\
\; = (1 \pm \tfrac 1 2 \epsilon)^2 \, \left(f'(x) + O(\delta) + O\left( \frac \epsilon \delta \right) + O(\epsilon) \right)
\\
\; = f'(x) + O(\delta) + O\left( \frac \epsilon \delta \right) + O(\epsilon)
\]
Derivation 2 uses Taylor's theorem to prove that the finite difference has an approximation error of
\[
O(\delta) + O\left( \frac \epsilon \delta \right) + O(\epsilon).
\]
If \(\delta\) is too large the \(O(\delta)\) term will introduce significant errors into the approximation, whereas if it is too small the \(O(\epsilon / \delta)\) term will do so instead.With some vigorous hand waving, we can ignore the constant factors in these terms, and conclude that since a choice of \(\delta = \sqrt{\epsilon}\) results in them both having the same order of magnitude it is, in some sense, optimal.
Comparing Floats
Comparing floating point numbers isn't as straightforward as comparing integers since we must take into account the errors that inevitably creep into our calculations. Such errors make it rather difficult to nail down exactly what we mean when we ask whether or not two floating point numbers are equal so, in general, we prefer to ask whether or not they are similar. The question is whether we should measure similarity with absolute or relative differences.Since floating point numbers have a fixed precision, the relative difference would seem to be the preferable measure of similarity. Unfortunately it amplifies the difference between numbers that are close to zero; exactly those numbers that might be suffering from cancellation error.
If the inputs to our calculations are not too many of orders of magnitude away from unity, as will generally be the case given our predisposition for units that ensure measurements are neither too large nor too small, then we should prefer relative differences for results much greater than one and absolute differences for results much smaller than one.
The function
\[
d(x, y) = \frac {xy} {1 + \min(x, y)}
\]
tends to the absolute difference when either of its arguments is much smaller than one and to the relative difference when both are much larger than one. An implementation of this function is provided by ak.diff
.
Order Of Execution
An important thing to note about floating point arithmetic is that the order in which operations are performed can have a material effect upon the accuracy of the result.For example, suppose that we wish to calculate the cube of the square root of a number, or equivalently the square root of the cube. Starting out with a value \(x\) accurate to within rounding, we have
\[
(1 \pm \tfrac 1 2 \epsilon)\,x
\]
Next, we take the square root, introducing another rounding error
\[
(1 \pm \tfrac 1 2 \epsilon)\left((1 \pm \tfrac 1 2 \epsilon)\,x\right)^\frac 1 2 = (1 \pm \tfrac 1 2 \epsilon)^\frac 3 2\,x^\frac 1 2
\]
Finally we multiply it by itself twice to recover the cube, introducing two more rounding errors
\[
\begin{align*}
(1 \pm \tfrac 1 2 \epsilon)^2 \, \left((1 \pm \tfrac 1 2 \epsilon)^\frac 3 2\,x^\frac 1 2\right)^3
&= (1 \pm \tfrac 1 2 \epsilon)^\frac {13} 2 \, x^\frac 3 2\\
&= (1 \pm 3{\tfrac 1 4} \epsilon) \, x^\frac 3 2 + O(\epsilon^2)
\end{align*}
\]
Now let’s try it in reverse order. This time we perform the two multiplications first yielding
\[
(1 \pm \tfrac 1 2 \epsilon)^2 \, \left((1 \pm \tfrac 1 2 \epsilon) \, x\right)^3
= (1 \pm \tfrac 1 2 \epsilon)^5 \, x^3
\]
Secondly we take the square root, introducing one additional rounding error
\[
\begin{align*}
(1 \pm \tfrac 1 2 \epsilon) \, \left((1 \pm \tfrac 1 2 \epsilon)^5 \, x^3\right)^\frac 1 2
&= (1 \pm \tfrac 1 2 \epsilon)^\frac 7 2 \, x^\frac 3 2\\
&= (1 \pm 1{\tfrac 3 4} \epsilon) \, x^\frac 3 2 + O(\epsilon^2)
\end{align*}
\]
Surprisingly, this second version of the calculation has accumulated just a little more than half the error that the first had.Whilst this is a relatively simple example, the fundamental lesson is sound; in order to control errors when using floating point numbers we must plan our calculations with care.
You're Going To Have To Think!
A few years ago I read a comment on a prominent IT internet forum proposing that scientists should not be trusted to implement their own computer models since, presumably unlike the comment’s author, they are not trained in computer science and are consequently likely to make rookie mistakes. The given example of such a mistake was using floating point arithmetic to calculate the average of twenty or so values in the low twenties which was ironic, since a computer scientist should be able to demonstrate that with double precision the result would be correct to about 15 decimal digits of precision!Such unfair criticism of floating point is not particularly unusual, is often unduly concerned with rounding error and hardly ever mentions the vastly more important topic of cancellation error. One can only assume that many computer science graduates have forgotten their numerical computing lectures and have generalised the very specific and predictable failure modes of floating point arithmetic to the rule of thumb that it should be avoided entirely.
These criticisms are generally accompanied by suggestions of alternative arithmetic types that fix the perceived problems with floating point. We shall investigate these in coming posts and we shall learn that if we wish to use computers for arithmetic calculation we shall have to accept the fact that we are going to have to think.
References
[1] Crockford, D. JavaScript: The Good Parts, O'Reilly, 2008.[2] IEEE standard for binary floatingpoint arithmetic. ANSI/IEEE std 7541985, 1985.
Based upon an article I wrote for ACCU's Overload magazine.
October 14, 2013 5:11 PM
Good article!
But the cardinality of the reals being equal to aleph_1 isn't (always) true. In fact, it is a statement which is neither provable nor disprovable.
(some of these captchas are really hard :)
October 15, 2013 8:30 PM
The only undecidable proposition that I'm aware of regarding transfinite cardinals is the continuum hypothesis which states that no set has a cardinality greater than \(\aleph_0\) and less than \(\aleph_1\). That the cardinality of the reals is \(\aleph_1\) is proven.
Of course, if you're an intuitionist (I'm reading between the lines a little here) then you'd reject my preferred axiomatic system and we're not going to agree upon that. Being a computer programmer, formalism is a much more appealing philosophy for me ;)
I'm not keen on the captchas either. Unfortunately I'm getting a lot of posts hawking fake designer handbags of all things :/ I'm working on a blocklist, so hopefully the captchas are a temporary inconvenience.
October 16, 2013 8:54 PM
It's just occurred to me that you were using \(\aleph_1\) to represent the first transfinite cardinal after \(\aleph_0\) rather than the provably uncountable transfinite cardinal of the reals.
I would expect the notation to keep the meaning of \(\aleph_0\) as the countable infinity and \(\aleph_1\) as the uncountable infinity and use something like \(\aleph_{0.5}\) for the countableish infinity, hence my misunderstanding.