# On The Shoulders Of Gradients

In the last post[1] we implemented the bisection method for the numerical approximation of the inverse of a function. Now this is an extremely stable algorithm, but annoyingly not a particularly efficient one.
Fortunately, the great Isaac Newton is coming to our rescue with a significantly more efficient approach; given the derivative $$f^\prime$$ of a function $$f$$ we can approximate it in the neighbourhood of a value $$x$$ with the straight line
Figure 1: Linear Approximation

Figure 2: The Next Step
$f(x+\delta) \approx f(x) + \delta f^\prime(x)$
as illustrated by figure 1 and which, if you've been paying attention, you will recognise as the first two terms of the Taylor series of $$f$$.

Now, if we know the values of $$f(x)$$ and $$f^\prime(x)$$ and seek $$\delta$$ such that $$f(x+\delta)=y$$ we have
\begin{align*} y &\approx f(x) + \delta f^\prime(x)\\ \delta &\approx \frac{y - f(x)}{f^\prime(x)} \end{align*}
Being an approximation this is unlikely to hit the inverse exactly, as can be seen in figure 1 in which the linear approximation has overshot a zero of the function. But that's OK, we can simply replace $$x$$ with $$x+\delta$$ and try again as illustrated by figure 2 in which, whilst we have overshot the zero again, we are clearly getting closer to it.

### Newton's Method

This iterative scheme is known, appropriately enough, as Newton's method and isn't really that much more complicated than the bisection method. It's certainly simple enough that we can find $$x$$ such that $$x^2=2$$ by hand again. Noting that the derivative of $$x^2$$ is $$2x$$, at each step we should update $$x$$ with
$x + \frac{2 - x^2}{2x} \to x$
or, rearranging the terms, with
$\frac{x}{2} + \frac{1}{x}\to x$
Starting at $$x=2$$ and working to seven significant figures, we proceed with

$$i$$      $$x$$      $$\frac{x}{2} + \frac{1}{x}$$      $$\left(\frac{x}{2} + \frac{1}{x}\right)^2$$
1      2.000000      1.500000      2.250000
2      1.500000      1.416667      2.006945
3      1.416667      1.414216      2.000007
4      1.414216      1.414214      2.000001

which you can easily confirm if you happen to have a calculator to hand.
In our worked example of the bisection method, despite using just three significant figures and starting much closer to the inverse with a bracket of $$[1.3, 1.6]$$, we took five steps to converge; quite the testament to the improvement that Newton hath wrought!

Derivation 1: Newton's Method's Convergence Rate
If there is a point at which the function has the value $$y$$ then, for some unknown $$\delta$$ and $$c$$, we must have
$y = f(x+\delta) = f(x) + \delta f^\prime(x) + c\delta^2$
Newton's method approximates $$\delta$$ with
$\Delta = \frac{y - f(x)}{f^\prime(x)}$
Substituting the former into the latter yields
\begin{align*} \Delta &= \frac{f(x) + \delta f^\prime(x) + c\delta^2 - f(x)}{f^\prime(x)}\\ &= \frac{\delta f^\prime(x) + c\delta^2}{f^\prime(x)}\\ &= \delta + \frac{c}{f^\prime(x)}\delta^2 \end{align*}
and so after we take the step the distance to the inverse will be
$\frac{c}{f^\prime(x)}\delta^2$
So if our starting point was sufficiently close to $$y$$, or in other words $$\delta$$ was sufficiently small, and the constant $$c / f^\prime(x)$$ isn't too large, then after the step the approximation will be accurate to roughly twice as many digits as it was before; a property known as quadratic convergence.
To understand why Newton's method is so efficient, derivation 1 uses Taylor's theorem to show that, given some stability conditions, once we get close to the inverse each step will roughly double the number of digits of accuracy of the approximation.

Noting that the square root of two equals $$1.414214$$ to seven significant figures, we can see this behaviour in our worked example. Representing the number of digits to which the approximation equals the correct value with $$n$$, we have

$$i$$      $$\frac{x}{2} + \frac{1}{x}$$      $$n$$
1      1.500000      1
2      1.416667      3
3      1.414216      6
4      1.414214      7

Clearly the number of accurate digits more or less doubles at each step, at least right up until we reach the limit of precision.

### If It Seems Too Good To Be True...

Now this is an extremely desirable property of the algorithm, but unfortunately it does rather depend upon starting somewhere close to the actual inverse and the aforementioned stability conditions. We can demonstrate a failure of Newton's method by using it to find the value of the inverse of the cubic function
Figure 3: -x3 + 3x2 - x + 2
$f(x) = -x^3 + 3x^2 - x + 2$
at one, which falls a little below three, as illustrated by figure 3. The derivative of this function is
$f^\prime(x) = -3x^2 + 6x - 1$
giving us an update step of
$x + \frac{1 + x^3 - 3x^2 + x - 2}{-3x^2 + 6x - 1} \to x$
or
$\frac{2x^3 - 3x^2 + 1}{3x^2 - 6x + 1} \to x$
Starting at $$x=1.5$$ and working again to seven figures, we have

$$i$$      $$x$$      $$\frac{2x^3 - 3x^2 + 1}{3x^2 - 6x + 1}$$      $$f\left(\frac{2x^3 - 3x^2 + 1}{3x^2 - 6x + 1}\right)$$
1      1.500000      -0.800000      5.232000
2      -0.800000      -0.251813      2.458010
3      -0.251813      0.287969      1.936929
4      0.287969      -1.667897      16.653428
5      -1.667897      -0.859061      5.706993
6      -0.859061      -0.296584      2.586558
7      -0.296584      0.224729      1.915431
8      0.224729      -4.425323      151.839015

The problem is that, having started too far away from the inverse, the gradient of the function implies that it lies to the left rather than to the right. Unfortunately, the function reverses direction before crossing the target value of one; specifically at around $$0.18$$ where the derivative is equal to zero. Newton's method begins to converge on this turning point and, when it gets close to it and the derivative gets very small, takes a massive step away from the current estimate, as happened in steps $$4$$ and $$8$$, throwing away all of our hard work.
Eventually our luck may change and an estimate will end up just below the turning point, leading to a massive step to the right after which the algorithm will converge upon the correct value. This is not guaranteed, however, and Newton's method can get stuck in an unending cycle, never to converge. Which is somewhat disappointing, to be perfectly honest.

### Be Responsible And Get Yourself A Hybrid

Given that the bisection method doesn't suffer from this problem, it is tempting to use a hybrid approach to get the best of both algorithms. Specifically, starting with an initial bracket, we shall take a step of Newton's method from the endpoint closest to the target value of the function. If that step falls within the bracket we shall update it as we did for the bisection method, replacing the endpoint at which the function falls on the same side of the target. If not, we shall simply take a step of the bisection method instead.

Figure 4: An Unfortunate Starting Point
Now this is an example of a safeguarded[2] inversion algorithm and solves the problem of Newton's method flying off into the wild blue yonder, but unfortunately there is another problem that we must address; if the gradient at the starting point is significantly steeper than that at the inverse, Newton's method will take a much smaller step than is required, as illustrated by figure 4.
In such situations, Newton's method will take much longer to converge than the bisection method, so we should like to switch to the latter in this event too. We can do so by taking an additional bisection method step if Newton's method doesn't sufficiently reduce the difference between the value of the function at the estimated inverse and the target value.

### ak.newtonInverse

This is the scheme that we shall use for our implementation, ak.newtonInverse, given in listing 1.

Listing 1: ak.newtonInverse
ak.newtonInverse = function(f, threshold) {
threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
: Math.abs(threshold);

if(isNaN(threshold)) {
throw new Error('invalid convergence threshold in ak.newtonInverse');
}

return function(y, hint) {
return inverse(f, y, ak.bracketInverse(f, y, hint), threshold);
};
};


This has exactly the same structure as ak.bisectInverse, providing a default convergence threshold if one isn't provided and returning a function that uses ak.bracketInverse to search for an initial bracket and forwards to the inverse function given in listing 2.

Listing 2: inverse
function inverse(f, y, x, eps) {
var fx = [f(x[0]), f(x[1])];
var i  = Math.abs(fx[0][0]-y)<Math.abs(fx[1][0]-y) ? 0 : 1;
var m, fm, bi, fn;

if(ak.diff(fx[i][0], y)<=eps) return x[i];

if(fx[0][0]>y) {
m =  x[0];  x[0] =  x[1];  x[1] = m;
m = fx[0]; fx[0] = fx[1]; fx[1] = m;
i = 1-i;
}

m  = x[i];
fm = fx[i];

while(ak.diff(x[0], x[1])>eps) {
m += (y-fm[0])/fm[1];
bi = m<=Math.min(x[0], x[1]) || m>=Math.max(x[0], x[1]);

if(!bi) {
fn = f(m);
bi = Math.abs(fn[0]-y)>Math.abs(fm[0]-y)/2;
update(x, fx, m, fn, y, eps);
}

if(bi && x[0]!==x[1]) bisect(f, x, fx, y, eps);

i  = y-fx[0][0]<fx[1][0]-y ? 0 : 1;
m  = x[i];
fm = fx[i];
}
return m;
}


The first thing that we should note is that the function f is expected to return an array whose first element is the value of the function and the second is the value of it's derivative. That this is exactly what's returned by the derivs method of our ak.surreal type[3] is no coincidence, and is the reason why the implementation of ak.bracketInverse treated the first element of an array as equivalent to a value.

As we did for the bisection method, we first check that the target y is valid and bail out if the ak.diff[4] normalised difference between it and the closest value of the function at the bracket endpoints, which we've already indexed with i, falls within the convergence threshold. Note that we get the function's value from the first element of the array fx[i].
Next, after ensuring that the first element of the bracket lies below the target and that the second lies above it, we start applying the hybrid algorithm. The first thing we do in the body of the loop is take a Newton's method step from the closest endpoint, getting the derivative from the second element of the array fm. We use bi to record whether or not it falls outside of the bracket and that we should or should not consequently perform a bisection step instead. If it doesn't, we use bi to record whether the step failed to sufficiently reduce the distance to the target, addressing the second problem that we identified above, and then update the bracket.
If bi indicates that a bisection method step is required for either reason, we take one with bisect and finally, we update the index of the closest endpoint of the bracket to the target.

The update and bisect functions are much the same as those we used for the bisection method, as illustrated by listing 3.

Listing 3: update And bisect
function update(x, fx, m, fm, y, eps) {
var i = fm[0]>y ? 1 : 0;

if(ak.diff(fm[0], y)>eps) {
x[i]  = m;
fx[i] = fm;
}
else {
x[0] = x[1] = isNaN(fm[0]) ? ak.NaN : m;
}
}

function bisect(f, x, fx, y, eps) {
var m = x[0]!==x[1] ? x[0]/2 + x[1]/2 : x[0];

if(m===x[0] || m===x[1]) x[0] = x[1] = y-fx[0][0]<fx[1][0]-y ? x[0] : x[1];
else update(x, fx, m, f(m), y, eps);
}


Once again, the update function bails out by setting both endpoints of the bracket to the new value if the value of the function is sufficiently close to the target, terminating the main loop.
The bisect function again avoids overflow by adding halfs of the endpoints to find the midpoint and terminates in a similar fashion to update in the event that it is equal to either to within rounding error. The biggest difference between this and our previous implementation is that it calls update itself rather than leave it to the body of the loop.

The implementation of ak.newtonInverse can be found in NewtonInverse.js and its use is demonstrated by program 1, which, as we did for ak.bisectInverse, approximates the inverse of the exponential function.

Program 1: ak.newtonInverse

Note that we are using the derivs method of ak.surreal to automate the calculation of the value/derivative array that ak.newtonInverse expects the function to return.
It's somewhat difficult to see that this algorithm is indeed more efficient than ak.bisectInverse given that they both return so quickly, but it does at least demonstrate its accuracy. Of course, I would again urge you to try it out on a few other functions to convince yourself that it is an effective general purpose approximation of the inverse.

Now, whilst ak.surreal makes the accurate numerical approximation of derivatives reasonably painless, it is not always convenient or, in fact, appropriate. For one thing, if we're numerically approximating a function then there's a good chance that the surreal version of it will not accurately approximate the derivatives.
Ideally then, we should like something akin to Newton's method, but which doesn't explicitly require derivatives. Fortunately, just such an algorithm exists and we shall take a look at it in the next post.

### References

[1] Rooting Around For Answers, www.thusspakeak.com, 2014.

[2] Gill, P., Murray, W. & Wright, M. Practical Optimization, Academic Press, 1981.

[3] You're Going To Have To Think! Why Automatic Differentiation Won't Cure Your Calculus Blues., www.thusspakeak.com, 2014.

[4] Climbing Mount Impractical, www.thusspakeak.com, 2013.