That Shrinking Feeling


In the last few posts I described the bisection method[1], Newton's method[2] and the secant method[3] for numerically approximating the inverses of functions. Specifically, given a function \(f\) and some target value \(y\) these algorithms seek some argument \(x\) such that \(f(x)=y\), at least to within some given error threshold.

Related to the problem of inverting functions is the problem of minimising them, in which we seek arguments at which functions return their least values. Specifically, given a function \(f\) and an interval \([a, b]\) we seek some \(x \in [a, b]\) such that
\[ \forall y \in [a, b] \quad f(y) \geqslant f(x) \]
Recall that \(\in\) means within and \(\forall\) means for all.

Figure 1: A Local Minimum
An additional difficulty that minimisation presents is that we can't generally know in advance what the least value of the function actually is. As a consequence we must usually content ourselves with an argument \(x\) at which the function takes a value no greater than it does at all nearby arguments, as illustrated by figure 1, in which the circled point lies lower than those immediately to its left and right but higher than many of those falling past the peak to its left.

Such arguments are known as local minima and are defined by
\[ \exists \epsilon>0 \quad \forall y \in [x-\epsilon, x+\epsilon] \quad f(y) \geqslant f(x) \]
where \(\exists\) means there exists.
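
For a concrete example, consider \(f(x) = x^3 - 3x\) on the interval \([-3, 3]\); its derivative \(3x^2-3\) vanishes at \(x = \pm 1\) and
\[ f(1) = -2 \quad f(-3) = -18 \]
so \(x=1\) is a local minimum even though the function takes a far lesser value at the interval's lower bound.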

Keep That Bracket Down

We can take some small comfort from the fact that we can at least guarantee that we'll find a local minimum by maintaining a bracket, in much the same way that we did when inverting functions with bisection.

Now the definition of a bracket for minimisation is slightly different to that for inversion in that it is comprised of three points \(a\), \(b\) and \(c\) satisfying
Figure 2: Bracketing A Minimum
\[ \begin{align*} b &\in [a, c]\\ f(b) &\leqslant f(a)\\ f(b) &\leqslant f(c) \end{align*} \]
which must contain a minimum between its lower and upper bounds, as illustrated by figure 2.
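
To make these conditions concrete, a trivial test for them might look like the following sketch; note that isBracket is a purely illustrative, hypothetical helper and plays no part in the implementations given below

function isBracket(f, a, b, c) {
 // b must lie between a and c, in either order
 var between = (a<=b && b<=c) || (c<=b && b<=a);
 return between && f(b)<=f(a) && f(b)<=f(c);
}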

To update the bracket, we bisect the interval defined by the midpoint and the endpoint with the lesser function value, which we can assume is \(a\) since we can simply swap the symbols \(a\) and \(c\) if not. Labelling the midpoint of \([a, b]\) with \(d\), one of the two following outcomes must occur
\[ \begin{align*} f(d) &\leqslant f(b)\\ f(d) &> f(b) \end{align*} \]
Figure 3: f(d) ≤ f(b)

Figure 4: f(d) > f(b)
In the first case we have
\[ \begin{align*} d &\in [a, b]\\ f(d) &\leqslant f(a)\\ f(d) &\leqslant f(b) \end{align*} \]
and so \(a\), \(d\) and \(b\) form a new bracket, as illustrated by figure 3.
In the second case we have
\[ \begin{align*} b &\in [d, c]\\ f(b) &< f(d)\\ f(b) &\leqslant f(c) \end{align*} \]
and a new bracket is given by \(d\), \(b\) and \(c\), as illustrated by figure 4.
Crucially, in both cases the new bracket is smaller than the original and consequently repeatedly bisecting the larger of the intervals \([a,b]\) and \([b,c]\) must converge upon a local minimum. Note that if we begin with \(b\) equal to the midpoint of \([a,c]\) then by alternating between the left and right intervals in the bracket we can be sure to bisect the larger of them at each step.
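
Expressed in code, a single update of the bracket might look something like the following minimal sketch, in which, as a simplifying assumption, \(a\) and \(c\) have already been swapped if necessary so that \(f(a) \leqslant f(c)\)

function step(f, a, b, c) {
 var d = a/2 + b/2;               // bisect [a, b]
 if(f(d)<=f(b)) return [a, d, b]; // the first case, illustrated by figure 3
 return [d, b, c];                // the second case, illustrated by figure 4
}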

As was the case with inversion, this algorithm is simple enough that we can demonstrate it with a worked example: finding the minimum of \(x^2\). The points \(a=-2\), \(b=1\) and \(c=4\) satisfy the conditions for a bracket
\[ \begin{align*} b &\in [a, c]\\ b^2 &\leqslant a^2\\ b^2 &\leqslant c^2 \end{align*} \]
We begin by bisecting the interval \([a, b]\). Working to two decimal places in the points, this gives a first step of

\(i\)     \(a \; \left(a^2\right)\)     \(b \; \left(b^2\right)\)     \(c \; \left(c^2\right)\)     \(\mathrm{bisect}\)     \(d \; \left(d^2\right)\)     \(\therefore\)
\(1\)     \(-2.00 \; (4.0000)\)     \(\phantom-1.00 \; (1.0000)\)      \(\phantom-4.00 \; (16.0000)\)      \([a,b]\)      \(-0.50 \; (0.2500)\)     \(b \to c, \; d \to b\)

which, since \(d^2 < b^2\), means that we should replace our bracket with \(a\), \(d\) and \(b\), as shown in the \(\therefore\) column.
In the next step we bisect \([b, c]\). Taking this and the following steps yields

\(i\)     \(a \; \left(a^2\right)\)     \(b \; \left(b^2\right)\)     \(c \; \left(c^2\right)\)     \(\mathrm{bisect}\)     \(d \; \left(d^2\right)\)     \(\therefore\)
\(2\)     \(-2.00 \; (4.0000)\)     \(-0.50 \; (0.2500)\)     \(\phantom-1.00 \; (1.0000)\)      \([b,c]\)     \(\phantom-0.25 \; (0.0625)\)     \(b \to a, \; d \to b\)
\(3\)     \(-0.50 \; (0.2500)\)     \(\phantom-0.25 \; (0.0625)\)     \(\phantom-1.00 \; (1.0000)\)      \([a,b]\)     \(-0.12 \; (0.0144)\)     \(b \to c, \; d \to b\)
\(4\)     \(-0.50 \; (0.2500)\)     \(-0.12 \; (0.0144)\)     \(\phantom-0.25 \; (0.0625)\)      \([b,c]\)     \(\phantom-0.06 \; (0.0036)\)     \(b \to a, \; d \to b\)
\(5\)     \(-0.12 \; (0.0144)\)     \(\phantom-0.06 \; (0.0036)\)     \(\phantom-0.25 \; (0.0625)\)      \([a,b]\)     \(-0.03 \; (0.0009)\)     \(b \to c, \; d \to b\)
\(6\)     \(-0.12 \; (0.0144)\)     \(-0.03 \; (0.0009)\)     \(\phantom-0.06 \; (0.0036)\)      \([b,c]\)     \(\phantom-0.02 \; (0.0004)\)     \(b \to a, \; d \to b\)
\(7\)     \(-0.03 \; (0.0009)\)     \(\phantom-0.02 \; (0.0004)\)     \(\phantom-0.06 \; (0.0036)\)      \([a,b]\)     \(\phantom-0.00 \; (0.0000)\)     \(b \to c, \; d \to b\)
\(8\)     \(-0.03 \; (0.0009)\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.02 \; (0.0004)\)      \([b,c]\)     \(\phantom-0.01 \; (0.0001)\)     \(d \to c\)
\(9\)     \(-0.03 \; (0.0009)\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.01 \; (0.0001)\)      \([a,b]\)     \(-0.02 \; (0.0004)\)     \(d \to a\)
\(10\)     \(-0.02 \; (0.0004)\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.01 \; (0.0001)\)      \([b,c]\)     \(\phantom-0.00 \; (0.0000)\)     \(b \to a, \; d \to b\)
\(11\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.01 \; (0.0001)\)      \([a,b]\)     \(\phantom-0.00 \; (0.0000)\)     \(b \to c, \; d \to b\)
\(12\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.00 \; (0.0000)\)     \(\phantom-0.00 \; (0.0000)\)               \(\Box\)

where \(\Box\) indicates that the algorithm has converged.
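
We can reproduce the spirit of this calculation with a few lines of standalone JavaScript, albeit working in full double precision rather than to two decimal places in the points

// minimise f(x) = x*x by alternately bisecting the left and right
// intervals of the bracket, starting from the worked example's points
function sq(x) { return x*x; }

var a = -2, b = 1, c = 4, d;
var left = true;

while(c-a>1e-8) {
 if(left) {
  d = a/2 + b/2;                     // bisect [a, b]
  if(sq(d)<=sq(b)) { c = b; b = d; } // a, d, b is the new bracket
  else             { a = d; }        // d, b, c is the new bracket
 }
 else {
  d = b/2 + c/2;                     // bisect [b, c]
  if(sq(d)<=sq(b)) { a = b; b = d; } // b, d, c is the new bracket
  else             { c = d; }        // a, b, d is the new bracket
 }
 left = !left;
}
console.log(b); // close to zero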

ak.bisectMinimum

Our implementation proceeds in much the same way as our inversion functions, as shown in listing 1. The ak.bisectMinimum function takes a function f to be minimised and an optional convergence threshold and, after checking that they are valid, returns a function that searches for a minimum within an interval \([x_0, x_1]\).

Listing 1: ak.bisectMinimum
ak.bisectMinimum = function(f, threshold) {
 if(ak.nativeType(f)!==ak.FUNCTION_T) {
  throw new Error('invalid function in ak.bisectMinimum');
 }

 threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
                                                       : Math.abs(threshold);
 if(isNaN(threshold)) {
  throw new Error('invalid convergence threshold in ak.bisectMinimum');
 }

 return function(x0, x1) {return minimum(f, x0, x1, threshold);};
};

Once again the real work is done by another function, minimum, given in listing 2, to which the function, the bounds of the interval and the convergence threshold are forwarded.

Listing 2: minimum
function minimum(f, x0, x1, eps) {
 var m  = x0!==x1 ? x0/2 + x1/2 : x0;
 var f0 = f(x0);
 var f1 = f(x1);
 var fm = f(m);
 var x  = [x0, m, x1];
 var fx = [f0, fm, f1];
 var i;

 if(!isFinite(m)) throw new Error('invalid argument in ak.bisectMinimum');
 for(i=0;i<3;++i) if(isNaN(fx[i])) fx[i] = ak.INFINITY;

 i = f0<f1 ? 0 : 2;
 while(ak.diff(x[0], x[2])>eps) {
  m  = bisect(x, fx, i);
  if(isNaN(fm = f(m))) fm = ak.INFINITY;
  update(x, fx, m, fm, i);
  i = 2-i;
 }
 if(f0<fx[1] || f1<fx[1]) return f0<f1 ? x0 : x1;
 return !isNaN(f(x[1])) ? x[1] : ak.NaN;
}

An important point to note is that since we're passing an interval rather than an actual bracket, our implementation must first find a bracket within that interval. As a first guess at a bracket we add the midpoint of the interval to its endpoints, storing the three points in x and their associated function evaluations in fx.
If either endpoint of the interval is infinite or NaN then, in every possible combination, the midpoint will be either infinite or NaN, so we shan't be able to progress and must bail out immediately with an exception. In order to ensure that we step away from points at which the function returns NaN we treat NaNs as being larger than all finite results, rather than incomparable to them, by replacing them with ak.INFINITY.

The main body of the loop shrinks the bracket until the normalised difference between the endpoints falls below our convergence threshold eps. We store the index of the endpoint of the current bisection interval in i and use it to calculate the bisection point m with bisect and generate the next bracket with update, both of which are given in listing 3.

Listing 3: bisect And update
function bisect(x, fx, i) {
 var m = x[1]/2 + x[i]/2;

 if(m===x[1] || m===x[i]) m = x[0] = x[1] = x[2] = fx[1]<fx[i] ? x[1] : x[i];
 return m;
}

function update(x, fx, m, fm, i) {
 if(fm>fx[1]) {
  x[i]  = m;
  fx[i] = fm;
 }
 else {
  x[2-i] = x[1];
  x[1]   = m;

  fx[2-i] = fx[1];
  fx[1]   = fm;
 }
}

Note that, as in ak.bisectInverse, the bisect function is carefully implemented to behave itself when confronted with numerical rounding errors. In update, if fm is greater than the function's value at the current midpoint then m simply replaces the endpoint of the bracket that's in the interval being bisected; otherwise the other endpoint, identified by 2-i, is replaced by the current midpoint and the current midpoint is replaced with m.
Now, this is precisely the bisection method which, as described above, requires that the initial three points form a valid bracket and at first glance it might seem that I have failed to provide the search for one that I claimed I would. However, somewhat surprisingly, it's also an effective method for finding a valid initial bracket!
Figure 5: Invalid Brackets

The Case Of The Unhinged Bracket

To analyse how the bisection method behaves for an invalid initial bracket we can again safely assume that \(f(a) \leqslant f(c)\) and it must then be the case that \(f(b) > f(a)\) or we'd have a valid bracket. Consequently there are two cases that we must consider
\[ \begin{align*} f(b) &> f(c)\\ f(b) &\leqslant f(c) \end{align*} \]
as shown by figure 5.

In the first case we have three possible outcomes for \(d\), the bisection of \([a,b]\)
Figure 6: The First Case

Figure 7: The Second Case
\[ \begin{align*} f(d) &\leqslant f(a) < f(b)\\ f(a) &< f(d) \leqslant f(b)\\ f(d) &> f(b) > f(c) \end{align*} \]
as illustrated by figure 6.
With the first of these \(a\), \(d\) and \(b\) form a valid bracket, so we can simply carry on with the bisection method as described. With the second we have a new example of the second case of an invalid initial bracket in \(a\), \(d\) and \(b\), as we do with the third in \(c\), \(b\) and \(d\), both of which are smaller than the original but still include one or the other of its endpoints.

Now the second case also has three possible outcomes for \(d\), shown in figure 7
\[ \begin{align*} f(d) &\leqslant f(a) < f(b)\\ f(a) &< f(d) \leqslant f(b)\\ f(b) &< f(d) \wedge f(b) \leqslant f(c) \end{align*} \]
where \(\wedge\) means and.
Once again the first of these yields a valid bracket with \(a\), \(d\) and \(b\). In the second we have another example of the second case of an invalid initial bracket with \(a\), \(d\) and \(b\), but again smaller than the original and including \(a\) as an endpoint. Finally, in the third, we have a valid bracket with \(d\), \(b\) and \(c\).

Each step must therefore either result in a valid bracket or an invalid one that is closer to one of the initial endpoints. In the event that a valid bracket is never found, this is sufficient to ensure convergence to that endpoint, which must then be a local minimum since we'd not have found a nearby point at which the function took a lesser value.
It's not often that an algorithm works as expected even when its assumptions are violated, so you should be a little bit impressed by the rock solid stability of this one!

The implementation of ak.bisectMinimum can be found in BisectMinimum.js and program 1 illustrates its use.

Program 1: ak.bisectMinimum

Note that the default code in program 1 begins with an invalid bracket and ends up in a local minimum, marked with a red cross. You can try changing the bounds x0 and x1 to see what effect they have upon the result and also changing the function f to satisfy yourself that ak.bisectMinimum is indeed a general purpose function minimiser.
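
By way of illustration, a hypothetical use of ak.bisectMinimum, assuming that ak.js and BisectMinimum.js have been loaded, might look like

var f = function(x) { return Math.sin(x) + 0.1*x; };
var min = ak.bisectMinimum(f, 1e-8);
var x = min(0, 10);

console.log(x, f(x)); // a local minimum of f within [0, 10]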

A Golden Opportunity

An unfortunate property of ak.bisectMinimum is that the rate at which it shrinks the width of the bracket can change from step to step. To see how, let's consider the first step, in which we begin with a bracket \(a\), \(b\), \(c\) where \(b\) is the midpoint of \([a,c]\) and we pick a new point \(d\) that bisects \([a,b]\).
Now, should we have a bracket in \(a\), \(d\) and \(b\) then it would be half the width of the original bracket. If, on the other hand, we have one in \(d\), \(b\) and \(c\) then it would instead only be three quarters its width. In the latter case we can at least guarantee that the next step will reduce the width to half of the original, but whether a reduction by half takes one or two steps represents a significant level of uncertainty in the performance of the algorithm.
Better then would be to ensure that the width of the bracket is reduced by the same proportion at every step.

To figure out what that proportion should be, assume that we have a bracket of \(0\), \(b\) and \(1\) with \(b<\frac{1}{2}\). The trial point for the next bracket, \(d\), will divide the larger interval \([b,1]\) and must satisfy
\[ d = 1-b \]
if both of the next possible brackets \(0\), \(b\), \(d\) and \(b\), \(d\), \(1\) are to have the same width.
If \(d\) yields the optimal division of the interval \([b,1]\), then the brackets at every step should be divided in the same proportions and therefore we must also have
\[ \frac{d-b}{1-b} = b \]
Substituting the first relation into the second yields
\[ \begin{align*} \frac{1-2b}{1-b} &= b\\ 1-2b &= b-b^2\\ b^2-3b+1 &= 0 \end{align*} \]
This is simply a quadratic equation in \(b\) which we can easily solve with the formula that we learned as children
\[ b = \frac{3 \pm \sqrt{3^2-4 \times 1 \times 1}}{2\times 1} = \frac{3 \pm \sqrt{5}}{2} \]
of which only \(\frac{3 - \sqrt{5}}{2}\) lies between zero and one. The trial point for the next bracket is therefore given by
\[ d = 1-b = 1 - \frac{3 - \sqrt{5}}{2} = \frac{1 + \sqrt{5}}{2} - 1 = \phi - 1 \approx 0.618 \]
where \(\phi\) is the golden ratio, and as a result this algorithm is known as golden section search.
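
As a quick sanity check of the algebra, the following snippet, which assumes nothing beyond the value of the golden ratio, confirms numerically that these proportions are self-similar

var phi = (1 + Math.sqrt(5))/2;
var d = phi - 1; // approximately 0.618
var b = 1 - d;   // approximately 0.382

console.log(b*b - 3*b + 1);   // roughly zero, so b satisfies the quadratic
console.log((d-b)/(1-b) - b); // roughly zero, so the division is self-similar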

ak.goldenMinimum

Since the only difference between our bisection algorithm and golden section search is the choice of the dividing point in the larger bracket, you might expect that there's not much of a difference between their implementations.

And you'd be right.

Listing 4 gives the implementation of ak.goldenMinimum which, like ak.bisectMinimum, simply checks the validity of the arguments, provides a default convergence threshold if none is given and returns a function that defers to another function, minimum, to do the actual work.

Listing 4: ak.goldenMinimum
ak.goldenMinimum = function(f, threshold) {
 if(ak.nativeType(f)!==ak.FUNCTION_T) {
  throw new Error('invalid function in ak.goldenMinimum');
 }

 threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
                                                       : Math.abs(threshold);
 if(isNaN(threshold)) {
  throw new Error('invalid convergence threshold in ak.goldenMinimum');
 }

 return function(x0, x1) {return minimum(f, x0, x1, threshold);};
};

The minimum function is given in listing 5 and is, as expected, much like the one we used in ak.bisectMinimum.

Listing 5: minimum
var D = ak.PHI-1;
var B = 1-D;

function minimum(f, x0, x1, eps) {
 var f0 = f(x0);
 var f1 = f(x1);
 var m, fm, x, fx, i;

 if(f0<f1) {
  m  = x0; x0 = x1; x1 = m;
  fm = f0; f0 = f1; f1 = fm;
 }

 m  = x0!==x1 ? D*x0 + B*x1 : x0;
 fm = f(m);
 x  = [x0, m, x1];
 fx = [f0, fm, f1];

 if(!isFinite(m)) throw new Error('invalid argument in ak.goldenMinimum');
 for(i=0;i<3;++i) if(isNaN(fx[i])) fx[i] = ak.INFINITY;

 while(ak.diff(x[0], x[2])>eps) {
  m  = bisect(x, fx);
  if(isNaN(fm = f(m))) fm = ak.INFINITY;
  update(x, fx, m, fm);
 }
 if(f0<fx[1] || f1<fx[1]) return f0<f1 ? x0 : x1;
 return !isNaN(f(x[1])) ? x[1] : ak.NaN;
}

Once again, we keep the bracket in x and its function values in fx. The main difference between the implementation of this version of minimum and the previous one is that we cannot simply switch indices at each step to select the larger interval in the bracket. Instead, we shall ensure that the first interval in the bracket is always the smaller one. For the initial bracket, this means that m needs to be a fraction \(b\) into the interval defined by x0 and x1. With a little algebra, we achieve this with
\[ \begin{align*} m &= x_0 + b \times (x_1-x_0)\\ &= (1-b) \times x_0 + b \times x_1\\ &= d \times x_0 + b \times x_1 \end{align*} \]
since \(d = 1-b\).
In the main body of the loop, the disturbingly inaccurately named bisect function must ensure that each trial point is a fraction \(d\) into the bracket, whilst the reassuringly appropriately named update function must guarantee that it will keep the smaller interval in x[0] and x[1]. Both of these functions are given in listing 6.

Listing 6: bisect And update
function bisect(x, fx) {
 var m = B*x[0] + D*x[2];

 if(m===x[1] || m===x[2]) m = x[0] = x[1] = x[2] = fx[1]<fx[2] ? x[1] : x[2];
 return m;
}

function update(x, fx, m, fm) {
 if(fm>fx[1]) {
  x[2] = x[0];
  x[0] = m;

  fx[2] = fx[0];
  fx[0] = fm;
 }
 else {
  x[0] = x[1];
  x[1] = m;

  fx[0] = fx[1];
  fx[1] = fm;
 }
}

Here, bisect exploits a similar formula to the one that we used to set up the initial bracket to correctly choose the trial point.
Now, if we start a step with a bracket of \(a\), \(b\) and \(c\) in which \([a,b]\) is the smaller interval and we divide \([b,c]\) with \(d\) then if the next bracket is given by \(a\), \(b\) and \(d\) we must swap the endpoints since \([b,d]\) is smaller than \([a,b]\). Rather than do this explicitly, update simply manipulates x and fx so that the end result is the same as if it had.
Finally, since the only behavioural difference is the proportion by which we split up the bracket, exactly the same argument applies as to its effectiveness in the event that we begin with an invalid bracket.

Program 2 demonstrates ak.goldenMinimum and its implementation can be found in GoldenMinimum.js.

Program 2: ak.goldenMinimum

That ak.goldenMinimum converges upon the global minimum in this example when ak.bisectMinimum didn't is fortunate but is by no means guaranteed in general. It is important to keep in mind that the only advantage of golden section search is that it guarantees the rate of convergence to a local minimum.
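
A hypothetical side by side comparison, again assuming that the relevant ak scripts have been loaded, makes this easy to explore

var f = function(x) { return x*x*x*x - 3*x*x + x; };

// both return local minima of f within [-3, 3], but not necessarily
// the same one and not necessarily the global minimum
console.log(ak.bisectMinimum(f)(-3, 3));
console.log(ak.goldenMinimum(f)(-3, 3));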

References

[1] Rooting Around For Answers, www.thusspakeak.com, 2014.

[2] On The Shoulders Of Gradients, www.thusspakeak.com, 2014.

[3] Every Secant Counts, www.thusspakeak.com, 2014.
