Wolfe It Down


Last time[1] we saw how we could efficiently invert a vector valued multivariate function with the Levenberg-Marquardt algorithm, which replaces the sum of the second derivatives of each element of its result, multiplied by that element's difference from its target value, with a diagonal matrix. Similarly, there are minimisation algorithms that use approximations of the Hessian matrix of second partial derivatives to estimate directions in which the value of the function will decrease.
Before we take a look at them, however, we'll need a way to step toward minima in such directions, known as a line search, and in this post we shall see how we might reasonably do so.

Walking It Back

The key requirement for line searches is speed, as opposed to accuracy, provided we make reasonable progress toward a minimum. A naive approach is to find a sufficiently large positive \(\alpha\) that satisfies
\[ f(\mathbf{x} + \alpha \times \Delta \mathbf{x}) < f(\mathbf{x}) \]
where \(\Delta \mathbf{x}\) is a direction of descent for \(f(\mathbf{x})\), meaning that if \(\alpha\) is small enough the inequality is guaranteed. The problem with this is that it would allow us to take a large step that has very little impact on the result of \(f\). To guard against that we can use the Armijo rule[2], defined for a constant \(c_1\) that lies strictly between zero and one as
\[ f(\mathbf{x} + \alpha \times \Delta \mathbf{x}) \leqslant f(\mathbf{x}) + c_1 \times \alpha \times \Delta \mathbf{x} \times \nabla f(\mathbf{x}) \]
where \(\nabla f(\mathbf{x})\) is the vector of partial derivatives of \(f\) at \(\mathbf{x}\), so that
\[ \Delta \mathbf{x} \times \nabla f(\mathbf{x}) < 0 \]
This has the effect of requiring longer steps to result in greater decreases in the result of the function, as illustrated by figure 1 which plots the quadratic function
\[ f(x) = x^2 - 0.8 \times x + 0.3 \]
as a solid black line and the upper bound of acceptable values, for \(c_1\) equal to one tenth, as a dashed line.

Figure 1: The Armijo Rule

To exploit the Armijo rule we can start with a deliberately excessive value for \(\alpha\) which is then progressively reduced until it is satisfied, known as backtracking. For example, program 1 uses this approach for the quadratic from figure 1, starting with \(\alpha\) equal to four and multiplying it by a factor of three quarters at each step.

Program 1: Backtracking

Note that the search stops short of the minimum which occurs when
\[ f^\prime(x) = 2 \times x - 0.8 = 0 \]
at 0.4.
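
As a rough sketch of a single backtracking pass of this kind, using plain JavaScript numbers rather than the ak library's types and a hypothetical backtrack helper, we might write something like

function backtrack(f, df, x, dx, a, shrink, c1) {
 // shrink a deliberately excessive step length until the Armijo rule holds
 var dxdfx = dx*df(x);
 while(f(x + a*dx) > f(x) + c1*a*dxdfx) a *= shrink;
 return x + a*dx;
}

// the quadratic from figure 1, starting at zero in the unit direction of
// descent with an initial step of four, a shrinkage factor of three quarters
// and c1 equal to one tenth
console.log(backtrack(function(x){return x*x - 0.8*x + 0.3;},
                      function(x){return 2*x - 0.8;},
                      0, 1, 4, 0.75, 0.1));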

Unfortunately the Armijo rule can result in very small steps. So small, in fact, that using it to repeatedly backtrack along a given line can fail to converge on its minimum, as shown by derivation 1.

Derivation 1: Failure To Converge
First note that for the above polynomial \(\Delta x_k\) is one for \(x_k\) less than 0.4. Defining
\[ \begin{align*} x_0 &= 0\\ \alpha_k &= \frac18 \times \frac1{2^k} \end{align*} \]
\(x_k\) is given by the geometric series
\[ x_k = \sum_{j=0}^{k-1} \alpha_j = \frac18 \times \frac{1-\frac1{2^k}}{1-\frac12} = \frac14 \times \left(1-\frac1{2^k}\right) \]
where \(\sum\) is the summation sign, so that \(x_k\) is trivially bounded above by one quarter.
Next we have
\[ \begin{align*} f\left(x_k+\alpha_k\right) &= \left(x_k+\alpha_k\right)^2 - 0.8 \times \left(x_k+\alpha_k\right) + 0.3\\ &= x_k^2 - 0.8 \times x_k + 0.3 + 2 \times x_k \times \alpha_k + \alpha_k^2 - 0.8 \times \alpha_k\\ &= f\left(x_k\right) + 2 \times x_k \times \alpha_k + \alpha_k^2 - 0.8 \times \alpha_k \end{align*} \]
and so
\[ f\left(x_k+\alpha_k\right) - f\left(x_k\right) = 2 \times x_k \times \alpha_k + \alpha_k^2 - 0.8 \times \alpha_k = \alpha_k \times \left(2 \times x_k + \alpha_k - 0.8\right) \]
This is negative for all \(k\), and by a sufficient margin that the Armijo rule is satisfied for \(c_1\) equal to one tenth, and so these ever shrinking steps are all acceptable even though \(x_k\) can never reach the minimum at 0.4.
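
Assuming plain JavaScript numbers, a quick numerical check of this might be

var f  = function(x) {return x*x - 0.8*x + 0.3;};
var df = function(x) {return 2*x - 0.8;};
var c1 = 0.1;
var x = 0;
var a = 1/8;
var k;

for(k=0;k<30;++k) {
 // the direction of descent is one so the Armijo bound is f(x) + c1*a*df(x)
 if(f(x+a) > f(x) + c1*a*df(x)) throw new Error('Armijo rule violated');
 x += a;
 a /= 2;
}
console.log(x); // roughly 0.25, well short of the minimum at 0.4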

A Wolfe Pack

To rectify this we can use the Wolfe conditions[3], which comprise both the Armijo rule and the curvature condition
\[ \Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) \geqslant c_2 \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) \]
where \(c_2\) is greater than \(c_1\) and less than one. The latter additionally ensures that the gradient after the step is either sufficiently less negative than it was before, in which case it will have moved a not insignificant distance towards the minimum, or non-negative, in which case it will have reached or overstepped it.
Like any test based on gradients, it's possible for the curvature condition to incorrectly identify local maxima or points of inflection as local minima, although in practice it's unlikely that we'd step exactly to one for any reasonably well behaved function.

Of course, we can't simply assume that both conditions can be met simultaneously and so derivation 2 proves it for smooth continuous line curves.

Derivation 2: Satisfying The Wolfe Conditions
The curvature condition is trivially satisfied by a minimum and so we need only concern ourselves with line curves that have negative gradients throughout the region in which they satisfy the Armijo rule.
If \(\alpha\) is the step length to the nearest point at which the rule is about to be broken, we have
\[ \begin{align*} f(\mathbf{x} + \alpha \times \Delta \mathbf{x}) &= f(\mathbf{x}) + c_1 \times \alpha \times \Delta \mathbf{x} \times \nabla f(\mathbf{x})\\ \Delta \mathbf{x} \times \nabla f(\mathbf{x}) &= \frac{1}{c_1} \times \frac{f(\mathbf{x} + \alpha \times \Delta \mathbf{x}) - f(\mathbf{x})}{\alpha}\\ &= \frac{1}{c_1} \times g \end{align*} \]
where \(g\) is the gradient of the straight line connecting the starting point and the breaking point. Furthermore, since the curve can only break through the line if it has a greater gradient
\[ \Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) > g \]
Substituting the gradient at \(\mathbf{x}\) into the right hand side of the curvature condition yields
\[ c_2 \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) = \frac{c_2}{c_1} \times g \]
Finally, since \(g\) is negative, we have
\[ \frac{1}{c_1} \times g < \frac{c_2}{c_1} \times g < g \]
and so there must be points between the starting and breaking points that satisfy both conditions.
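
For example, stepping along the quadratic from figure 1 from \(x=0\) in the direction \(\Delta x = 1\), so that the products with \(\Delta x\) reduce to \(f^\prime\), and taking, say, \(c_1\) equal to one tenth and \(c_2\) equal to one half, the step \(\alpha = \frac12\) satisfies both conditions since
\[ f\left(\frac12\right) = 0.15 \leqslant f(0) + c_1 \times \frac12 \times f^\prime(0) = 0.3 - 0.04 = 0.26 \]
and
\[ f^\prime\left(\frac12\right) = 0.2 \geqslant c_2 \times f^\prime(0) = -0.4 \]
whereas the much shorter step \(\alpha = \frac1{10}\) passes the Armijo rule but fails the curvature condition since
\[ f^\prime\left(\frac1{10}\right) = -0.6 < c_2 \times f^\prime(0) = -0.4 \]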

Whilst proving that such points exist isn't too difficult, finding them is another matter entirely. The backtracking algorithm simply isn't up to the job since it can jump over the points that meet the curvature condition while searching for one that satisfies the Armijo rule.

Wolfe Tracking

A simple, relatively robust, albeit not especially efficient, approach is bisection search, which we have previously[4] used to numerically invert scalar valued univariate functions. If we can find an interval, in this context known as a bracket, that is sure to contain points that meet the Wolfe conditions then, by iteratively replacing its upper or lower bound, as appropriate, with its midpoint, we can narrow it down until the midpoint eventually passes them.
We can use our proof of the existence of satisfactory points as a means to do this. If we start with a bracket whose lower bound satisfies the Armijo rule and whose upper bound doesn't then we know it must contain points that meet our criteria. If the midpoint passes both of the Wolfe conditions then we've found what we're looking for and can stop searching. Otherwise, if it passes the Armijo rule it replaces the lower bound and if it doesn't it replaces the upper bound, yielding a new bracket of half the width that we know must also contain satisfactory points.

To construct the initial bracket we first take a step from the starting point. If the new point passes both tests then we don't need a bracket and can stop immediately. If it fails the Armijo rule then we have an upper bound and need to find a lower bound, and if it doesn't then we have a lower bound and need to find an upper bound, as done by the bracket function in listing 1 using bracketLeft and bracketRight respectively.

Listing 1: The Initial Search
function armijo(c1, fx, dxdfx, a, fxadx) {
 return fxadx <= fx + c1*a*dxdfx;
}

function curvature(c2, dx, dxdfx, dfxadx) {
 return ak.mul(dx, dfxadx) >= c2*dxdfx;
}

function bracket(x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps) {
 var xdx = ak.add(x, dx);
 var fxdx, dfxdx;

 fxdx = f(xdx);
 if(!armijo(c1, fx, dxdfx, 1, fxdx)) {
  return bracketLeft(x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps);
 }

 dfxdx = df(xdx);
 if(!curvature(c2, dx, dxdfx, dfxdx)) {
  return bracketRight(x, dx, fx, dfx, dxdfx, fxdx, f, df, c1, c2, steps);
 }

 return {x:xdx, fx:fxdx, dfx:dfxdx, passed:true};
}

Here dx represents both the direction and length of the initial step which we can get away with because the Armijo rule always includes the length and it cancels out in the left and right hand sides of the curvature condition. The functions f and df calculate the value of the function and its derivative respectively, the starting point is given by x, the value of the function at it by fx and the derivative by dfx. The product of the initial step and the derivative at \(\mathbf{x}\) is stored in dxdfx and a represents a scaling factor of the initial step with fxadx and dfxadx being the values of the function and derivative at the result of scaled steps.

To find a lower bound given an upper bound the bracketLeft function defined in listing 2 repeatedly halves the initial step until the result of the Armijo rule ac is true.

Listing 2: The Left Bracket Search
function bracketLeft(x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps) {
 var a = 0.5;
 var xadx = ak.add(x, ak.mul(a, dx));
 var fxadx = f(xadx);
 var dfxadx = df(xadx);
 var ac = armijo(c1, fx, dxdfx, a, fxadx);

 while(!ac) {
  if(steps--===0) {
   throw new Error('failure to find bracket in ak.wolfeLineSearch');
  }
  a *= 0.5;
  xadx = ak.add(x, ak.mul(a, dx));
  fxadx = f(xadx);
  dfxadx = df(xadx);
  ac = armijo(c1, fx, dxdfx, a, fxadx);
 }
 return {a0:a, a1:2*a, x:xadx, fx:fxadx, dfx:dfxadx,
         passed:ac && curvature(c2, dx, dxdfx, dfxadx)
        };
}

Note that once we've found a lower bound for \(\alpha\) its upper bound must be twice its value, since that was the last trial value to have failed the Armijo rule.

To find an upper bound we instead keep doubling the step length until either the Armijo rule fails or the curvature condition is satisfied, as done by bracketRight in listing 3.

Listing 3: The Right Bracket Search
function bracketRight(x, dx, fx, dfx, dxdfx, fxdx, f, df, c1, c2, steps) {
 var a = 2;
 var a0 = 1;
 var xadx = ak.add(x, ak.mul(a, dx));
 var fxadx = f(xadx);
 var dfxadx = df(xadx);
 var ac = armijo(c1, fx, dxdfx, a, fxadx);
 var cc = curvature(c2, dx, dxdfx, dfxadx);

 while(ac && !cc) {
  if(steps--===0) {
   throw new Error('failure to find bracket in ak.wolfeLineSearch');
  }
  a *= 2;
  xadx = ak.add(x, ak.mul(a, dx));
  fxadx = f(xadx);
  dfxadx = df(xadx);
  ac = armijo(c1, fx, dxdfx, a, fxadx);
  cc = curvature(c2, dx, dxdfx, dfxadx);

  if(fxadx<fxdx) {
   a0 = a;
   fxdx = fxadx;
  }
 }
 return {a0:a0, a1:a, x:xadx, fx:fxadx, dfx:dfxadx, passed:ac&&cc};
}

To narrow the bracket we also replace the lower bound at each step if the function's value there is smaller than the smallest found so far, ensuring that we don't move it past the minimum.

Figure 2: Two Local Minima
Now it's possible that the upper bound will step past a local maximum and into the neighbourhood of a new local minimum, as might happen in figure 2.
If so there are three possibilities. Firstly, if the new point satisfies the Wolfe conditions then we'll simply accept it. Secondly, if the new point satisfies the Armijo rule and has a smaller value than the current smallest then the lower bound will move to the new minimum and the search will continue from there. Thirdly, if not, then we shall continue looking for an upper bound that meets our requirements. In the last case the bisection search will settle into the region of one or the other local minimum.
It's also possible that the upper bound will end up at a point beyond which the curve neither fails the Armijo rule nor meets the curvature condition. If so, the search for a bracket will run out of steps and throw an exception, which isn't entirely unreasonable given that in all likelihood the curve is unbounded below and consequently doesn't have a minimum.
Another possibility is that it'll keep jumping from the region of one local minimum to another, never satisfying the termination requirement, in which case we'll run out of steps again and generate an error, which is fair enough since it means that there isn't a uniquely identifiable local minimum to approach.

Returning to our implementation, if the search for a bracket finds a point that satisfies the Wolfe conditions then the passed property of the result will be true and its x, fx and dfx properties will contain its location, the value of the function and its derivative respectively.
If not, we shall resort to bisection, as implemented in listing 4.

Listing 4: The Bisection Search
function bisect(a0, a1, x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps) {
 var a = a0/2 + a1/2;
 var xadx = ak.add(x, ak.mul(a, dx));
 var fxadx = f(xadx);
 var dfxadx = df(xadx);
 var ac = armijo(c1, fx, dxdfx, a, fxadx);
 var cc = curvature(c2, dx, dxdfx, dfxadx);

 while(!ac || !cc) {
  if(steps--===0) {
   throw new Error('failure to converge in ak.wolfeLineSearch');
  }

  if(!ac) a1 = a;
  else    a0 = a;

  a = a0/2 + a1/2;
  xadx = ak.add(x, ak.mul(a, dx));
  fxadx = f(xadx);
  dfxadx = df(xadx);
  ac = armijo(c1, fx, dxdfx, a, fxadx);
  cc = curvature(c2, dx, dxdfx, dfxadx);
 }
 return {x:xadx, fx:fxadx, dfx:dfxadx};
}

ak.wolfeLineSearch

The ak.wolfeLineSearch function shown in listing 5 and implemented in WolfeLineSearch.js uses these functions to create a line search function based upon the Wolfe conditions.

Listing 5: ak.wolfeLineSearch
ak.wolfeLineSearch = function(f, df, c1, c2, steps) {
 if(ak.nativeType(f)!==ak.FUNCTION_T) {
  throw new Error('invalid function in ak.wolfeLineSearch');
 }
 if(ak.nativeType(df)!==ak.FUNCTION_T) {
  throw new Error('invalid derivative in ak.wolfeLineSearch');
 }

 c1 = Number(c1);
 if(!(c1>0 && c1<1)) {
  throw new Error('invalid armijo constant in ak.wolfeLineSearch');
 }

 c2 = Number(c2);
 if(!(c2>c1 && c2<1)) {
  throw new Error('invalid curvature constant in ak.wolfeLineSearch');
 }

 steps = ak.nativeType(steps)===ak.UNDEFINED_T ? 1000 : ak.floor(steps);
 if(!(steps>0)) throw new Error('invalid steps in ak.wolfeLineSearch');

 return function(x, dx, fx, dfx) {
  var dxdfx, b;
  if(ak.nativeType(fx)===ak.UNDEFINED_T)  fx = f(x);
  if(ak.nativeType(dfx)===ak.UNDEFINED_T) dfx = df(x);

  dxdfx = ak.mul(dx, dfx);
  if(dxdfx===0) return {x: x, fx: fx, dfx: dfx};

  if(dxdfx>0) {
   dx = ak.neg(dx);
   dxdfx = -dxdfx;
  }

  b = bracket(x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps);
  if(b.passed) return {x:b.x, fx:b.fx, dfx:b.dfx};
  return bisect(b.a0, b.a1, x, dx, fx, dfx, dxdfx, f, df, c1, c2, steps);
 };
};

This first checks that the function and its derivative f and df are functions, that c1 and c2 are numbers within the allowed ranges for the Wolfe conditions and that steps, if defined, is no less than one, defaulting to one thousand if undefined. It then returns the line search function which takes a starting point x, an initial step dx and optional function and derivative values at the starting point fx and dfx, explicitly calculating them if not defined.
If the gradient in the direction of the step is zero then we immediately return the starting point, the value of the function and its derivative, and if it is positive, meaning that the step is in a direction of ascent, we reverse the step by negating it. Finally we search for a bracket, returning its values for x, fx and dfx if a point satisfying the Wolfe conditions was found and applying the bisection search otherwise.

Program 2 demonstrates its use by iteratively applying it to our example quadratic function with a relatively large initial step so that it oscillates around the minimum as it converges. Note that step length depends upon the magnitude of the derivative, meaning that it shrinks as the minimum is approached.

Program 2: Iterated Wolfe Search

This nicely illustrates why ak.wolfeLineSearch returns the value and derivative of the function in addition to the point that it settles upon; they can be used to both test for the convergence of an iterative algorithm and begin its next step.
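
For example, assuming that ak.add, ak.mul and ak.neg are overloaded for native numbers, so that our univariate quadratic can be passed to the search directly, an iteration along the lines of program 2 might be sketched as

var f  = function(x) {return x*x - 0.8*x + 0.3;};
var df = function(x) {return 2*x - 0.8;};
var wolfe = ak.wolfeLineSearch(f, df, 0.1, 0.5);
var y = {x: 0, fx: f(0), dfx: df(0)};
var i;

for(i=0;i<10 && Math.abs(y.dfx)>1e-8;++i) {
 // take a relatively large step of minus three times the derivative so that
 // the iterates oscillate around, and close in upon, the minimum at 0.4
 y = wolfe(y.x, -3*y.dfx, y.fx, y.dfx);
 console.log(y.x, y.fx, y.dfx);
}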

References

[1] Found In Space, www.thusspakeak.com, 2021.

[2] Armijo, L., Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, Vol. 16, No. 1, 1966.

[3] Wolfe, P., Convergence Conditions for Ascent Methods, SIAM Review, Vol. 11, No. 2, 1969.

[4] Rooting Around For Answers, www.thusspakeak.com, 2014.
