Bring Out The Big Flipping GunS


Last month we took a look at quasi-Newton multivariate function minimisation algorithms[1], which use approximations of the Hessian matrix of second partial derivatives to choose line search directions. We demonstrated that the BFGS rule for updating the approximate Hessian after each line search preserves its positive definiteness provided that the steps satisfy the Wolfe conditions[2], ensuring that the locally quadratic approximation of the function, defined by its value, its vector of first partial derivatives and the Hessian, has a minimum.
Now that we've got the theoretical details out of the way it's time to get on with the implementation.

The BFGS Rule

Recall that we begin by using a truncated multivariate Taylor series to quadratically approximate the value of a function \(f\) in the neighbourhood of a point \(\mathbf{x}\)
\[ f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \Delta \mathbf{x} \times \nabla f(\mathbf{x}) + \tfrac12 \times \Delta \mathbf{x} \times \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} \]
where \(\Delta \mathbf{x}\) is a step away from \(\mathbf{x}\), \(\nabla f(\mathbf{x})\) is the vector of partial derivatives and \(\mathbf{H}\) is the Hessian, to derive the linear approximation of the effect of the step upon its derivative
\[ \nabla f(\mathbf{x} + \Delta \mathbf{x}) \approx \nabla f(\mathbf{x}) + \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} \]
At the minimum the derivative must be zero and so we surmised that a reasonable step might be given by
\[ \nabla f(\mathbf{x}) + \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} = \mathbf{0}\\ \Delta \mathbf{x} = -\mathbf{H}^{-1}(\mathbf{x}) \times \nabla f(\mathbf{x}) \]
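For example, if \(f\) is itself quadratic, say
\[ f(\mathbf{x}) = \tfrac12 \times \mathbf{x} \times \mathbf{A} \times \mathbf{x} - \mathbf{b} \times \mathbf{x} \]
for a symmetric positive definite matrix \(\mathbf{A}\) and a vector \(\mathbf{b}\), then the approximation is exact, with \(\nabla f(\mathbf{x}) = \mathbf{A} \times \mathbf{x} - \mathbf{b}\) and \(\mathbf{H}(\mathbf{x}) = \mathbf{A}\) everywhere, and the step
\[ \Delta \mathbf{x} = -\mathbf{A}^{-1} \times \left(\mathbf{A} \times \mathbf{x} - \mathbf{b}\right) = \mathbf{A}^{-1} \times \mathbf{b} - \mathbf{x} \]
lands exactly upon the minimum at \(\mathbf{A}^{-1} \times \mathbf{b}\), where the derivative vanishes, in a single iteration.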
Next, if we're iteratively taking steps using approximated Hessians then we should expect the \((k+1)^\mathrm{th}\) approximation to reflect the change in the derivative during the \(k^\mathrm{th}\) step
\[ \begin{align*} \nabla f\left(\mathbf{x}_k + \Delta \mathbf{x}_k\right) &= \nabla f\left(\mathbf{x}_k\right) + \hat{\mathbf{H}}_{k+1} \times \Delta \mathbf{x}_k\\ \hat{\mathbf{H}}_{k+1} \times \Delta \mathbf{x}_k &= \nabla f\left(\mathbf{x}_k + \Delta \mathbf{x}_k\right) - \nabla f\left(\mathbf{x}_k\right) = \mathbf{y}_k \end{align*} \]
Finally, we saw that the BFGS update rule does so with
\[ \begin{align*} \hat{\mathbf{H}}_{k+1} &= \hat{\mathbf{H}}_k + \Delta \hat{\mathbf{H}}_k\\ \Delta \hat{\mathbf{H}}_k &= \frac{\mathbf{y}_k \otimes \mathbf{y}_k}{\mathbf{y}_k \times \Delta \mathbf{x}_k} - \frac{\hat{\mathbf{H}}_k \times \left(\Delta \mathbf{x}_k \otimes \Delta \mathbf{x}_k\right) \times \hat{\mathbf{H}}_k}{\Delta \mathbf{x}_k \times \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k} \end{align*} \]
where \(\otimes\) is the outer product of a pair of vectors, being the matrix defined by
\[ \begin{align*} \mathbf{M} &= \mathbf{u} \otimes \mathbf{v}\\ m_{ij} &= u_i \times v_j \end{align*} \]
Furthermore, it allows us to directly update the inverse of the approximation with
\[ \begin{align*} \hat{\mathbf{H}}_{k+1}^{-1} &= \hat{\mathbf{H}}_k^{-1} + \Delta \hat{\mathbf{H}}_k^{-1}\\ \Delta \hat{\mathbf{H}}_k^{-1} &= \frac{\left(\Delta \mathbf{x}_k \times \mathbf{y}_k + \mathbf{y}_k \times \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k\right) \times \left(\Delta \mathbf{x}_k \otimes \Delta \mathbf{x}_k\right)}{\left(\Delta \mathbf{x}_k \times \mathbf{y}_k\right)^2} - \frac{\hat{\mathbf{H}}_k^{-1} \times \left(\mathbf{y}_k \otimes \Delta \mathbf{x}_k\right) + \left(\Delta \mathbf{x}_k \otimes \mathbf{y}_k\right) \times \hat{\mathbf{H}}_k^{-1}}{\Delta \mathbf{x}_k \times \mathbf{y}_k} \end{align*} \]
saving us the expense of having to explicitly invert it.
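To put some flesh on its bones, if \(\hat{\mathbf{H}}_k^{-1}\) is the two by two identity matrix, \(\Delta \mathbf{x}_k = (1, 0)\) and \(\mathbf{y}_k = (1, 1)\) then \(\Delta \mathbf{x}_k \times \mathbf{y}_k = 1\), \(\mathbf{y}_k \times \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k = 2\) and so
\[ \begin{align*} \Delta \hat{\mathbf{H}}_k^{-1} &= 3 \times \begin{pmatrix}1&0\\0&0\end{pmatrix} - \left(\begin{pmatrix}1&0\\1&0\end{pmatrix} + \begin{pmatrix}1&1\\0&0\end{pmatrix}\right) = \begin{pmatrix}1&-1\\-1&0\end{pmatrix}\\ \hat{\mathbf{H}}_{k+1}^{-1} &= \begin{pmatrix}2&-1\\-1&1\end{pmatrix} \end{align*} \]
which, as required, maps \(\mathbf{y}_k\) onto \(\Delta \mathbf{x}_k\) and is still positive definite.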
Figure 1: Rosenbrock's Function

A Worked Example

A common test case for multivariate minimisation algorithms is Rosenbrock's function, which is defined by
\[ f(\mathbf{x}) = 100 \times \left(x_1 - x_0^2\right)^2 + \left(1-x_0\right)^2 \]
and represents a narrow, steeply sided curved valley with a shallow sloping floor, as illustrated by figure 1. The reason that it is such a challenging test case is that steps in a downhill direction from one side of the valley floor frequently end upon the other side of it, with the resulting zigzag path slowing progress toward the minimum at \((1,1)\).
Its derivative is trivially given by
\[ \nabla f(\mathbf{x}) = \begin{pmatrix} -400 \times \left(x_1 - x_0^2\right) \times x_0 - 2 \times \left(1-x_0\right)\\ 200 \times \left(x_1 - x_0^2\right) \end{pmatrix} \]
and program 1 uses our ak.vector and ak.matrix types to implement both the function and its derivative, together with the formula for \(\Delta \hat{\mathbf{H}}_k^{-1}\) given above.
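For reference, and assuming that the ak.vector constructor accepts an array of its elements, minimal sketches of the function and its derivative might look something like

// Rosenbrock's function of an ak.vector x
var f = function(x) {
 var x0 = x.at(0), x1 = x.at(1);
 return 100*Math.pow(x1-x0*x0, 2) + Math.pow(1-x0, 2);
};

// and its vector of first partial derivatives
var df = function(x) {
 var x0 = x.at(0), x1 = x.at(1);
 return ak.vector([-400*(x1-x0*x0)*x0 - 2*(1-x0), 200*(x1-x0*x0)]);
};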

Program 1: Minimising Rosenbrock's Function

Note that we don't need to negate the product of \(\hat{\mathbf{H}}_k^{-1}\) and \(\nabla f\left(\mathbf{x}_k\right)\) to find \(\Delta \mathbf{x}_k\) since our ak.wolfeLineSearch function[3] will do it for us.
Whilst the algorithm clearly steps back and forth where the valley is particularly curved as it corrects for moving up a valley wall, it still manages to reach the minimum with a budget of just fifty line searches. It converges so quickly, in fact, that we need a quarter of a second delay between steps in order to see how they progress.

ak.quasiNewtonMinimum

The inverse Hessian update in program 1 creates several temporary matrices during its calculation and is consequently far too inefficient for a decent implementation of the BFGS quasi-Newton algorithm. Using \(\mathbf{A}\) and \(\mathbf{B}\) to represent matrices, \(\mathbf{u}\) and \(\mathbf{v}\) to represent vectors and \(c\) to represent a scalar, we can use the following four identities to eliminate them.
Firstly
\[ \mathbf{v} \times \mathbf{A} \times \mathbf{v} = \sum_i \sum_j v_i \times a_{ij} \times v_j \]
where \(\sum\) is the summation sign, secondly
\[ \begin{align*} \mathbf{A} &= c \times (\mathbf{v} \otimes \mathbf{v})\\ a_{ij} &= c \times v_i \times v_j \end{align*} \]
thirdly
\[ \begin{align*} \mathbf{A} &= \mathbf{B} \times (\mathbf{u} \otimes \mathbf{v})\\ a_{ij} &= \sum_k b_{ik} \times u_k \times v_j \end{align*} \]
and fourthly, for a symmetric \(\mathbf{B}\) such as our approximate inverse Hessian,
\[ \mathbf{A}^\mathrm{T} = (\mathbf{v} \otimes \mathbf{u}) \times \mathbf{B} \]
so that the two matrix products in the last term of \(\Delta \hat{\mathbf{H}}_k^{-1}\) are simply transposes of one another.
Together these allow us to calculate the elements of \(\hat{\mathbf{H}}^{-1}_{k+1}\) using a single two-dimensional array, as illustrated by listing 1.

Listing 1: The BFGS Update
function update(h, dx, y) {
 var n = dx.dims();
 var h1 = h.toArray();
 var dxy = ak.mul(dx, y);
 var yhy = 0;
 var i, j, k, yi, h1i, c, dxi, cdxi, dxj, hydx;

 // the scalar product of y with h times y, using the first identity
 for(i=0;i<n;++i) {
  yi = y.at(i);
  h1i = h1[i];
  for(j=0;j<n;++j) yhy += yi*h1i[j]*y.at(j);
 }
 c = (dxy+yhy)/(dxy*dxy);

 for(i=0;i<n;++i) {
  dxi = dx.at(i);
  cdxi = c*dxi;
  h1i = h1[i];

  // add c times the outer product of dx with itself, filling in both
  // halves of the symmetric result using the second identity
  for(j=0;j<i;++j) h1[j][i] = h1i[j] += cdxi*dx.at(j);
  h1i[i] += cdxi*dxi;

  // subtract h times the outer product of y and dx, divided by dxy,
  // together with its transpose, using the third and fourth identities
  // and reading the original elements from h
  for(j=0;j<n;++j) {
   dxj = dx.at(j);

   hydx = 0;
   for(k=0;k<n;++k) hydx += h.at(i,k)*y.at(k)*dxj;
   hydx /= dxy;

   h1i[j]   -= hydx;
   h1[j][i] -= hydx;
  }
 }
 return ak.matrix(h1);
}
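Note that update relies upon h being symmetric, which the approximate inverse Hessian always is. As a quick sanity check, and again assuming that the ak.vector constructor accepts an array of its elements, we can confirm that, given access to update, it reproduces the result of the worked example above and consequently satisfies the secant condition

var h  = ak.matrix('identity', 2);
var dx = ak.vector([1, 0]);
var y  = ak.vector([1, 1]);
var h1 = update(h, dx, y);

// should print 2 -1 -1 1, the elements of the updated approximation
console.log(h1.at(0,0), h1.at(0,1), h1.at(1,0), h1.at(1,1));

// its product with y should recover dx, satisfying the secant condition
var h1y = ak.mul(h1, y);
console.log(h1y.at(0), h1y.at(1));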

Listing 2 gives the implementation of the quasi-Newton search, which uses the Wolfe line search to take each step and the update function to apply the BFGS rule to the approximate inverse Hessian.

Listing 2: The Quasi-Newton Search
function minimum(f, df, x, wolfe, threshold, steps) {
 var fx, dfx, adx, adfx, eps, h, y, dx, stop;

 if(ak.type(x)!==ak.VECTOR_T || x.dims()===0) {
  throw new Error('invalid starting point in ak.quasiNewtonMinimum');
 }

 // terminate immediately if the derivative is already negligible at the
 // starting point or if the function or its derivative are NaN there
 fx = f(x);
 dfx = df(x);
 adfx = ak.abs(dfx);
 eps = threshold*(1+ak.abs(x));
 if(isNaN(fx) || !(adfx>eps)) {
  return !isNaN(fx) && !isNaN(adfx) ? x : ak.vector(x.dims(), ak.NaN);
 }

 // begin with the identity matrix as the approximate inverse Hessian
 h = ak.matrix('identity', x.dims());

 do {
  // take a step with the Wolfe line search, which negates h*dfx for us
  y = wolfe(x, ak.mul(h, dfx), fx, dfx);
  dx = ak.sub(y.x, x);
  x = y.x;
  fx = y.fx;

  // converged if the step or the derivative are small, failed if NaN
  adx = ak.abs(dx);
  adfx = ak.abs(y.dfx);
  eps = threshold*(1+ak.abs(x));
  stop = isNaN(fx) || !(adx>eps && adfx>eps);
  if(!stop) {
   if(--steps===0) throw new Error('failure to converge in ak.quasiNewtonMinimum');
   h = update(h, dx, ak.sub(y.dfx, dfx));
   dfx = y.dfx;
  }
 }
 while(!stop);

 return !isNaN(fx) && !isNaN(adx) && !isNaN(adfx) ? x : ak.vector(x.dims(), ak.NaN);
}

This first calculates the value of the function and its derivative at the starting point. If the magnitude of the derivative is small compared to that of the starting point, with one added to the latter to cope with points that are very close to zero, it terminates immediately and returns the starting point, whilst if either the value or the derivative is NaN it instead returns a NaN filled vector. Otherwise it initialises the approximate inverse Hessian with the identity matrix and starts the iteration.
Each iteration begins with a Wolfe line search, saving the length of the step that it takes and updating the current point and the value of the function. The iteration is considered to have converged if either the step length or the derivative's magnitude is small, and to have failed if any of them or the value of the function is NaN. Otherwise the approximate inverse Hessian and the derivative are updated ready for the next step.
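In other words, writing \(\epsilon\) for the convergence threshold, a step from \(\mathbf{x}_k\) to \(\mathbf{x}_{k+1}\) is judged to have converged whenever
\[ \left|\Delta \mathbf{x}_k\right| \leqslant \epsilon \times \left(1 + \left|\mathbf{x}_{k+1}\right|\right) \quad \text{or} \quad \left|\nabla f\left(\mathbf{x}_{k+1}\right)\right| \leqslant \epsilon \times \left(1 + \left|\mathbf{x}_{k+1}\right|\right) \]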
Once it terminates the search returns the final point if it converged and a NaN filled vector if it failed.

Finally, the ak.quasiNewtonMinimum function defined in listing 3 first checks that the function, the derivative, the Wolfe condition constants, the convergence threshold and the maximum number of steps are valid, providing defaults for each of the last four if not defined. It then initialises the Wolfe line search and returns a function to search for a minimum from a starting point x using the minimum function.

Listing 3: ak.quasiNewtonMinimum
ak.quasiNewtonMinimum = function(f, df, c1, c2, threshold, steps) {
 var wolfe;

 if(ak.nativeType(f)!==ak.FUNCTION_T) {
  throw new Error('invalid function in ak.quasiNewtonMinimum');
 }
 if(ak.nativeType(df)!==ak.FUNCTION_T) {
  throw new Error('invalid derivative in ak.quasiNewtonMinimum');
 }

 c1 = ak.nativeType(c1)===ak.UNDEFINED_T ? 1.0e-4 : Number(c1);
 if(!(c1>0 && c1<1)) {
  throw new Error('invalid armijo constant in ak.quasiNewtonMinimum');
 }

 c2 = ak.nativeType(c2)===ak.UNDEFINED_T ? 0.9 : Number(c2);
 if(!(c2>c1 && c2<1)) {
  throw new Error('invalid curvature constant in ak.quasiNewtonMinimum');
 }

 threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
                                                       : Math.abs(threshold);
 if(isNaN(threshold)) {
  throw new Error('invalid convergence threshold in ak.quasiNewtonMinimum');
 }

 steps = ak.nativeType(steps)===ak.UNDEFINED_T ? 1000 : ak.floor(steps);
 if(!(steps>0)) throw new Error('invalid steps in ak.quasiNewtonMinimum');

 wolfe = ak.wolfeLineSearch(f, df, c1, c2, steps);
 return function(x) {return minimum(f, df, x, wolfe, threshold, steps);};
};

Program 2 demonstrates how to use ak.quasiNewtonMinimum, which is defined in QuasiNewtonMinimum.js, by comparing its performance on Rosenbrock's function to that of our ak.polytopeMinimum minimiser[4].

Program 2: Using ak.quasiNewtonMinimum
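By way of illustration, and given the sketches of f and df above, a call along the lines of the following, starting from the traditional point of \((-1.2, 1)\), should recover a point close to the minimum at \((1, 1)\)

// construct a minimiser for Rosenbrock's function with the default Wolfe
// constants, convergence threshold and maximum number of steps
var minimum = ak.quasiNewtonMinimum(f, df);

// search from the traditional starting point, again assuming that the
// ak.vector constructor accepts an array of its elements
var x = minimum(ak.vector([-1.2, 1]));

// should print values close to 1 1
console.log(x.at(0), x.at(1));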

Despite the significantly greater computational expense of determining the direction and length of each step, the BFGS quasi-Newton algorithm is nevertheless the faster of the two to converge upon the minimum. This is because it requires far fewer steps to do so, which is a particular advantage when the function is time consuming to evaluate, as demonstrated by program 3, which adds a five millisecond delay to the evaluation of both the function and its derivative.

Program 3: The Delayed Rosenbrock
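If you'd like to experiment with something similar yourself, one crude way to simulate an expensive function is to wrap it and its derivative in a busy wait, as sketched by the hypothetical delayed helper below

// returns a version of f that busily waits for ms milliseconds before
// evaluating it, crudely simulating an expensive calculation
function delayed(f, ms) {
 return function(x) {
  var start = Date.now();
  while(Date.now()-start<ms); // spin until ms milliseconds have elapsed
  return f(x);
 };
}

var slow = ak.quasiNewtonMinimum(delayed(f, 5), delayed(df, 5));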

If we can directly evaluate the derivative then the BFGS quasi-Newton algorithm is clearly preferable to the polytope algorithm when searching for a minimum.

References

[1] Big Friendly GiantS, www.thusspakeak.com, 2021.

[2] Wolfe, P., Convergence Conditions for Ascent Methods, SIAM Review, Vol. 11, No. 2, 1969.

[3] Wolfe It Down, www.thusspakeak.com, 2021.

[4] The Tripods Are Here!, www.thusspakeak.com, 2015.
