# A PR Exercise

In the last few posts we've been looking at the BFGS quasi-Newton algorithm for minimising multivariate functions[1][2]. This uses iteratively updated approximations of the Hessian matrix of second partial derivatives in order to choose directions in which to search for univariate minima, saving the expense of calculating it explicitly. A particularly useful property of the algorithm is that if the line search satisfies the Wolfe conditions[3] then the positive definiteness of the Hessian is preserved, meaning that the implied locally quadratic approximation of the function must have a minimum.
Unfortunately for large numbers of dimension the calculation of the approximation will still be relatively expensive and will require a significant amount of memory to store and so in this post we shall take a look at an algorithm that only uses the vector of first partial derivatives.

Given a positive definite matrix $$\mathbf{A}$$, a vector $$\mathbf{b}$$ and a scalar $$c$$, the quadratic function
$f(\mathbf{x}) = \tfrac12 \times \mathbf{x} \times \mathbf{A} \times \mathbf{x} - \mathbf{b} \times \mathbf{x} + c$
has the partial derivative vector
$\nabla f(\mathbf{x}) = \tfrac12 \times \mathbf{x} \times \mathbf{A} + \tfrac12 \times \mathbf{A} \times \mathbf{x} - \mathbf{b} = \mathbf{A} \times \mathbf{x} - \mathbf{b}$
since $$\mathbf{A}$$ is symmetric, and must take a minimum value when all of its elements equal zero.

Derivation 1: Linear Combinations Of Conjugate Vectors
A positive definite matrix $$\mathbf{M}$$ has a Cholesky decomposition[4]
$\mathbf{M} = \mathbf{L} \times \mathbf{L}^\mathrm{T}$
where $$\mathbf{L}$$ is lower triangular and invertible. We therefore have
\begin{align*} \mathbf{u} \times \mathbf{M} \times \mathbf{v} &= \mathbf{u} \times \left(\mathbf{L} \times \mathbf{L}^\mathrm{T}\right) \times \mathbf{v}\\ &= \left(\mathbf{u} \times \mathbf{L}\right) \times \left(\mathbf{L}^\mathrm{T} \times \mathbf{v}\right)\\ &= \left(\mathbf{L}^\mathrm{T} \times \mathbf{u}\right) \times \left(\mathbf{L}^\mathrm{T} \times \mathbf{v}\right) \end{align*}
and so $$\mathbf{L}^\mathrm{T}$$ defines a new coordinate system in which vectors that are conjugate with respect to $$\mathbf{M}$$ are orthogonal.
If in $$n$$ dimensions we have a set of $$n$$ orthogonal vectors $$\mathbf{v}_i$$ then every point can be represented as a linear combination of them
$\mathbf{x} = \sum_{i=0}^{n-1} \alpha_i \times \mathbf{v}_i$
If instead the vectors are conjugate with respect to $$\mathbf{M}$$ then we have
$\mathbf{L}^\mathrm{T} \times \mathbf{x} = \mathbf{L}^\mathrm{T} \times \sum_{i=0}^{n-1} \alpha_i \times \mathbf{v}_i = \sum_{i=0}^{n-1} \alpha_i \times \mathbf{L}^\mathrm{T} \times \mathbf{v}_i$
Since $$\mathbf{L}$$, and consequently $$\mathbf{L}^\mathrm{T}$$, are invertible, every point can also be expressed as a linear combination of these conjugate vectors.
Given a matrix $$\mathbf{M}$$, a pair of vectors $$\mathbf{u}$$ and $$\mathbf{v}$$ satisfying
$\mathbf{u} \times \mathbf{M} \times \mathbf{v} = 0$
are known as being conjugate with respect to $$\mathbf{M}$$. We can represent any point in an $$n$$ dimensional space as a linear combination of a set of $$n$$ vectors $$\mathbf{v}_i$$ that are conjugate with respect to a positive definite $$\mathbf{M}$$, as proven in derivation 1.
If we have such a set for $$\mathbf{A}$$ then we can express the solution $$\mathbf{x}^\ast$$ of the quadratic equation with
$\mathbf{x}^\ast = \sum_{i=0}^{n-1} \alpha_i \times \mathbf{v}_i$
for some set of coefficients $$\alpha_i$$, where $$\sum$$ is the summation sign. We consequently have
$\mathbf{b} = \mathbf{A} \times \mathbf{x}^\ast = \sum_{i=0}^{n-1} \alpha_i \times \mathbf{A} \times \mathbf{v}_i$
and
\begin{align*} \mathbf{v}_j \times \mathbf{b} &= \sum_{i=0}^{n-1} \alpha_i \times \mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_i\\ &= \alpha_j \times \mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j \end{align*}
so that the coefficients are given by
$\alpha_j = \frac{\mathbf{v}_j \times \mathbf{b}}{\mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j}$

### An Iterative Approach

We don't actually need to find a set of conjugate vectors in advance but can instead generate them one at a time to take steps towards the minimum with
$\mathbf{x}_{j+1} = \mathbf{x}_j + \alpha_j \times \mathbf{v}_j$
where
\begin{align*} \mathbf{v}_j &= \mathbf{r}_j - \sum_{i=0}^{j-1} \frac{\mathbf{v}_i \times \mathbf{A} \times \mathbf{r}_j}{\mathbf{v}_i \times \mathbf{A} \times \mathbf{v}_i} \times \mathbf{v}_i\\ \alpha_j &= \frac{\mathbf{v}_j \times \mathbf{r}_j}{\mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j} \end{align*}
for
$\mathbf{r}_j = \mathbf{b} - \mathbf{A} \times \mathbf{x}_j$
as proven in derivation 2.

Derivation 2: The Conjugate Vectors And Their Coefficients
We can easily see that the $$\mathbf{v}_j$$ are conjugate with respect to $$\mathbf{A}$$ since for any $$k$$ less than $$j$$ we have
\begin{align*} \mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_k &= \mathbf{r}_j \times \mathbf{A} \times \mathbf{v}_k - \sum_{i=0}^{j-1} \frac{\mathbf{v}_i \times \mathbf{A} \times \mathbf{r}_j}{\mathbf{v}_i \times \mathbf{A} \times \mathbf{v}_i} \times \mathbf{v}_i \times \mathbf{A} \times \mathbf{v}_k\\ &= \mathbf{r}_j \times \mathbf{A} \times \mathbf{v}_k - \frac{\mathbf{v}_k \times \mathbf{A} \times \mathbf{r}_j}{\mathbf{v}_k \times \mathbf{A} \times \mathbf{v}_k} \times \mathbf{v}_k \times \mathbf{A} \times \mathbf{v}_k\\ &= \mathbf{r}_j \times \mathbf{A} \times \mathbf{v}_k - \mathbf{v}_k \times \mathbf{A} \times \mathbf{r}_j = 0 \end{align*}
with the last step following from the fact that $$\mathbf{A}$$ is symmetric. The values of the coefficients can be derived from the value of the quadratic at $$\mathbf{x}_{j+1}$$
\begin{align*} f\left(\mathbf{x}_{j+1}\right) &= \tfrac12 \times \mathbf{x}_{j+1} \times \mathbf{A} \times \mathbf{x}_{j+1} - \mathbf{b} \times \mathbf{x}_{j+1} + c\\ &= \tfrac12 \times \left(\mathbf{x}_j + \alpha_j \times \mathbf{v}_j\right) \times \mathbf{A} \times \left(\mathbf{x}_j + \alpha_j \times \mathbf{v}_j\right) - \mathbf{b} \times \left(\mathbf{x}_j + \alpha_j \times \mathbf{v}_j\right) + c\\ &= \tfrac12 \times \mathbf{x}_j \times \mathbf{A} \times \mathbf{x}_j - \mathbf{b} \times \mathbf{x}_j + c\\ &\quad\quad + \tfrac12 \times \mathbf{x}_j \times \mathbf{A} \times \left(\alpha_j \times \mathbf{v}_j\right) + \tfrac12 \times \left(\alpha_j \times \mathbf{v}_j\right) \times \mathbf{A} \times \mathbf{x}_j\\ &\quad\quad + \tfrac12 \times \left(\alpha_j \times \mathbf{v}_j\right) \times \mathbf{A} \times \left(\alpha_j \times \mathbf{v}_j\right) - \mathbf{b} \times \alpha_j \times \mathbf{v}_j\\ &= f\left(\mathbf{x}_j\right) + \alpha_j \times \mathbf{x}_j \times \mathbf{A} \times \mathbf{v}_j + \tfrac12 \times \alpha_j^2 \times \mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j - \alpha_j \times \mathbf{b} \times \mathbf{v}_j \end{align*}
Differentiating with respect to $$\alpha_j$$ yields
$\frac{\mathrm{d}f}{\mathrm{d}\alpha_j}\left(\mathbf{x}_{j+1}\right) = \mathbf{x}_j \times \mathbf{A} \times \mathbf{v}_j + \alpha_j \times \mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j - \mathbf{b} \times \mathbf{v}_j$
Setting this to zero and rearranging gives the required result
$\alpha_j = \frac{\mathbf{b} \times \mathbf{v}_j - \mathbf{x}_j \times \mathbf{A} \times \mathbf{v}_j}{\mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j} = \frac{\mathbf{v}_j \times \left(\mathbf{b} - \mathbf{A} \times \mathbf{x}_j\right)}{\mathbf{v}_j \times \mathbf{A} \times \mathbf{v}_j}$

Note that this is guaranteed to reach the minimum after $$n$$ steps, which we can see by defining
$\mathbf{x}^\ast = \mathbf{x}_0 + \Delta \mathbf{x}^\ast$
so that
\begin{align*} \mathbf{A} \times \mathbf{x}^\ast &= \mathbf{b}\\ \mathbf{A} \times \left(\mathbf{x}_0 + \Delta \mathbf{x}^\ast\right) &= \mathbf{b}\\ \mathbf{A} \times \Delta \mathbf{x}^\ast &= \mathbf{b} - \mathbf{A} \times \mathbf{x}_0 \end{align*}
which can't require more than $$n$$ conjugate vectors with respect to $$\mathbf{A}$$ to solve. If $$n$$ is particularly large, however, then stopping after a smaller number of steps allows us to efficiently converge upon a reasonable approximation of the minimum.

Program 1 demonstrates this iteration by minimising a two dimensional quadratic with a random positive definite $$\mathbf{A}$$ and $$\mathbf{b}$$ and $$c$$ equal to zero, so that the minimum is at the origin and equals zero, starting from a random point in the lower part of the plain.

Program 1: Minimising A Random Quadratic

### The Fletcher-Reeves Algorithm

Now it probably hasn't escaped your notice that this algorithm doesn't live up to the promise of only using the first partial derivatives since $$\mathbf{A}$$ is the Hessian of the quadratic equation. We can trivially eliminate it from the calculation of $$\mathbf{r}_j$$ by noting that
$\mathbf{r}_j = \mathbf{b} - \mathbf{A} \times \mathbf{x}_j = -\nabla f(\mathbf{x}_j)$
Unfortunately it simply isn't possible to derive conjugate directions without the Hessian, but we can still take steps that follow a similar scheme with
$\mathbf{v}_j = \begin{cases} \mathbf{r}_j & j=0\\ \mathbf{r}_j + \beta_j \times \mathbf{v}_{j-1} & \text{otherwise} \end{cases}$
where the $$\beta_j$$ are chosen so that the algorithm will make progress towards the minimum that is, in some sense, good enough when performing a line search in their direction for the $$\alpha_j$$ to update the current point
$\mathbf{x}_{j+1} = \mathbf{x}_j + \alpha_j \times \mathbf{v}_j$
The first such scheme is the Fletcher-Reeves algorithm[5] which uses
$\beta_j = \frac{\mathbf{r}_j \times \mathbf{r}_j}{\mathbf{r}_{j-1} \times \mathbf{r}_{j-1}}$
and guarantees that each direction satisfies a constraint upon the steepness of its gradient if an exact line search is used to find each $$a_j$$. Al-Baali later showed[6] that if an inexact line search that satisfies
$\left|\nabla f\left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_j\right| \leqslant -\sigma \times \nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j$
for positive $$\sigma$$ is used instead then the inequalities
$-\sum_{i=0}^j \sigma^i \leqslant \frac{\nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)} \leqslant -2 + \sum_{i=0}^j \sigma^i$
will be satisfied for all $$j$$ for which the magnitude of $$\nabla f\left(\mathbf{x}_j\right)$$ is not zero, as reproduced in derivation 3.

Derivation 3: The Descent Property
Firstly, for $$j$$ equal to zero we have
$\frac{\nabla f\left(\mathbf{x}_0\right) \times \mathbf{v}_0}{\nabla f\left(\mathbf{x}_0\right) \times \nabla f\left(\mathbf{x}_0\right)} = \frac{\nabla f\left(\mathbf{x}_0\right) \times -\nabla f\left(\mathbf{x}_0\right)}{\nabla f\left(\mathbf{x}_0\right) \times \nabla f\left(\mathbf{x}_0\right)} = -1\\ -\sum_{i=0}^0 \sigma^i = -1 \quad\quad\quad -2 + \sum_{i=0}^0 \sigma^i = -2 + 1 = -1$
which trivially satisfy the inequalities.
Next, we have
\begin{align*} \frac{\nabla f\left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_{j+1}}{\nabla f\left(\mathbf{x}_{j+1}\right) \times \nabla f\left(\mathbf{x}_{j+1}\right)} &= \frac{\nabla f\left(\mathbf{x}_{j+1}\right) \times \left(-\nabla f\left(\mathbf{x}_{j+1}\right)+\frac{\nabla f\left(\mathbf{x}_{j+1}\right) \times \nabla f\left(\mathbf{x}_{j+1}\right)}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)}\times \mathbf{v}_j\right)}{\nabla f\left(\mathbf{x}_{j+1}\right) \times \nabla f\left(\mathbf{x}_{j+1}\right)}\\ &= -1 + \frac{\nabla f \left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_j}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)} \end{align*}
If the line search condition is met then
\begin{align*} \sigma \times \nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j \leqslant \nabla f\left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_j \leqslant -\sigma \times \nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j \end{align*}
so that
\begin{align*} -1 + \sigma \times \frac{\nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)} \leqslant \frac{\nabla f\left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_{j+1}}{\nabla f\left(\mathbf{x}_{j+1}\right) \times \nabla f\left(\mathbf{x}_{j+1}\right)} \leqslant -1 - \sigma \times \frac{\nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)} \end{align*}
Finally, if
$-\sum_{i=0}^j \sigma^i \leqslant \frac{\nabla f\left(\mathbf{x}_j\right) \times \mathbf{v}_j}{\nabla f\left(\mathbf{x}_j\right) \times \nabla f\left(\mathbf{x}_j\right)}$
then
\begin{align*} -1 - \sigma \times \sum_{i=0}^j \sigma^i = -\sum_{i=0}^{j+1} \sigma^i \leqslant \frac{\nabla f\left(\mathbf{x}_{j+1}\right) \times \mathbf{v}_{j+1}}{\nabla f\left(\mathbf{x}_{j+1}\right) \times \nabla f\left(\mathbf{x}_{j+1}\right)} \leqslant -1 + \sigma \times \sum_{i=0}^j \sigma^i = -2 + \sum_{i=0}^{j+1} \sigma^i \end{align*}
and so, by induction, the inequalities must hold for all $$j$$.

The sum is bounded above by the result of the geometric series
$\sum_{i=0}^\infty \sigma^i = \frac{1}{1-\sigma}$
and so the right hand side of the inequality must be negative if $$\sigma$$ is less than one half, meaning that the direction must be one that heads towards the minimum.

Note that the line search requirement is related to the Wolfe curvature condition
$\Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) \geqslant c_2 \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right)$
in that if $$c_2$$ equals $$\sigma$$ then if the left hand side is negative or is positive but not too large then the former will be satisfied. If not, we can simply reset the Fletcher-Reeves algorithm by setting $$\beta_j$$ to zero which we must do in any event after the $$n^\mathrm{th}$$ step.

Using our ak.wolfeLineSearch, program 2 applies this to Rosenbrock's function
$f(\mathbf{x}) = 100 \times \left(x_1 - x_0^2\right)^2 + \left(1-x_0\right)^2$
which has a long, curved, steep sided and shallow bottomed valley with a minimum at $$(1,1)$$ and is frequently used as a test for minimisation algorithms.

Program 2: The Fletcher-Reeves Algorithm

This takes a somewhat unexpected route to the minimum, first stepping to a distant part of the valley floor. Furthermore its progress is rather slow as it approaches the minimum and struggles to find directions that don't point towards the sides of the valley.

### The Polak-Ribière Algorithm

The Fletcher-Reeves algorithm isn't the only way to choose search directions and of the numerous schemes, Polak and Ribière's[7]
$\beta_j = \frac{\mathbf{r}_j \times \left(\mathbf{r}_j - \mathbf{r}_{j-1}\right)}{\mathbf{r}_{j-1} \times \mathbf{r}_{j-1}}$
has the advantage that a negative value implies that a restart is required, which we can automate with
$\beta_j = \max\left(0,\frac{\mathbf{r}_j \times \left(\mathbf{r}_j - \mathbf{r}_{j-1}\right)}{\mathbf{r}_{j-1} \times \mathbf{r}_{j-1}}\right)$
Program 3 demonstrates its application to Rosenbrock's function.

Program 3: The Polak-Ribière Algorithm

This follows a rather less surprising path, although it too struggles to navigate the valley floor as it nears the minimum.

### ak.conjugateMinimum

We shall use the latter for our implementation of the conjugate gradients algorithm, as illustrated in listing 1.

Listing 1: The Conjugate Gradient Search
function update(r, b, v, n) {
var i;

r = r.toArray();
for(i=0;i<n;++i) r[i] += b*v.at(i);
return ak.vector(r);
}

function minimum(f, df, x, wolfe, threshold, steps) {
var eps, r0, r1, r02, r12, v, d, b, stop;

if(ak.type(x)!==ak.VECTOR_T || x.dims()===0) {
throw new Error('invalid starting point in ak.conjugateMinimum');
}

n = x.dims();
fx = f(x);
dfx = df(x);
eps = threshold*(1+ak.abs(x));
return !isNaN(fx) && !isNaN(adfx) ? x : ak.vector(n, ak.NaN);
}

r0 = ak.neg(dfx);
v = r0;
d = x.dims();

do {
y = wolfe(x, v, fx, dfx);
dx = ak.sub(y.x, x);
x = y.x;
fx = y.fx;

eps = threshold*(1+ak.abs(y.x));
if(!stop) {
if(--steps===0) throw new Error('failure to converge in ak.conjugateMinimum');
r1 = ak.neg(y.dfx);
d = (d+1)%n;
b = d!==0 ? Math.max(0, (r12-ak.mul(r1, r0))/r02) : 0;
v = b!==0 ? update(r1, b, v, n) : r1;
dfx = y.dfx;
r0 = r1;
r02 = r12;
if(b===0) d = 0;
}
}
while(!stop);

}


This is more or less identical to our ak.quasiNewtonMinimum implementation of the quasi-Newton algorithm, determining convergence by comparing magnitudes of the change in the function's value and the current gradient to the magnitude of the current point, adding one to the latter to cope with points that are very close to the origin, and returning a vector populated with NaNs of the value of the function or the magnitude of its derivative are NaN.
The iterations begin with a Wolfe line search and, if not stopping, update the search direction according to the Polak-Ribière formula with an automatic reset, albeit with the minor optimisation
$\mathbf{r}_j \times \left(\mathbf{r}_j - \mathbf{r}_{j-1}\right) = \mathbf{r}_j \times \mathbf{r}_j - \mathbf{r}_j \times \mathbf{r}_{j-1}$
which, together with the in place calculation of the new direction using update, avoids the creation of temporary vectors, additionally enforcing a reset after $$n$$ steps since the last.

Listing 2 gives the implementation of ak.conjugateMinimum, defined in ConjugateMinimum.js, which checks the validity of the function and derivative f and df, the Wolfe constants c1 and c2, the convergence threshold and maximum steps, providing appropriate default values for the last four when undefined.

Listing 2: ak.conjugateMinimum
ak.conjugateMinimum = function(f, df, c1, c2, threshold, steps) {
var wolfe;

if(ak.nativeType(f)!==ak.FUNCTION_T) {
throw new Error('invalid function in ak.conjugateMinimum');
}
if(ak.nativeType(df)!==ak.FUNCTION_T) {
throw new Error('invalid derivative in ak.conjugateMinimum');
}

c1 = ak.nativeType(c1)===ak.UNDEFINED_T ? 1.0e-4 : Number(c1);
if(!(c1>0 && c1<1)) {
throw new Error('invalid armijo constant in ak.conjugateMinimum');
}

c2 = ak.nativeType(c2)===ak.UNDEFINED_T ? 0.1 : Number(c2);
if(!(c2>c1 && c2<1)) {
throw new Error('invalid curvature constant in ak.conjugateMinimum');
}

threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
: Math.abs(threshold);
if(isNaN(threshold)) {
throw new Error('invalid convergence threshold in ak.conjugateMinimum');
}

steps = ak.nativeType(steps)===ak.UNDEFINED_T ? 10000 : ak.floor(steps);
if(!(steps>0)) throw new Error('invalid steps in ak.conjugateMinimum');

wolfe = ak.wolfeLineSearch(f, df, c1, c2, steps);
return function(x) {return minimum(f, df, x, wolfe, threshold, steps);};
};


Program 4 illustrates how to use ak.conjugateMinimum, comparing its performance for Rosenbrock's function to that of ak.quasiNewtonMinimum.

Program 4: Using ak.conjugateMinimum

Being significantly less efficient, the conjugate gradients algorithm should only be preferred over the quasi-Newton algorithm if the number of dimensions is large enough to make the calculation and storage of the approximate Hessian matrix prohibitive.

### References

[1] Big Friendly GiantS, www.thusspakeak.com, 2021.

[2] Bring Out The Big Flipping GunS, www.thusspakeak.com, 2021.

[3] Wolfe, P., Convergence Conditions for Ascent Methods, SIAM Review, Vol. 11, No. 2, 1969.

[4] Are You Definitely Positive, Mr Cholesky?, www.thusspakeak.com, 2015.

[5] Fletcher, R. & Reeves, C., Function minimization by conjugate gradients, The Computer Journal, Vol. 7, No. 2, 1964.

[6] Al-Baali, M., Descent Property and Global Convergence of the Fletcher-Reeves Method with Inexact Line Search, IMA Journal of Numerical Analysis, Vol. 5, No. 1, 1985.

[7] Polak, E. & Ribière, G., Note sur la convergence de méthodes de directions conjuguées, Modélisation Mathématique et Analyse Numérique, Vol. 3, No. 1, 1969.