Big Friendly GiantS


In the previous post[1] we saw how we could perform a univariate line search for a point that satisfies the Wolfe conditions[2] meaning that it is reasonably close to a minimum and takes a lot less work to find than the minimum itself. Line searches are used in a class of multivariate minimisation algorithms which iteratively choose directions in which to proceed, in particular those that use approximations of the Hessian matrix of second partial derivatives of a function to do so, similarly to how the Levenberg-Marquardt multivariate inversion algorithm[3] uses a diagonal matrix in place of the sum of the products of its Hessian matrices for each element and the error in that element's current value, and in this post we shall take a look at one of them.

A Step In The Right Direction

We can approximate the value of a function at some step \(\Delta \mathbf{x}\) from an initial point \(\mathbf{x}\) using the first three terms of its multivariate Taylor series
\[ f(\mathbf{x} + \Delta \mathbf{x}) \approx f(\mathbf{x}) + \Delta \mathbf{x} \times \nabla f(\mathbf{x}) + \tfrac12 \times \Delta \mathbf{x} \times \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} \]
where \(\nabla f(\mathbf{x})\) is the vector of first partial derivatives of \(f\) at \(\mathbf{x}\) and \(\mathbf{H}(\mathbf{x})\) is its Hessian. The derivative of this approximation with respect to \(\Delta \mathbf{x}\) is given by
\[ \nabla f(\mathbf{x}) + \tfrac12 \times \Delta \mathbf{x} \times \mathbf{H}(\mathbf{x}) + \tfrac12 \times \mathbf{H}(\mathbf{x})\times \Delta \mathbf{x} \]
Since \(\mathbf{H}(\mathbf{x})\) is symmetric we have
\[ \Delta \mathbf{x} \times \mathbf{H}(\mathbf{x}) = \mathbf{H}(\mathbf{x})\times \Delta \mathbf{x} \]
and we can therefore approximate the derivative of \(f\) at the result of the step with
\[ \nabla f(\mathbf{x} + \Delta \mathbf{x}) \approx \nabla f(\mathbf{x}) + \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} \]
Since the derivative equals zero at a minimum, it's fair to assume that a reasonable step would be given by
\[ \nabla f(\mathbf{x}) + \mathbf{H}(\mathbf{x}) \times \Delta \mathbf{x} = \mathbf{0}\\ \Delta \mathbf{x} = -\mathbf{H}^{-1}(\mathbf{x}) \times \nabla f(\mathbf{x}) \]
Iteratively taking such steps until convergence is known as Newton's method, although it's important to note that it won't necessarily converge upon a minimum since local maxima and saddle points, which are minima in some directions and maxima in others, also have gradients of zero. We can determine if it has by testing whether the Hessian is positive definite at the point of convergence.

Quasi-Newton Methods

Unfortunately calculating the Hessian is a relatively expensive operation and so quasi-Newton methods approximate it instead. Specifically, if the \(k^\mathrm{th}\) step starts at \(\mathbf{x}_k\) and moves \(\Delta \mathbf{x}_k\) then the approximate Hessian at the next step \(\hat{\mathbf{H}}_{k+1}\) is chosen to satisfy
\[ \begin{align*} \nabla f\left(\mathbf{x}_k + \Delta \mathbf{x}_k\right) &= \nabla f\left(\mathbf{x}_k\right) + \hat{\mathbf{H}}_{k+1} \times \Delta \mathbf{x}_k\\ \hat{\mathbf{H}}_{k+1} \times \Delta \mathbf{x}_k &= \nabla f\left(\mathbf{x}_k + \Delta \mathbf{x}_k\right) - \nabla f\left(\mathbf{x}_k\right) = \mathbf{y}_k \end{align*} \]
In \(n\) dimensions this represents \(n\) simultaneous equations with \(\frac12 \times n \times (n+1)\) unknown variables, since every \(\hat{\mathbf{H}}_k\) must be symmetric, and so doesn't have a unique solution. Quasi-Newton methods therefore impose further constraints such as minimal deviation, in some sense, from \(\hat{\mathbf{H}}_k\) and conservation of positive definiteness.
For example, we might choose to define minimal deviation by reducing the number of unknowns to \(n\) with
\[ \begin{align*} \hat{\mathbf{H}}_{k+1} &= \hat{\mathbf{H}}_k + \Delta \hat{\mathbf{H}}_k\\ \Delta \hat{\mathbf{H}}_k &= \mathbf{h}_k \otimes \mathbf{h}_k \end{align*} \]
where \(\otimes\) is the matrix outer product of a pair of vectors, satisfying
\[ \begin{align*} \mathbf{M} &= \mathbf{u} \otimes \mathbf{v}\\ m_{ij} &= u_i \times v_j \end{align*} \]
implying that
\[ \mathbf{h}_k \otimes \mathbf{h}_k = \frac{\left(\mathbf{y}_k - \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k\right) \otimes \left(\mathbf{y}_k - \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k\right)}{\left(\mathbf{y}_k - \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k\right) \times \Delta \mathbf{x}_k} \]
as proven in derivation 1.

Derivation 1: The Outer Product
To satisfy
\[ \left(\hat{\mathbf{H}} + \mathbf{h} \otimes \mathbf{h}\right) \times \Delta \mathbf{x} = \mathbf{y} \]
we require
\[ \mathbf{h} \otimes \mathbf{h} \times \Delta \mathbf{x} = \mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x} \]
Noting that
\[ \mathbf{u} \otimes \mathbf{v} \times \mathbf{w} = \mathbf{u} \times \left(\mathbf{v} \times \mathbf{w}\right) \]
which can be verified by calculating the \(i^\mathrm{th}\) element of the left hand side
\[ \sum_j \left(\mathbf{u} \otimes \mathbf{v}\right)_{ij} \times w_j = \sum_j u_i \times v_j \times w_j = u_i \times \left(\mathbf{v} \times \mathbf{w}\right) \]
where \(\sum\) is the summation sign, we have
\[ \begin{align*} \mathbf{h} \otimes \mathbf{h} \times \Delta \mathbf{x} &= \frac{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \otimes \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x}} \times \Delta \mathbf{x}\\ &= \left(\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \otimes \frac{\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x}}\right) \times \Delta \mathbf{x}\\ &= \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \left(\frac{\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x}} \times \Delta \mathbf{x}\right)\\ &= \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \frac{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x}}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x}} = \mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x} \end{align*} \]

If \(\mathbf{M}\) is an invertible matrix then the inverse of its sum with the outer product of a pair of vectors \(\mathbf{u}\) and \(\mathbf{v}\) is given by
\[ \left(\mathbf{M} + \mathbf{u} \otimes \mathbf{v}\right)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}} \]
if the denominator is non-zero, as proven in derivation 2.

Derivation 2: The Inverse Of The Sum
\[ \left(\mathbf{M}^{-1} - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\right) \times \left(\mathbf{M} + \mathbf{u} \otimes \mathbf{v}\right)\\ \quad= \mathbf{I} - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right)}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}} + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right)}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right)}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) - \frac{\left(\mathbf{M}^{-1} \times \mathbf{u} + \left(\mathbf{M}^{-1} \times \mathbf{u}\right) \times \left(\mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}\right)\right) \otimes \mathbf{v}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) - \frac{\left(\left(\mathbf{M}^{-1} \times \mathbf{u}\right) \times \left(1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}\right)\right) \otimes \mathbf{v}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) - \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) = \mathbf{I} \]
\[ \left(\mathbf{M} + \mathbf{u} \otimes \mathbf{v}\right) \times \left(\mathbf{M}^{-1} - \frac{\mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\right)\\ \quad= \mathbf{I} + \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} - \frac{\left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} + \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} \times \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1}}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} - \frac{\mathbf{u} \otimes \left(\left(1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}\right) \times \left(\mathbf{v} \times \mathbf{M}^{-1}\right)\right)}{1 + \mathbf{v} \times \mathbf{M}^{-1} \times \mathbf{u}}\\ \quad= \mathbf{I} + \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} - \left(\mathbf{u} \otimes \mathbf{v}\right) \times \mathbf{M}^{-1} = \mathbf{I} \]

Exploiting this allows us to update the inverse of the approximate Hessian directly with
\[ \hat{\mathbf{H}}_{k+1}^{-1} = \hat{\mathbf{H}}_k^{-1} + \frac{\left(\Delta \mathbf{x} - \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k\right) \otimes \left(\Delta \mathbf{x} - \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k\right)}{\left(\Delta \mathbf{x} - \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k\right) \times \mathbf{y}_k} \]
as shown by derivation 3.

Derivation 3: Updating The Inverse Of The Approximate Hessian
Noting that \(\hat{\mathbf{H}}^{-1}\) is symmetric so that
\[ \begin{align*} \mathbf{y} \times \hat{\mathbf{H}}^{-1} = \hat{\mathbf{H}}^{-1} \times \mathbf{y} \end{align*} \]
we have
\[ \begin{align*} \frac{\hat{\mathbf{H}}^{-1} \times \left(\mathbf{h} \otimes \mathbf{h}\right) \times \hat{\mathbf{H}}^{-1}}{1 + \mathbf{h} \times \hat{\mathbf{H}}^{-1} \times \mathbf{h}} &= \frac{\hat{\mathbf{H}}^{-1} \times \left(\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \otimes \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)\right) \times \hat{\mathbf{H}}^{-1}}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x} + \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \hat{\mathbf{H}}^{-1} \times \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)}\\ &= \frac{\left(\hat{\mathbf{H}}^{-1} \times \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)\right) \otimes \left(\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \hat{\mathbf{H}}^{-1}\right)}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x} + \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \hat{\mathbf{H}}^{-1} \times \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)}\\ &= \frac{\left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right) \otimes \left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right)}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \Delta \mathbf{x} + \left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right)}\\ &= \frac{\left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right) \otimes \left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right)}{\left(\mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \left(\hat{\mathbf{H}}^{-1} \times \mathbf{y}\right)}\\ &= \frac{\left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right) \otimes \left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right)}{\left(\hat{\mathbf{H}}^{-1} \times \mathbf{y} - \Delta \mathbf{x}\right) \times \mathbf{y}}\\ &= -\frac{\left(\Delta \mathbf{x} - \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right) \otimes \left(\Delta \mathbf{x} - \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right)}{\left(\Delta \mathbf{x} - \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right) \times \mathbf{y}} \end{align*} \]

Unfortunately this scheme does not guarantee that the positive definiteness of the Hessian is preserved and so we should prefer an alternative approach that does.

Staying Positive

The most commonly used is the Broyden–Fletcher–Goldfarb–Shanno[4], or BFGS, update which is defined by
\[ \begin{align*} \Delta \hat{\mathbf{H}}_k &= \frac{\mathbf{y}_k \otimes \mathbf{y}_k}{\mathbf{y}_k \times \Delta \mathbf{x}_k} - \frac{\hat{\mathbf{H}}_k \times \left(\Delta \mathbf{x}_k \otimes \Delta \mathbf{x}_k\right) \times \hat{\mathbf{H}}_k}{\Delta \mathbf{x}_k \times \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k}\\ \Delta \hat{\mathbf{H}}_k^{-1} &= \frac{\left(\Delta \mathbf{x}_k \times \mathbf{y}_k + \mathbf{y}_k \times \hat{\mathbf{H}}_k^{-1} \times \mathbf{y}_k\right) \times \left(\Delta \mathbf{x}_k \otimes \Delta \mathbf{x}_k\right)}{\left(\Delta \mathbf{x}_k \times \mathbf{y}_k\right)^2} - \frac{\hat{\mathbf{H}}_k^{-1} \times \left(\mathbf{y}_k \otimes \Delta \mathbf{x}_k\right) + \left(\Delta \mathbf{x}_k \otimes \mathbf{y}_k\right) \times \hat{\mathbf{H}}_k^{-1}}{\Delta \mathbf{x}_k \times \mathbf{y}_k} \end{align*} \]
This adds two outer products to update the Hessian
\[ \hat{\mathbf{H}}_{k+1} = \hat{\mathbf{H}}_k + a_k \times \mathbf{u}_k \otimes \mathbf{u}_k + b_k \times \mathbf{v}_k \otimes \mathbf{v}_k \]
\[ \begin{align*} \mathbf{u}_k &= \mathbf{y}_k\\ \mathbf{v}_k &= \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k \end{align*} \]
implying that
\[ \begin{align*} a_k &= \frac{1}{\mathbf{y}_k \times \Delta \mathbf{x}_k}\\ b_k &= -\frac{1}{\Delta \mathbf{x}_k \times \hat{\mathbf{H}}_k \times \Delta \mathbf{x}_k} \end{align*} \]
as shown by derivation 4.

Derivation 4: The Values Of a And b
\[ \left(\hat{\mathbf{H}} + \frac{\mathbf{y} \otimes \mathbf{y}}{\mathbf{y} \times \Delta \mathbf{x}} - \frac{\left(\hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \otimes \left(\hat{\mathbf{H}} \times \Delta \mathbf{x}\right)}{\Delta \mathbf{x} \times \hat{\mathbf{H}} \times \Delta \mathbf{x}}\right) \times \Delta \mathbf{x} \]
\[ \hat{\mathbf{H}} \times \Delta \mathbf{x} + \frac{\left(\mathbf{y} \otimes \mathbf{y}\right) \times \Delta \mathbf{x}}{\mathbf{y} \times \Delta \mathbf{x}} - \frac{\hat{\mathbf{H}} \times \left(\Delta \mathbf{x} \otimes \Delta \mathbf{x}\right) \times \hat{\mathbf{H}} \times \Delta \mathbf{x}}{\Delta \mathbf{x} \times \hat{\mathbf{H}} \times \Delta \mathbf{x}}\\ \quad= \hat{\mathbf{H}} \times \Delta \mathbf{x} + \frac{\mathbf{y} \times \left(\mathbf{y} \times \Delta \mathbf{x}\right)}{\mathbf{y} \times \Delta \mathbf{x}} - \frac{\left(\hat{\mathbf{H}} \times \Delta \mathbf{x}\right) \times \left(\Delta \mathbf{x} \times \hat{\mathbf{H}} \times \Delta \mathbf{x}\right)}{\Delta \mathbf{x} \times \hat{\mathbf{H}} \times \Delta \mathbf{x}}\\ \quad= \hat{\mathbf{H}} \times \Delta \mathbf{x} + \mathbf{y} - \hat{\mathbf{H}} \times \Delta \mathbf{x} = \mathbf{y} \]
as required.

Derivation 5 similarly demonstrates that the update to the inverse Hessian satisfies the relationship between \(\Delta \mathbf{x}_k\) and \(\mathbf{y}_k\).

Derivation 5: The Inverse Update
Defining \(\mathbf{a}\) as the first term in the update of the inverse Hessian, we have
\[ \begin{align*} \mathbf{a} \times \mathbf{y} &= \frac{\left(\Delta \mathbf{x} \times \mathbf{y} + \mathbf{y} \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right) \times \left(\Delta \mathbf{x} \otimes \Delta \mathbf{x}\right)}{\left(\Delta \mathbf{x} \times \mathbf{y}\right)^2} \times \mathbf{y}\\ &= \frac{\left(\Delta \mathbf{x} \times \mathbf{y} + \mathbf{y} \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right) \times \left(\Delta \mathbf{x} \otimes \Delta \mathbf{x}\right) \times \mathbf{y}}{\left(\Delta \mathbf{x} \times \mathbf{y}\right)^2}\\ &= \frac{\left(\Delta \mathbf{x} \times \mathbf{y} + \mathbf{y} \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right) \times \Delta \mathbf{x} \times \left(\Delta \mathbf{x} \times \mathbf{y}\right)}{\left(\Delta \mathbf{x} \times \mathbf{y}\right)^2}\\ &= \frac{\left(\Delta \mathbf{x} \times \mathbf{y} \times \Delta \mathbf{x} + \mathbf{y} \times \hat{\mathbf{H}}^{-1} \times \mathbf{y} \times \Delta \mathbf{x}\right) }{\Delta \mathbf{x} \times \mathbf{y}} = \Delta \mathbf{x} + \hat{\mathbf{H}}^{-1} \times \mathbf{y} \end{align*} \]
Similarly defining \(\mathbf{b}\) as the negation of the second term yields
\[ \begin{align*} \mathbf{b} \times \mathbf{y} &= \frac{\hat{\mathbf{H}}^{-1} \times \left(\mathbf{y} \otimes \Delta \mathbf{x}\right) + \left(\Delta \mathbf{x} \otimes \mathbf{y}\right) \times \hat{\mathbf{H}}^{-1}}{\Delta \mathbf{x} \times \mathbf{y}} \times \mathbf{y}\\ &= \frac{\hat{\mathbf{H}}^{-1} \times \left(\mathbf{y} \otimes \Delta \mathbf{x}\right) \times \mathbf{y} + \left(\Delta \mathbf{x} \otimes \mathbf{y}\right) \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}}{\Delta \mathbf{x} \times \mathbf{y}}\\ &= \frac{\hat{\mathbf{H}}^{-1} \times \mathbf{y} \times \left(\Delta \mathbf{x} \times \mathbf{y}\right) + \Delta \mathbf{x} \times \left(\mathbf{y} \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}\right)}{\Delta \mathbf{x} \times \mathbf{y}}\\ &= \frac{\hat{\mathbf{H}}^{-1} \times \mathbf{y} \times \left(\Delta \mathbf{x} \times \mathbf{y}\right) + \left(\Delta \mathbf{x} \times \mathbf{y}\right) \times \hat{\mathbf{H}}^{-1} \times \mathbf{y}}{\Delta \mathbf{x} \times \mathbf{y}} = 2 \times \hat{\mathbf{H}}^{-1} \times \mathbf{y} \end{align*} \]
Finally, the product of the updated inverse and \(\mathbf{y}\) is
\[ \left(\hat{\mathbf{H}}^{-1} + \mathbf{a} - \mathbf{b}\right) \times \mathbf{y} = \hat{\mathbf{H}}^{-1} \times \mathbf{y} + \Delta \mathbf{x} + \hat{\mathbf{H}}^{-1} \times \mathbf{y} - 2 \times \hat{\mathbf{H}}^{-1} \times \mathbf{y} = \Delta \mathbf{x} \]
as required.

To show that \(\hat{\mathbf{H}}_{k+1}\) is positive definite we first note that if \(\hat{\mathbf{H}}_k\) is positive definite then it has a Cholesky decomposition[5]
\[ \hat{\mathbf{H}}_k = \mathbf{L}_k \times \mathbf{L}^\mathrm{T}_k \]
where \(\mathbf{L}\) is lower triangular and not singular, so that it has an inverse. Defining
\[ \begin{align*} \Delta \mathbf{x}_k^\prime &= \mathbf{L}_k^\mathrm{T} \times \Delta \mathbf{x}_k\\ \mathbf{y}_k^\prime &= \mathbf{L}_k^{-1} \times \mathbf{y}_k \end{align*} \]
we have
\[ \hat{\mathbf{H}}_{k+1} = \mathbf{L}_k \times \mathbf{M}_k \times \mathbf{L}_k^\mathrm{T} \]
\[ \mathbf{M}_k = \mathbf{I} + \frac{\mathbf{y}_k^\prime \otimes \mathbf{y}_k^\prime}{\mathbf{y}_k^\prime \times \Delta \mathbf{x}_k^\prime} - \frac{\Delta \mathbf{x}_k^\prime \otimes \Delta \mathbf{x}_k^\prime}{\Delta \mathbf{x}_k^\prime \times \Delta \mathbf{x}_k^\prime} \]
as proven in derivation 6.

Derivation 6: The Transformed Update
Given the Cholesky decomposition
\[ \hat{\mathbf{H}} = \mathbf{L} \times \mathbf{L}^\mathrm{T} \]
\[ \begin{align*} \Delta \mathbf{x}^\prime &= \mathbf{L}^\mathrm{T} \times \Delta \mathbf{x} = \Delta \mathbf{x} \times \mathbf{L}\\ \mathbf{y}^\prime &= \mathbf{L}^{-1} \times \mathbf{y} = \mathbf{y} \times \mathbf{L}^{\mathrm{T}^{-1}} \end{align*} \]
the update to the Hessian can be expressed as
\[ \begin{align*} \Delta \hat{\mathbf{H}} &= \frac{\mathbf{y} \otimes \mathbf{y}}{\mathbf{y} \times \Delta \mathbf{x}} - \frac{\hat{\mathbf{H}} \times \left(\Delta \mathbf{x} \otimes \Delta \mathbf{x}\right) \times \hat{\mathbf{H}}}{\Delta \mathbf{x} \times \hat{\mathbf{H}} \times \Delta \mathbf{x}}\\ &= \frac{\mathbf{L} \times \left(\mathbf{y}^\prime \otimes \mathbf{y}^\prime\right) \times \mathbf{L}^\mathrm{T}}{\left(\mathbf{y}^\prime \times \mathbf{L}^\mathrm{T}\right) \times \left(\mathbf{L}^{\mathrm{T}^{-1}} \times \Delta \mathbf{x}^\prime\right)} - \frac{\mathbf{L} \times \left(\Delta \mathbf{x}^\prime \otimes \Delta \mathbf{x}^\prime\right) \times \mathbf{L}^\mathrm{T}}{\left(\Delta \mathbf{x}^\prime \times \mathbf{L}^{-1}\right) \times \left(\mathbf{L} \times \mathbf{L}^\mathrm{T}\right) \times \left(\mathbf{L}^{\mathrm{T}^{-1}} \times \Delta \mathbf{x}^\prime\right)}\\ &= \frac{\mathbf{L} \times \left(\mathbf{y}^\prime \otimes \mathbf{y}^\prime\right) \times \mathbf{L}^\mathrm{T}}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} - \frac{\mathbf{L} \times \left(\Delta \mathbf{x}^\prime \otimes \Delta \mathbf{x}^\prime\right) \times \mathbf{L}^\mathrm{T}}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} \end{align*} \]
so that
\[ \hat{\mathbf{H}} + \Delta \hat{\mathbf{H}} = \mathbf{L} \times \left(\mathbf{I} + \frac{\mathbf{y}^\prime \otimes \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} - \frac{\Delta \mathbf{x}^\prime \otimes \Delta \mathbf{x}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime}\right) \times \mathbf{L}^\mathrm{T} \]

Secondly, derivation 7 proves that the two non-unit eigenvalues of \(\mathbf{M}_k\) must satisfy the quadratic equation in \(\lambda\)
\[ \lambda^2 - \left(1 + \frac{\mathbf{y}_k^\prime \times \mathbf{y}_k^\prime}{\mathbf{y}_k^\prime \times \Delta \mathbf{x}_k^\prime}\right) \times \lambda + \frac{\Delta \mathbf{x}_k^\prime \times \mathbf{y}_k^\prime}{\Delta \mathbf{x}_k^\prime \times \Delta \mathbf{x}_k^\prime} = 0 \]

Derivation 7: The Eigenvalues Of M
Since \(\mathbf{M}\) adds two outer products to the identity matrix there can be at most two eigenvectors with eigenvalues not equal to one and, furthermore, they must be a linear combination of the eigenvectors of those outer products, which are trivially the vectors themselves. Given the products of those vectors with \(\mathbf{M}\)
\[ \begin{align*} \mathbf{M} \times \Delta \mathbf{x}^\prime &= \Delta \mathbf{x}^\prime + \frac{\mathbf{y}^\prime \otimes \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} \times \Delta \mathbf{x}^\prime - \frac{\Delta \mathbf{x}^\prime \otimes \Delta \mathbf{x}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} \times \Delta \mathbf{x}^\prime\\ &= \Delta \mathbf{x}^\prime + \frac{\mathbf{y}^\prime \times \left(\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime\right)}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} - \frac{\Delta \mathbf{x}^\prime \times \left(\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime\right)}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} = \mathbf{y}^\prime\\ \mathbf{M} \times \mathbf{y}^\prime &= \mathbf{y}^\prime + \frac{\mathbf{y}^\prime \otimes \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} \times \mathbf{y}^\prime - \frac{\Delta \mathbf{x}^\prime \otimes \Delta \mathbf{x}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} \times \mathbf{y}^\prime\\ &= \mathbf{y}^\prime + \frac{\mathbf{y}^\prime \times \left(\mathbf{y}^\prime \times \mathbf{y}^\prime\right)}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime} - \frac{\Delta \mathbf{x}^\prime \times \left(\Delta \mathbf{x}^\prime \times \mathbf{y}^\prime\right)}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime}\\ &= -\frac{\Delta \mathbf{x}^\prime \times \mathbf{y}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} \times \Delta \mathbf{x}^\prime + \left(1 + \frac{\mathbf{y}^\prime \times \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime}\right) \times \mathbf{y}^\prime \end{align*} \]
we require that eigenvectors
\[ a \times \Delta \mathbf{x}^\prime + b \times \mathbf{y}^\prime \]
\[ \begin{align*} -b \times \frac{\Delta \mathbf{x}^\prime \times \mathbf{y}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} \times \Delta \mathbf{x}^\prime &= \lambda \times a \times \Delta \mathbf{x}^\prime\\ a \times \mathbf{y}^\prime + b \times \left(1 + \frac{\mathbf{y}^\prime \times \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime}\right) \times \mathbf{y}^\prime &= \lambda \times b \times \mathbf{y}^\prime \end{align*} \]
Assuming that \(\Delta \mathbf{x}^\prime\) and \(\mathbf{y}^\prime\) are non-zero, which will be the case if the minimisation hasn't converged, we can cancel them out and set \(b\) to one, yielding
\[ \begin{align*} -\frac{\Delta \mathbf{x}^\prime \times \mathbf{y}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} &= a \times \lambda\\ a + \left(1 + \frac{\mathbf{y}^\prime \times \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime}\right) &= \lambda \end{align*} \]
Substituting \(a\) and rearranging, we have
\[ \lambda^2 - \left(1 + \frac{\mathbf{y}^\prime \times \mathbf{y}^\prime}{\mathbf{y}^\prime \times \Delta \mathbf{x}^\prime}\right) \times \lambda + \frac{\Delta \mathbf{x}^\prime \times \mathbf{y}^\prime}{\Delta \mathbf{x}^\prime \times \Delta \mathbf{x}^\prime} = 0 \]

Thirdly, those eigenvalues must be positive, meaning that \(\mathbf{M}_k\) is positive definite, if \(\Delta \mathbf{x}_k \times \mathbf{y}_k\) is greater than zero, as shown by derivation 8.

Derivation 8: Ensuring That The Eigenvalues Of M Are Positive
A quadratic equation
\[ a \times x^2 + b \times x + c = 0 \]
has the solutions
\[ \begin{align*} x_0 &= \frac{-b + \sqrt{b^2 - 4 \times a \times c}}{2 \times a}\\ x_1 &= \frac{-b - \sqrt{b^2 - 4 \times a \times c}}{2 \times a} \end{align*} \]
and their sum and product are trivially
\[ \begin{align*} x_0 + x_1 &= \frac{-b}{a}\\ x_0 \times x_1 &= \frac{b^2 - \left(b^2 - 4 \times a \times c\right)}{4 \times a^2} = \frac{c}{a} \end{align*} \]
For the non-unit eigenvalues of \(\mathbf{M}_k\), this means that
\[ \begin{align*} \lambda_0 + \lambda_1 &= 1 + \frac{\mathbf{y}_k^\prime \times \mathbf{y}_k^\prime}{\mathbf{y}_k^\prime \times \Delta \mathbf{x}_k^\prime}\\ \lambda_0 \times \lambda_1 &= \frac{\Delta \mathbf{x}_k^\prime \times \mathbf{y}_k^\prime}{\Delta \mathbf{x}_k^\prime \times \Delta \mathbf{x}_k^\prime} \end{align*} \]
and so all of its eigenvalues must be positive, meaning that it is positive definite, if
\[ \Delta \mathbf{x}_k^\prime \times \mathbf{y}_k^\prime = \left(\mathbf{L}_k^\mathrm{T} \times \Delta \mathbf{x}_k\right) \times \left(\mathbf{L}_k^{-1} \times \mathbf{y}_k\right) = \left(\Delta \mathbf{x}_k \times \mathbf{L}_k\right) \times \left(\mathbf{L}_k^{-1} \times \mathbf{y}_k\right) = \Delta \mathbf{x}_k \times \mathbf{y}_k > 0 \]

Penultimately, \(\hat{\mathbf{H}}_{k+1}\) is positive definite if
\[ \mathbf{z} \times \hat{\mathbf{H}}_{k+1} \times \mathbf{z} > 0 \]
for all non-zero \(\mathbf{z}\), which will be satisfied if
\[ \begin{align*} \mathbf{z} \times \left(\mathbf{L}_k \times \mathbf{M}_k \times \mathbf{L}_k^\mathrm{T}\right) \times \mathbf{z} &> 0\\ \left(\mathbf{z}^\prime \times \mathbf{L}^{-1}\right) \times \left(\mathbf{L}_k \times \mathbf{M}_k \times \mathbf{L}_k^\mathrm{T}\right) \times \left(\mathbf{L}^{\mathrm{T}^{-1}} \times \mathbf{z}^\prime\right) &> 0\\ \mathbf{z}^\prime \times \times \mathbf{M}_k \times \mathbf{z}^\prime &> 0 \end{align*} \]
which holds for all non-zero \(\mathbf{z}^\prime\) since \(\mathbf{L}_k\) is invertible.

Finally, if the Wolfe curvature condition
\[ \Delta \mathbf{x}_k \times \nabla f\left(\mathbf{x}_k + \alpha_k \times \Delta \mathbf{x}_k\right) \geqslant c_2 \times \Delta \mathbf{x}_k \times \nabla f\left(\mathbf{x}_k\right) \]
is met by appropriate \(\Delta \mathbf{x}_k\), \(\alpha_k\) and \(c_2\) then the BFGS update must preserve positive definiteness, as proven by derivation 9.

Derivation 9: The Sufficiency Of The Curvature Condition
\[ \Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) \geqslant c_2 \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) \]
we have
\[ \begin{align*} \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) &\geqslant c_2 \times \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right)\\ \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) - \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) &\geqslant c_2 \times \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) - \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right)\\ \alpha \times \Delta \mathbf{x} \times \left(\nabla f\left(\mathbf{x} + \alpha \times \Delta \mathbf{x}\right) - \nabla f\left(\mathbf{x}\right)\right) &\geqslant \left(c_2-1\right) \times \alpha \times \Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right) > 0 \end{align*} \]
since \(\alpha\) must be positive, \(c_2\) must lie strictly between zero and one and \(\Delta \mathbf{x} \times \nabla f\left(\mathbf{x}\right)\) must be negative since \(\Delta \mathbf{x}\) points in the direction of a minimum, and therefore given a step of
\[ \begin{align*} \Delta \mathbf{x}^\ast &= \alpha \times \Delta \mathbf{x}\\ \mathbf{y}^\ast &= \nabla f\left(\mathbf{x} + \Delta \mathbf{x}^\ast\right) - \nabla f\left(\mathbf{x}\right) \end{align*} \]
\(\mathbf{M}\) will be positive definite.

Now that we understand the theoretical foundations of BFGS quasi-Newton minimisation we shall put together an implementation in the next post.


[1] Wolfe It Down,, 2021.

[2] Wolfe, P., Convergence Conditions for Ascent Methods, SIAM Review, Vol. 11, No. 2, 1969.

[3] Found In Space,, 2021.

[4] Gill, P., Murray, W. and Wright, M., Practical Optimization, Academic Press, 1981.

[5] Are You Definitely Positive, Mr Cholesky?,, 2015.

Leave a comment