# Found In Space

Some time ago[1] we saw how Newton's method used the derivative of a univariate scalar valued function to guide the search for an argument at which it took a specific value. A related problem is finding a vector at which a multivariate vector valued function takes one, or at least comes as close as possible to it. In particular, we should often like to fit an arbitrary parametrically defined scalar valued functional form to a set of points with possibly noisy values, much as we did using linear regression[2] to find the best fitting weighted sum of a given set of functions, and in this post we shall see how we can generalise Newton's method to solve such problems.

For a multivariate vector valued function $$\mathbf{f}$$ and a target vector value $$\mathbf{y}$$ we can express the inversion problem as the search for a vector $$\mathbf{x}^\ast$$ that minimises $$|\mathbf{f}(\mathbf{x}) - \mathbf{y}|$$, where the vertical bars represent the length of the vector between them, being the distance between $$\mathbf{f}(\mathbf{x})$$ and $$\mathbf{y}$$. Equivalently, we can seek to minimise the function
$g(\mathbf{x}) = \frac12 \times \sum_i \left(f_i(\mathbf{x}) - y_i\right)^2$
which is equal to half of the square of their distance.

To see how this relates to regression, suppose that we have a set of points $$\mathbf{x}_i$$ and associated scalar values $$y_i$$ that we wish to approximate with a function $$f_\mathbf{p}$$, where $$\mathbf{p}$$ is a vector of parameters which, much like the coefficients of a polynomial, control its particular shape within the constraints of its general form. The best set of parameters $$\mathbf{p}^\ast$$ minimises the sum of squared errors
$\sum_i \left(f_\mathbf{p}(\mathbf{x}_i) - y_i\right)^2$
which we can express as
$\sum_i \left(f_i(\mathbf{p}) - y_i\right)^2$
by defining
$f_i(\mathbf{p}) = f_\mathbf{p}(\mathbf{x}_i)$

### The Gauss-Newton Method

Given an initial point $$\mathbf{x}$$, we can find $$\mathbf{x}^\ast$$ with a step $$\boldsymbol{\delta}$$ such that $$g(\mathbf{x} + \boldsymbol{\delta})$$ is minimised. If we approximate the result with the second order multivariate Taylor series
$\hat{g}(\mathbf{x}+\boldsymbol{\delta}) = g(\mathbf{x}) + \sum_j \delta_j \times \frac{\partial g}{\partial x_j}(\mathbf{x}) + \frac12 \times \sum_j \sum_k \delta_j \times \delta_k \times \frac{\partial^2 g}{\partial x_j \partial x_k}(\mathbf{x})$
where the fractions with $$\partial$$ symbols are partial derivatives, then its minimum with respect to $$\boldsymbol{\delta}$$ should yield a step that moves closer to $$\mathbf{x}^\ast$$. That minimum is given by a $$\boldsymbol{\delta}$$ satisfying the matrix-vector equation
$\left(\mathbf{Q} + \mathbf{J}^\mathrm{T} \times \mathbf{J}\right) \times \boldsymbol{\delta} = -\mathbf{J}^\mathrm{T} \times (\mathbf{f}(\mathbf{x}) - \mathbf{y})$
where the $$\mathrm{T}$$ superscript represents the transpose of a matrix, in which the rows and columns are swapped, $$\mathbf{Q}$$ equals
$\sum_i \mathbf{H}^i \times \left(f_i(\mathbf{x}) - y_i\right)$
$$\mathbf{J}$$ is the Jacobian matrix of $$\mathbf{f}$$ having the elements
$J_{i,j} = \frac{\partial f_i}{\partial x_j}(\mathbf{x})$
and $$\mathbf{H}^i$$ is the Hessian matrix of $$f_i$$ with elements
$H^i_{j,k} = \frac{\partial^2 f_i}{\partial x_j \partial x_k}(\mathbf{x})$
as proven in derivation 1.

Derivation 1: The Minimising Step
Firstly, we have
$\frac{\partial \hat{g}}{\partial \delta_k}(\mathbf{x}+\boldsymbol{\delta}) = \frac{\partial g}{\partial x_k}(\mathbf{x}) + \sum_j \delta_j \times \frac{\partial^2 g}{\partial x_j \partial x_k}(\mathbf{x})$
and
\begin{align*} \frac{\partial g}{\partial x_k}(\mathbf{x}) &= \sum_i \frac{\partial f_i}{\partial x_k}(\mathbf{x}) \times \left(f_i(\mathbf{x}) - y_i\right)\\ \frac{\partial^2 g}{\partial x_j \partial x_k}(\mathbf{x}) &= \sum_i \frac{\partial^2 f_i}{\partial x_j \partial x_k}(\mathbf{x}) \times \left(f_i(\mathbf{x}) - y_i\right) + \frac{\partial f_i}{\partial x_k}(\mathbf{x}) \times \frac{\partial f_i}{\partial x_j}(\mathbf{x}) \end{align*}
To find the $$\boldsymbol{\delta}$$ that minimises $$\hat{g}$$ we require its derivatives to equal zero, so that
$\sum_i \frac{\partial f_i}{\partial x_k}(\mathbf{x}) \times \left(f_i(\mathbf{x}) - y_i\right) + \sum_j \delta_j \times \sum_i \frac{\partial^2 f_i}{\partial x_j \partial x_k}(\mathbf{x}) \times \left(f_i(\mathbf{x}) - y_i\right) + \frac{\partial f_i}{\partial x_k}(\mathbf{x}) \times \frac{\partial f_i}{\partial x_j}(\mathbf{x}) = 0$
Defining
\begin{align*} \mathbf{a} &= \mathbf{J}^T \times (\mathbf{f}(\mathbf{x}) - \mathbf{y})\\ \mathbf{b} &= \left(\mathbf{Q} + \mathbf{J}^\mathrm{T} \times \mathbf{J}\right) \times \boldsymbol{\delta} \end{align*}
we have
\begin{align*} a_k &= \sum_i J^\mathrm{T}_{k,i} \times \left(f_i(\mathbf{x}) - y_i\right) = \sum_i J_{i,k} \times \left(f_i(\mathbf{x}) - y_i\right)\\ b_k &= \sum_j \left(\left(Q_{k,j} + \sum_i J^\mathrm{T}_{k,i} \times J_{i,j}\right)\right) \times \delta_j = \sum_j \delta_j \times \sum_i H^i_{k,j} \times \left(f_i(\mathbf{x}) - y_i\right) + J_{i,k} \times J_{i,j} \end{align*}
and so $$\boldsymbol{\delta}$$ must satisfy
$\mathbf{b} = -\mathbf{a}$

If the second derivatives of the elements of $$\mathbf{f}$$ are small or if $$\mathbf{x}$$ is close to $$\mathbf{x}^\ast$$, so that $$\mathbf{f}(\mathbf{x})$$ is close to $$\mathbf{y}$$, then $$\mathbf{Q}$$ will be small and we can reasonably approximate the matrix-vector equation by discarding it with
$\mathbf{J}^\mathrm{T} \times \mathbf{J} \times \boldsymbol{\delta} = -\mathbf{J}^\mathrm{T} \times (\mathbf{f}(\mathbf{x}) - \mathbf{y})$
so that we only need to calculate the first derivatives, yielding
$\boldsymbol{\delta} = -\left(\mathbf{J}^\mathrm{T} \times \mathbf{J}\right)^{-1} \times \mathbf{J}^\mathrm{T} \times (\mathbf{f}(\mathbf{x}) - \mathbf{y})$
Rather than take a step of $$\boldsymbol{\delta}$$ we would perform a univariate search for a minimum in its direction, known as the Gauss-Newton direction. The Gauss-Newton method[3] iteratively repeats this process with
\begin{align*} \boldsymbol{\delta}_i &= -\left(\mathbf{J}_i^\mathrm{T} \times \mathbf{J}_i\right)^{-1} \times \mathbf{J}_i^\mathrm{T} \times (\mathbf{f}\left(\mathbf{x_i}\right) - \mathbf{y})\\ \mathbf{x}_{i+1} &= \mathbf{x}_i + \lambda_i^\ast \times \boldsymbol{\delta}_{i} \end{align*}
where $$\lambda_i^\ast$$ is the value that minimises $$\mathbf{f}$$ in the direction of $$\boldsymbol{\delta}_i$$ from $$\mathbf{x}_i$$, until the difference between $$\mathbf{f}\left(\mathbf{x}_i\right)$$ and $$\mathbf{y}$$, or the magnitude of step, is sufficiently small.

Recall that
$\left(\mathbf{J}^\mathrm{T} \times \mathbf{J}\right)^{-1} \times \mathbf{J}^\mathrm{T}$
is the left inverse of $$\mathbf{J}$$, satisfying
$\left(\left(\mathbf{J}^\mathrm{T} \times \mathbf{J}\right)^{-1} \times \mathbf{J}^\mathrm{T}\right) \times \mathbf{J} = \left(\mathbf{J}^\mathrm{T} \times \mathbf{J}\right)^{-1} \times \left(\mathbf{J}^\mathrm{T} \times \mathbf{J}\right) = \mathbf{I}$
Unfortunately, it's quite possible that $$\mathbf{J}^\mathrm{T} \times \mathbf{J}$$ will be singular, or nearly singular, in which case our ak.leftInv will be subject to significant numerical errors. We can go some way to dealing with such matrices by using the ak.stableInv of our ak.jacobiDecomposition or ak.spectralDecomposition implementations[4][5] of the eigensystem which ignores those directions along which the matrix is close to singular.

### The Levenberg-Marquardt Method

Instead of simply removing $$\mathbf{Q}$$ we can replace it with a non-negative multiple of the identity matrix
$\boldsymbol{\delta} = - \left(\lambda \times \mathbf{I} + \mathbf{J}^\mathrm{T} \times \mathbf{J}\right)^{-1} \times \mathbf{J}^\mathrm{T} \times (\mathbf{f}(\mathbf{x}) - \mathbf{y})$
to yield the Levenberg-Marquardt direction $$\boldsymbol{\delta}$$. The iterative process
\begin{align*} \boldsymbol{\delta}_i &= -\left(\lambda_i \times \mathbf{I} + \mathbf{J}_i^\mathrm{T} \times \mathbf{J}_i\right)^{-1} \times \mathbf{J}_i^\mathrm{T} \times (\mathbf{f}\left(\mathbf{x_i}\right) - \mathbf{y})\\ \mathbf{x}_{i+1} &= \mathbf{x}_i + \boldsymbol{\delta}_{i} \end{align*}
is known as the Levenberg-Marquardt method and replaces the univariate search for a minimum with adjustments to $$\lambda_i$$. Specifically, noting that
$\lim_{\lambda \to \infty}\boldsymbol{\delta} = -\frac{1}{\lambda} \times \mathbf{J}^\mathrm{T} \times (\mathbf{f}\left(\mathbf{x}\right) - \mathbf{y})$
which is an infinitesimal step along the line of steepest descent, $$\lambda_i$$ blends the Gauss-Newton direction with it and controls the step length. When $$\mathbf{x}$$ is distant from $$\mathbf{x}^\ast$$ we should expect gradient descent to be better than Gauss-Newton and when it's close we should expect the opposite to be true. We should consequently start with a relatively large value of $$\lambda$$ and reduce it as we near the target.
A simple scheme is to choose values for $$\lambda_0$$, a rate of decrease $$\rho_-$$ for when progress is being made and a rate of increase $$\rho_+$$ for when it is not. Defining
\begin{align*} a_{-1} &= \rho_-\\ a_k &= \rho_+^k \end{align*}
we can iterate from $$k$$ equal to minus one, taking trial steps with $$a_k \times \lambda_i$$ until the difference between the target value and the value of the function decreases, at which point we set $$\lambda_{i+1}$$ to $$a_k \times \lambda_i$$ and $$x_{i+1}$$ to its step's result.

Program 1 uses this approach to search for a solution to a pair of randomly generated simultaneous quadratic equations from a random starting point, marking the target point in yellow, unsuccessful steps in red and successful steps in green, printing the difference from the target value and the magnitude of the step in the console at the bottom and restarting if it converges on a local minimum other than the target.

An example of a parametrically defined scalar valued function and its derivatives with respect to the parameters is
\begin{align*} f(x) &= \frac{1}{n} \times \sum_{i=0}^{n-1} \sin\left(a_i \times x + b_i\right)\\ \frac{\partial f}{\partial a_i}(x) &= \frac{1}{n} \times x \times \cos\left(a_i \times x + b_i\right)\\ \frac{\partial f}{\partial b_i}(x) &= \frac{1}{n} \times \cos\left(a_i \times x + b_i\right) \end{align*}
Program 2 randomly chooses the parameters and a set of values at which to evaluate the function, using the Levenberg-Marquardt method to search for a new set of parameters to fit them, starting from another random choice.

Program 2: Fitting Sinusoids

Note that whilst the fit initially gets progressively better, the search will typically get trapped in a local minimum and therefore fail to pass exactly through the points.

### The Implementation Of The Levenberg-Marquart Method

The ak library's implementation is ak.levenbergInverse, defined in listing 1, which expects the function f, its Jacobian matrix of derivatives df, an initial value for $$\lambda$$ and its rates of change $$\rho_-$$ and $$\rho_+$$, along with an optional convergence threshold and maximum number of steps, defaulting to three quarters of the floating point digits and one thousand respectively.

Listing 1: ak.levenbergInverse
ak.levenbergInverse = function(f, df, lambda, rhom, rhop, threshold, steps) {
if(ak.nativeType(f)!==ak.FUNCTION_T)  {
throw new Error('invalid function in ak.levenbergInverse');
}
if(ak.nativeType(df)!==ak.FUNCTION_T) {
throw new Error('invalid derivative in ak.levenbergInverse');
}

if(ak.nativeType(lambda)!==ak.NUMBER_T || lambda<=0 || !isFinite(lambda)) {
throw new Error('invalid lambda in ak.levenbergInverse');
}
if(ak.nativeType(rhom)!==ak.NUMBER_T || !(rhom>0 && rhom<1)) {
throw new Error('invalid rho- in ak.levenbergInverse');
}
if(ak.nativeType(rhop)!==ak.NUMBER_T || rhop<=1 || !isFinite(rhop)) {
throw new Error('invalid rho+ in ak.levenbergInverse');
}

threshold = ak.nativeType(threshold)===ak.UNDEFINED_T ? Math.pow(ak.EPSILON, 0.75)
: Math.abs(threshold);
if(isNaN(threshold)) {
throw new Error('invalid convergence threshold in ak.levenbergInverse');
}

steps = ak.nativeType(steps)===ak.UNDEFINED_T ? 1000 : ak.floor(Math.abs(steps));
if(isNaN(steps)) throw new Error('invalid maximum steps in ak.levenbergInverse');

return function(y, hint) {
return solve(f, df, y, hint, lambda, rhom, rhop, threshold, steps);
};
};

function solve(f, df, y, x0, lambda, rhom, rhop, threshold, steps) {
if(ak.type(x0)!==ak.VECTOR_T) {
throw new Error('invalid hint in ak.levenbergInverse');
}
if(ak.type(y)===ak.VECTOR_T) {
return vectorSolve(f, df, y, x0, lambda, rhom, rhop, threshold, steps);
}
return scalarSolve(f, df, y, x0, lambda, rhom, rhop, threshold, steps);
}


It returns a function that expects a target value y and a hint for the starting point which defers to the solve function which requires that the latter be an ak.vector object and itself defers to the vectorSolve and scalarSolve functions depending upon whether the former is or not.
The vectorSolve function is given in listing 2 and iteratively calls the vectorStep function until either the normalised difference between the current and target values state.ady or the normalised step length state.adx fall within an averaged threshold.

Listing 2: The vectorSolve Helper Function
function vectorSolve(f, df, y, x0, lambda, rhom, rhop, threshold, steps) {
var y0 = f(x0);
var state = {x:x0, y:y0, lambda:lambda, ady:ak.diff(y0, y)};
var nx = x0.dims();
var ny = y0.dims();
var step = 0;
var a, i;

a = new Array(nx);
for(i=0;i<nx;++i) a[i] = new Array(nx);
state.dftdf = a;

do {
if(step++===steps) throw new Error('failure to converge in ak.levenbergInverse');
vectorStep(f, df, y, state, rhom, rhop, threshold, steps);
}
for(i=0;i<ny && !isNaN(state.y.at(i));++i);
return i===ny ? state.x : ak.vector(nx, ak.NaN);
}


The vectorStep function, defined in listing 3, first caches those values that don't change as $$\lambda$$ is modified during a step of the Levenberg-Marquardt method. Specifically, the Jacobian matrix of derivatives at state.x, expected to be an ak.matrix object, the difference between the current and target values, the product of the transpose of the former and the latter and the square of the former.
Note that our ak.vector type makes no distinction between column and row vectors and so we use the identity
$\mathbf{M}^\mathrm{T} \times \mathbf{x} = \mathbf{x} \times \mathbf{M}$
to calculate the product of the transpose of the Jacobian and the current difference in values.
It then uses the delta function to calculate the step for each value of $$\lambda$$ until the difference is reduced, finally saving the updated values in the state object.

Listing 3: The vectorStep Helper Function
function vectorStep(f, df, y, state, rhom, rhop, threshold, steps) {
var k = 0;

state.df = df(state.x);
state.dy = ak.sub(state.y, y);
state.dftdy = ak.mul(state.dy, state.df);
vectorSquare(state);

lambda = state.lambda*rhom;
if(lambda===0) lambda = state.lambda;
delta(state, lambda, threshold, steps);

x1 = ak.sub(state.x, state.dx);
y1 = f(x1);

if(k===steps) throw new Error('failure to converge in ak.levenbergInverse');

lambda = state.lambda*Math.pow(rhop, k++);
delta(state, lambda, threshold, steps);

x1 = ak.sub(state.x, state.dx);
y1 = f(x1);
}

state.x = x1;
state.y = y1;
state.lambda = lambda;
}


The square of the Jacobian matrix is stored in the state.dftdf array which was initialised in vectorSolve and is calculated by vectorSquare in listing 4.

Listing 4: The vectorSquare Helper Function
function vectorSquare(state) {
var df = state.df;
var dy = state.dy;
var dftdf = state.dftdf;
var nr = df.rows();
var nc = df.cols();
var r, c, s, k;

for(r=0;r<nc;++r) {
for(c=r;c<nc;++c) {
s = 0;
for(k=0;k<nr;++k) s += df.at(k, r) * df.at(k, c);
dftdf[r][c] = dftdf[c][r] = s;
}
}
}


The delta function defined in listing 5 adds $$\lambda \times \mathbf{I}$$ to a copy of the square of the Jacobian matrix and uses the ak.stableDiv overload for ak.spectralDecomposition to calculate $$\boldsymbol{\delta}$$ whilst ignoring directions in which the matrix is within the threshold of being singular.

Listing 5: The delta Helper Function
function delta(state, lambda, threshold, steps) {
var dftdf = state.dftdf;
var n = dftdf.length;
var a = new Array(n);
var i;

for(i=0;i<n;++i) {
a[i] = dftdf[i].slice(0);
a[i][i] += lambda;
}
a = ak.matrix(a);
a = ak.spectralDecomposition(a, threshold, steps);
state.dx = ak.stableDiv(state.dftdy, a, threshold);
}


If the function to be inverted is scalar valued then the Jacobian matrix is equivalent to a vector of its derivatives. We can consequently simplify the implementation by requiring df to return an ak.vector and creating new versions of the helper functions that exploit this, as begun by scalarSolve in listing 6.

Listing 6: The scalarSolve Helper Function
function scalarSolve(f, df, y, x0, lambda, rhom, rhop, threshold, steps) {
var y0 = f(x0);
var state = {x:x0, y:y0, lambda:lambda, ady:ak.diff(y0, y)};
var nx = x0.dims();
var step = 0;
var a, i;

a = new Array(nx);
for(i=0;i<nx;++i) a[i] = new Array(nx);
state.dftdf = a;

do {
if(step++===steps) throw new Error('failure to converge in ak.levenbergInverse');
scalarStep(f, df, y, state, rhom, rhop, threshold, steps);
}
return !isNaN(state.y) ? state.x : ak.vector(nx, ak.NaN);
}


The scalarStep helper function follows the form of vectorStep, although dftdy is now a scalar multiple of the vector of derivatives.

Listing 7: The scalarStep Helper Function
function scalarStep(f, df, y, state, rhom, rhop, threshold, steps) {
var k = 0;

state.df = df(state.x);
state.dy = ak.sub(state.y, y);
state.dftdy = ak.mul(state.dy, state.df);
scalarSquare(state);

lambda = state.lambda*rhom;
if(lambda===0) lambda = state.lambda;
delta(state, lambda, threshold, steps);

x1 = ak.sub(state.x, state.dx);
y1 = f(x1);

if(k===steps) throw new Error('failure to converge in ak.levenbergInverse');

lambda = state.lambda*Math.pow(rhop, k++);
delta(state, lambda, threshold, steps);

x1 = ak.sub(state.x, state.dx);
y1 = f(x1);
}

state.x = x1;
state.y = y1;
state.lambda = lambda;
}


Finally, the scalarSquare function given in listing 8 populates dftdf with the outer product of the derivative vector with itself, defined for a pair of vectors $$\mathbf{x}$$ and $$\mathbf{y}$$ as
\begin{align*} \mathbf{M} &= \mathbf{x} \otimes \mathbf{y}\\ m_{ij} &= x_i \times y_j \end{align*}

Listing 8: The scalarSquare Helper Function
function scalarSquare(state) {
var df = state.df;
var dy = state.dy;
var dftdf = state.dftdf;
var n = df.dims();
var r, c;

for(r=0;r<n;++r) {
for(c=r;c<n;++c) {
dftdf[r][c] = dftdf[c][r] = df.at(r) * df.at(c);
}
}
}


Program 3 demonstrates how to use ak.levenbergInverse, defined in LevenbergInverse.js, by repeating the fitting of the parametrically defined function from program 2, plotting the results of the initial parameters in red and those of the final in black.

Program 3: Using ak.levenbergInverse

As before, we see that the final parameters produce a better fit than those we started with but are still prone to falling into local minima and consequently failing to pass exactly though the sampled points, which is as good an observation to conclude with as any.

### References

[1] On The Shoulders Of Gradients, www.thusspakeak.com, 2014.

[2] Regressive Tendencies, www.thusspakeak.com, 2020.

[3] Gill, P., Murray, W. and Wright, M., Practical Optimization, Academic Press, 1981.

[4] Conquering The Eigen, www.thusspakeak.com, 2014.

[5] The Spectral Apparition, www.thusspakeak.com, 2020.