Annealing Down

A few years ago[1] we saw how we could search for a local minimum of a function, being a point for which it returns a lesser value than any in its immediate vicinity, by taking random steps and rejecting those that lead uphill; an algorithm that we dubbed the blindfolded hill climber. Whilst we have since seen[2][3] that we could progress towards a minimum much more rapidly by choosing the directions in which to step deterministically, there is a way that we can use random steps to yield better results.

A Spot Of Metallurgy

Metals can be made softer and more flexible by heating them then letting them slowly cool down in a process known as annealing. Physically, the heating adds sufficient energy to the atoms in the metal to allow them to break their crystalline bonds and start moving around and, as they slowly lose that energy during cooling, settle into a more regular crystalline structure. Such structures minimise the energy state of the crystal, and so it's rather tempting to use a mathematical model of the annealing process as a minimisation algorithm.

Simulated annealing[4] does so by treating the elements of the vector argument of a multivariate function as analogues of atoms, their values as their positions and the function itself as a measure of the energy state of the crystal. The algorithm proceeds by randomly perturbing the atoms' positions, measuring their new energy state and then using an analogue of temperature $$t$$ to determine the probability $$p$$ that the system will move from its current position $$\mathbf{x}_0$$ to a new one $$\mathbf{x}_1$$ with
$p = \begin{cases} e^\tfrac{f\left(\mathbf{x}_0\right)-f\left(\mathbf{x}_1\right)}{t} & f\left(x_1\right) > f\left(x_0\right)\\ 1 & f\left(x_1\right) \leqslant f\left(x_0\right) \end{cases}$
known as the Metropolis criterion. Specifically, if $$p$$ is equal to one or if an observation of a standard uniformly distributed random variable is less than $$p$$ then we accept the move. Finally, after each step the temperature is reduced by multiplying it by a cooling rate $$r$$.
Note that when $$t$$ is equal to zero the Metropolis criterion becomes
Figure 1: Rosenbrock's Function
$p = \begin{cases} 0 & f\left(x_1\right) > f\left(x_0\right)\\ 1 & f\left(x_1\right) \leqslant f\left(x_0\right) \end{cases}$
and so simulated annealing will behave in exactly the same way as the blindfolded hill climber.

To demonstrate the progress of this algorithm, we shall again turn to Rosenbrock's function
$f(x, y) = 100 (y - x^2)^2 + (1-x)^2$
which is typically used to test the efficiency of local minimisation algorithms because of its long curved valley that gently slopes down to a minimum value of zero when $$x$$ and $$y$$ are both equal to one, as illustrated by figure 1 in which the greater values of the function are plotted as brighter points.
Program 1 runs one and a half thousand simulated annealing steps on Rosenbrock's function, starting from an initial temperature of two hundred with a cooling rate of ninety nine percent, plotting the current point in green and previous points in red.

Program 1: Minimising Rosenbrock's Function

As you had no doubt expected, the algorithm frequently moves uphill and so it might seem to be, quite literally, a step backward from the blindfolded hill climber, which simply cannot. However, this behaviour allows it to escape local minima and settle upon the global minimum where the function returns its least possible value, at least theoretically. In general it is rather unlikely to do so in a reasonable number of steps but it's not unreasonable to expect it to be able to get close to one of the few best local minima and we can hand over to an efficient local minimisation algorithm to quickly home in on it.

We can demonstrate this by constructing a function with several local minima, known as a multimodal function. Sinusoidal functions are an easy way to add local minima to a function and program 2 does so by adding together a univariate quadratic and a univariate sine to yield
$f(x) = \left(0.9 \times x^2 + 0.1\right) + 0.1 \times \sin \left(8\pi \times x\right)$
plotting the current point in black and the previous points in light grey.

Program 2: Minimising A Multimodal Function

To see how this compares with the behaviour of the blindfolded hill climber, simply set t to zero and run the program again.

ak.annealMinimum

Given the similarities between simulated annealing and the blindfolded hill climber, it should come as little surprise that we shall turn to our implementation of the latter with the ak.blindfoldMinimum function for guidance on implementing the former.

So to begin, we shall implement a function to create the actual minimiser with its control parameters. In this case, that function is ak.annealMinimum, given in listing 1, and the parameters are the function to be minimised f, the cooling rate r, the maximum number of steps and a standard uniform random number generator rnd.

Listing 1: ak.annealMinimum
ak.annealMinimum = function(f, r, steps, rnd) {
if(ak.nativeType(f)!==ak.FUNCTION_T) {
throw new Error('invalid function in ak.annealMinimum');
}

r = ak.nativeType(r)===ak.UNDEFINED_T ? 0.99 : Math.abs(r);
if(!(r<=1)) throw new Error('invalid cooling rate in ak.annealMinimum');

steps = ak.nativeType(steps)===ak.UNDEFINED_T ? 1024
: ak.floor(Math.abs(steps));
if(!isFinite(steps)) {
throw new Error('invalid number of steps in ak.annealMinimum');
}

if(ak.nativeType(rnd)===ak.UNDEFINED_T) {
rnd = Math.random;
}
else if(ak.nativeType(rnd)!==ak.FUNCTION_T) {
throw new Error('invalid random number generator in ak.annealMinimum');
}

return function(x, t, d) {return minimum(f, x, t, d, r, steps, rnd);};
};

Here we are simply checking that the parameters are valid and, if any of them other than f are undefined, giving them default values before returning a function that takes a starting point x, an initial temperature t and a step length d which it forwards, together with the parameters, to the helper function minimum given in listing 2.

Listing 2: The minimum Helper Function
function minimum(f, x, t, d, r, steps, rnd) {
t = Math.abs(t);
if(isNaN(t)) throw new Error('invalid temperature in ak.annealMinimum');

if(ak.nativeType(d)===ak.UNDEFINED_T) d = (1+ak.abs(x))/64;

return ak.nativeType(x)===ak.NUMBER_T ? uniMinimum(f, x, t, d, r, steps, rnd)
: multiMinimum(f, x, t, d, r, steps, rnd);
}

This just ensures that the initial temperature is a non-negative number, provides a default value for the step length if it's undefined and switches between minimisers for univariate and multivariate functions depending upon whether the starting point is a number or not.

This rather trivial housekeeping out of the way, we're finally ready to get down to some real work with the uniMinimum implementation of the univariate simulated annealing algorithm, given in listing 3.

Listing 3: The uniMinimum Helper Function
function uniMinimum(f, x0, t, d, r, steps, rnd) {
var step = 0;
var x1, xs, f0, f1, fs;

if(ak.nativeType(x0)!==ak.NUMBER_T || !isFinite(x0)) {
throw new Error('invalid argument in ak.annealMinimum');
}

if(ak.nativeType(d)!==ak.NUMBER_T || !isFinite(d)) {
throw new Error('invalid delta in ak.annealMinimum');
}

if(isNaN(f0 = f(x0))) f0 = ak.INFINITY;

xs = x0;
fs = f0;

while(step++<steps) {
x1 = x0 + d*(2*rnd()-1);
if(isNaN(f1 = f(x1))) f1 = ak.INFINITY;

if(f1<=fs) {
x0 = xs = x1;
f0 = fs = f1;
}
else if(f1<=f0 || rnd()<Math.exp((f0-f1)/t)) {
x0 = x1;
f0 = f1;
}
t *= r;
}
if(!isFinite(fs)) fs = f(xs);
return !isNaN(fs) ? xs : ak.NaN;
}

We begin very much as we did in the equivalent helper function for ak.blinfoldMinimum did, by checking that the argument and the step length are finite numbers. We're also replacing any NaN values returned by the function with infinity so that the algorithm will always move away from the points at which they were generated.
An important difference is that, since we allow steps uphill, we might conceivably step away from the best point that we've visited and eventually settle upon a worse one; something that you will no doubt have noticed occasionally happened in program 2. It therefore pays to keep track of the best point visited during the run and the value of the function at that point, which we're doing here with xs and fs. Note that it is trivially the case that if the new point x1 is better than the best point that we've seen so far it must be better than the current one x0, so we move to the new point at the same time as we update the best one without applying the Metropolis criterion.
Finally, since we've been replacing NaN results of the function with infinity throughout, we re-evaluate it at xs if the best result is infinite to see whether it was really NaN, in which case we should return NaN from the minimiser.

The multiMinimum helper function operates in much the same fashion, as shown by listing 4.

Listing 4: The multiMinimum Helper Function
function multiMinimum(f, x, t, d, r, steps, rnd) {
var step = 0;
var n, i, x0, x1, xs, f0, f1, fs;

if(ak.type(x)!==ak.VECTOR_T || x.dims()===0) {
throw new Error('invalid argument in ak.annealMinimum');
}

n = x.dims();
for(i=0;i<n && isFinite(x.at(i));++i);
if(i<n) throw new Error('invalid argument in ak.annealMinimum');

if(ak.nativeType(d)===ak.NUMBER_T) {
d = ak.vector(n, d);
}
else if(ak.type(d)!==ak.VECTOR_T || d.dims()!==n) {
throw new Error('invalid delta in ak.annealMinimum');
}

for(i=0;i<n && isFinite(d.at(i));++i);
if(i<n) throw new Error('invalid delta in ak.annealMinimum');

d = d.toArray();
x0 = x.toArray();
x1 = new Array(n);

if(isNaN(f0 = f(x))) f0 = ak.INFINITY;

xs = x;
fs = f0;

while(step++<steps) {
for(i=0;i<n;++i) x1[i] = x0[i] + d[i]*(2*rnd()-1);
x = ak.vector(x1);
if(isNaN(f1 = f(x))) f1 = ak.INFINITY;

if(f1<=fs) {
x0 = (xs = x).toArray();
f0 = fs = f1;
}
else if(f1<=f0 || rnd()<Math.exp((f0-f1)/t)) {
x0 = x.toArray();
f0 = f1;
}
t *= r;
}
if(!isFinite(fs)) fs = f(xs);
return !isNaN(fs) ? xs : ak.vector(n, ak.NaN);
}

The only significant difference between this and uniMinimum is that the points are vectors rather than numbers and so we must take steps from each of their elements in turn, which we're making a little simpler by converting the step length and each current point to arrays. That and we must check that the argument and step length are vectors with the correct number of finite elements, converting the latter to an appropriately sized vector with equally valued elements first if it was instead a number.

Program 3 shows how to use ak.annealMinimum to minimise the multimodal function from program 2 using the same settings, marking its starting and ending points with a red and green cross respectively.

Program 3: Using ak.annealMinimum

Clearly returning the best, rather than the last, visited point has improved the likelihood that the ak.annealMinimum implementation of the algorithm, defined in AnnealMinimum.js, will identify a point close to the global minimum!

Repeated Annealing

If the temperature is reduced too quickly then simulated annealing's performance will be compromised, as you will see if you run program 4 a number of times, in which the cooling rate is set at ninety five percent.

Program 4: Simulated Quenching

Unfortunately, it's not entirely obvious what too quickly actually means for any given function, although I think that it's safe to say that ninety five percent would almost always qualify. We can, however, use this aggressive cooling rate to illustrate a fairly simple scheme that can often improve the algorithm's performance.

Figure 2: Cooling And Heating
If the annealing process has led us from the starting point to a local minimum some distance from the global minimum, then its not unreasonable to suppose that another round might lead us from it to another that is closer to that global minimum. On the assumption that we are at least a bit closer to it now than we were before, and so the task of minimisation will be slightly easier, we can do so with a lower initial temperature and a shorter step length.
If we do this repeatedly then we create a cycle of cooling and heating, as illustrated by the solid line in figure 2 compared to the dashed line showing the change in temperature from program 4.

Program 5 demonstrates the effect of repeated annealing with ten rounds of fifty steps each, instead of one round of five hundred, reducing the initial temperature and step length to ninety percent of their current values after each of them.

Program 5: Repeated Annealing

Whilst this is certainly an improvement, we now have to choose an appropriate value for the cooling rate during a round of annealing and one for the rate between them.

Choosing The Settings

Unfortunately, the effectiveness of simulated annealing doesn't only depend upon the cooling rates; poor choices for the initial temperature, the step length and its rate of reduction, the number of steps per round and the number of rounds can all also significantly impact it. Even more distressing is the fact that there is no rigorous method for choosing them so that the algorithm will get close to the global minimum within a reasonable number of evaluations of the function that we are trying to minimise.
Rather than pull them from thin air, however, we can at least derive them from some more slightly tangible indicators of the algorithm's behaviour.

First of all, we need to specify how many function evaluations we can afford to make and how many rounds we want to run, which we shall call $$n^\prime$$ and $$k$$ respectively. Given these, the number of steps per round will be
$n = \bigg\lfloor\frac{n^\prime}{k}\bigg\rfloor$
where the odd shaped brackets stand for the largest integer less than or equal to the value that they enclose.
Secondly, we need to choose how far we can step away from the initial value of the starting point in the most extreme possible cases in the first and last rounds, $$D_0$$ and $$D_{k-1}$$. From this we we can compute the initial step length with
$d_0 = \frac{D_0}{n}$
and the rate at which we shrink the step length after each round with
$r^\prime_d = \left(\frac{D_{k-1}}{D_0}\right)^\tfrac{1}{k-1}$
Note that for multivariate functions we should do this separately for each element of their vector arguments.
Thirdly, we need to decide how much of an increase in the value of the function we would accept with fifty percent probability at the start of the first and last rounds, $$\Delta_0$$ and $$\Delta_{k-1}$$, from which we can derive the initial temperature $$t_0$$ with
\begin{align*} e^{-\tfrac{\Delta_0}{t_0}} &= \tfrac12\\ -\frac{\Delta_0}{t_0} &= -\ln 2\\ t_0 &= \frac{\Delta_0}{\ln 2} \end{align*}
and the cooling rate between rounds with
$r^\prime_t = \left(\frac{\Delta_{k-1}}{\Delta_0}\right)^\tfrac{1}{k-1}$
Finally, we must decide upon the increase that we would accept with fifty percent probability on the very last step of the minimisation, $$\Delta_\infty$$, with which we can calculate the cooling rate within each round by
$r_t = \left(\frac{\Delta_\infty}{\Delta_{k-1}}\right)^\tfrac{1}{n-1}$
Now I must admit that whilst I'm personally more comfortable thinking in terms of these properties of the algorithm rather than the more abstract concepts of temperature, cooling rates and such, I still have no idea how to choose them so that it will find a near globally minimal point in a reasonable number of steps.

References

[1] It's All Downhill From Here, www.thusspakeak.com, 2014.

[2] The Tripods Are Coming!, www.thusspakeak.com, 2015.

[3] The Tripods Are Here!, www.thusspakeak.com, 2015.

[4] Kirkpatrick, S., Gelatt Jr, C. & Vecchi, M., Optimization by Simulated Annealing, Science, Vol. 220, No. 4598, pp. 671-680, 1983.