The Best Laid Schemata


We have seen how we can exploit a simple model of biological evolution, known as a genetic algorithm, to search for global maxima of functions[1], being those points at which they return their greatest values.
This model treated the function being optimised as a non-negative measure of the fitness of individuals to survive and reproduce, replacing negative results with zero. It represented their chromosomes with arrays of bits, mapping them onto the function's arguments by treating subsets of them as integers that were linearly transformed into floating point numbers lying between given lower and upper bounds. It simulated sexual reproduction by splitting the chromosomes of randomly chosen pairs of individuals at a randomly chosen position and swapping their bits from that position to their ends, and mutation by flipping randomly chosen bits from the chromosomes of randomly chosen individuals. Finally, and most crucially, it set the probability that an individual would be copied into the next generation to its fitness as a proportion of the total fitness of the population, ensuring that that total fitness would tend to increase from generation to generation.
I concluded by noting that, whilst the resulting algorithm was reasonably effective, it had some problems that a theoretical analysis would reveal and that is what we shall look into in this post.

The Schema Theorem

A key result in the analysis of the genetic algorithm[2] is based upon the notion of schemata, which are strings of ones, zeros and wildcards such as
\[ \#1\#0\#11\# \]
in which \(\#\) represents a wildcard. Each schema represents the set of chromosomes whose bits are equal to its elements wherever they aren't wildcards. For example
\[ \#1\#0\#11\# = \begin{Bmatrix} 01000110,&01000111,&01001110,&01001111\\ 01100110,&01100111,&01101110,&01101111\\ 11000110,&11000111,&11001110,&11001111\\ 11100110,&11100111,&11101110,&11101111 \end{Bmatrix} \]
The order of a schema \(S\), denoted by \(o(S)\), is the number of its non-wildcard elements, which for our example is
\[ o(\#1\#0\#11\#) = 4 \]
and its defining length \(\delta(S)\) is the distance between its first and last non-wildcard elements
\[ \delta(\#1\#0\#11\#) = 5 \]
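To make these definitions concrete, here is a minimal sketch, not part of the ak library used in these posts, that represents a schema as a string of ones, zeros and hashes and computes its order and defining length, together with a test of whether a chromosome, represented as an array of bits, belongs to its set.

function order(schema) {
 var o = 0;
 var i;
 //count the non-wildcard elements
 for(i=0;i<schema.length;++i) if(schema.charAt(i)!=='#') ++o;
 return o;
}

function definingLength(schema) {
 var first = -1;
 var last = -1;
 var i;
 //find the first and last non-wildcard elements
 for(i=0;i<schema.length;++i) {
  if(schema.charAt(i)!=='#') {
   if(first<0) first = i;
   last = i;
  }
 }
 return last>first ? last-first : 0;
}

function inSchema(bits, schema) {
 var i;
 if(bits.length!==schema.length) return false;
 //the bits must equal the schema's elements wherever they aren't wildcards
 for(i=0;i<schema.length;++i) {
  if(schema.charAt(i)!=='#' && bits[i]!==Number(schema.charAt(i))) return false;
 }
 return true;
}

order('#1#0#11#');                       //4
definingLength('#1#0#11#');              //5
inSchema([0,1,0,0,0,1,1,0], '#1#0#11#'); //true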
We can define the fitness of a schema as the average fitness of the chromosomes in the set that it defines with
\[ f(S) = \frac{\sum_{c \in S} f(c)}{\left|S\right|} \]
where \(\sum\) is the summation sign, \(\in\) means within, \(f\) is the non-negative fitness function and the vertical bars stand for the size of the set enclosed by them.
However, it is unlikely that each chromosome within that set will be represented exactly once in a population \(P\) and so, given that population, its observed fitness will instead be
\[ f_P(S) = \frac{\sum_{j \in \{i : c_i \in S\}} f\left(c_j\right)}{\left|\left\{i : c_i \in S\right\}\right|} \]
where \(c_i\) is the chromosome of the \(i^\mathrm{th}\) member of the population and
\[ \left\{i : c_i \in S\right\} \]
is the set of indices for which they are in the set.
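Assuming the inSchema helper sketched above, and that each member of the population carries its fitness in a fitness property, as in listing 1 below, and its chromosome in a bits property, the latter being a name chosen purely for illustration, a sketch of the observed fitness might be

//the observed fitness of a schema is the average fitness of those members of
//the population whose chromosomes belong to its set
function observedFitness(population, schema) {
 var sum = 0;
 var count = 0;
 var i;

 for(i=0;i<population.length;++i) {
  if(inSchema(population[i].bits, schema)) {
   sum += population[i].fitness;
   ++count;
  }
 }
 return sum/count; //NaN if no members of the population belong to the set
}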
Now the probability that a chromosome will be selected to fill any given slot in the next generation is given by
\[ \frac{f(c)}{f(P) \times |P|} \]
where \(f(P)\) is the average fitness of the population and \(|P|\) is its size. Since the selections for each slot are independent, this means that the chromosome will fill an expected number of slots of
\[ \frac{f(c)}{f(P) \times |P|} \times |P| = \frac{f(c)}{f(P)} \]
and furthermore that the expected number of slots that will be filled by chromosomes that are in the schema's set is given by
\[ \sum_{j \in \{i : c_i \in S\}} \frac{f\left(c_j\right)}{f(P)} = \frac{\sum_{j \in \{i : c_i \in S\}} f\left(c_j\right)}{f(P)} = \frac{f_P(S)}{f(P)} \times \left|\left\{i : c_i \in S\right\}\right| \]
Of course, once these chromosomes have been selected for the next generation they might be subjected to the crossover and mutation operators which could result in their being transformed into ones that are not in its set.
The proportion of them that will be affected by crossover is equal to the proportion of the population as a whole which, since we don't allow any chromosomes to be involved in more than one crossover event per generation, is simply the crossover rate \(r_c\). For each of those that are affected, the probability that the position in the chromosome, or locus, chosen for the crossover lies between the first and last non-wildcard elements of the schema is
\[ \frac{\delta(S)}{l} \]
where \(l\) is the length of the chromosome.
Finally, the probability that any given mutation will affect a particular chromosome is simply
\[ \frac{1}{|P|} \]
since every bit of every chromosome has an equal chance of being chosen. If that chromosome happens to be a member of the schema's set then the chance that the locus of mutation will correspond to one of its non-wildcard elements is
\[ \frac{o(S)}{l} \]
Putting all of this together, if there are \(k\) chromosomes from the schema's set before selection in one generation then we should expect the number before selection in the following generation, \(k^\prime\), to be bounded by
\[ \mathrm{E}\left[k^\prime\right] \geqslant \left(\frac{f_P(S)}{f(P)} \times k\right) \times \left(1 - r_c \times \frac{\delta(S)}{l}\right) \times \left(1 - \frac{1}{|P|} \times \frac{o(S)}{l}\right)^{r_m \times |P| \times l} \]
where \(r_m\) is the mutation rate, which is the proportion of the bits in the population that will be mutated. Note that the reason that this is a bound rather than an equality is that we've ignored the possibility that the crossover and mutation operators might also create one or more chromosomes from the schema's set. Furthermore, since we round down the numbers of crossover and mutation events to integers in the implementation, this overestimates their potential disruptive effect upon the schema.
Now, this can be simplified to a slightly lower, or weaker, bound of
\[ \mathrm{E}\left[k^\prime\right] \geqslant k \times \frac{f_P(S)}{f(P)} \times \left(1 - r_c \times \frac{\delta(S)}{l} - r_m \times o(S)\right) \]
as shown by derivation 1.

Derivation 1: Simplifying The Expectation
From the formulation of Taylor's theorem that explicitly defines the remainder
\[ f(x+\delta) = f(x) + \delta \times f^\prime(x) + \tfrac{1}{2}\delta^2 \times f^{\prime\prime}(x) + \dots + \tfrac{1}{n!} \delta^n \times f^{(n)}(x) + R_n \]
where \(f^\prime(x)\) is the first derivative of \(f\), \(f^{\prime\prime}(x)\) is the second and \(f^{(n)}(x)\) is the \(n^\mathrm{th}\), the exclamation mark stands for the factorial and
\[ R_n = \tfrac{1}{(n+1)!} \delta^{n+1} \times f^{(n+1)}(x + \theta \times \delta) \]
for some \(\theta\) between zero and one, we have
\[ \begin{align*} f(x) &= x^c\\ f(1-\delta) &= 1 - \delta \times c + \tfrac12 \delta^2 \times c \times (c-1) \times (1-\theta \times \delta)^{c-2} \end{align*} \]
If \(\delta\) is less than or equal to one and \(c\) is greater than or equal to one then the final term must be non-negative and so this implies that
\[ \begin{align*} f(1-\delta) \geqslant 1 - \delta \times c \end{align*} \]
Applying this to our estimate of the disruptive effect of mutation yields
\[ \left(1 - \frac{1}{|P|} \times \frac{o(S)}{l}\right)^{r_m \times |P| \times l} \geqslant 1 - r_m \times o(S) \]
when
\[ r_m \times |P| \times l \geqslant 1 \]
since \(o(S)\) cannot be greater than \(l\) and \(|P|\) must be greater than zero and so
\[ \frac{1}{|P|} \times \frac{o(S)}{l} \leqslant 1 \]
as required. Note that since we round down the number of mutations, for
\[ r_m \times |P| \times l < 1 \]
there won't be any and so we should leave the expected number of chromosomes from the schema's set as it was by multiplying it by one, trivially satisfying the inequality with
\[ 1 \geqslant 1 - r_m \times o(S) \]
Finally, for non-negative \(a\) and \(b\) we have
\[ (1-a) \times (1-b) = 1-a-b + a \times b \geqslant 1-a-b \]
from which the result follows.

This is known as the schema theorem or, rather grandiosely, as the fundamental theorem of genetic algorithms.
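Expressed as code, the simplified bound is nothing more than a direct translation of the formula; the following is simply a sketch of it rather than anything from the ak library.

//lower bound on the expected number of members of a schema's set after
//selection, crossover and mutation, given k members beforehand, the schema's
//observed fitness fS, the population's average fitness fP, the schema's
//defining length dS and order oS, the chromosome length l and the crossover
//and mutation rates rc and rm
function schemaBound(k, fS, fP, dS, oS, l, rc, rm) {
 return k * (fS/fP) * (1 - rc*dS/l - rm*oS);
}

//for example, ten members of a schema with observed fitness 0.6 in a
//population with average fitness 0.5
schemaBound(10, 0.6, 0.5, 5, 4, 8, 0.7, 0.01); //approximately 6.27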

We can demonstrate it by rewriting the generation function from the last post so that it applies the selection operator before the crossover and mutation operators, as shown by listing 1.

Listing 1: Reordering The Operators
function generation(f, population, genes, bits, lb, ub, cross, mutate) {
 var n = population.length;
 var next = new Array(n);
 var c = ak.floor(cross*n);           //number of chromosomes involved in crossover
 var m = ak.floor(mutate*genes*bits); //number of mutation events
 var fitness = 0;
 var i;

 //total fitness of the current population
 for(i=0;i<n;++i)  fitness += population[i].fitness;

 //select the next generation, then apply the crossover and mutation operators to it
 for(i=0;i<n;++i)  next[i] = selection(population, fitness);
 for(i=1;i<c;i+=2) crossover(next[i-1], next[i]);
 for(i=0;i<m;++i)  mutation(next[ak.floor(Math.random()*n)]);

 //map the chromosomes onto the arguments of f and evaluate the members' fitnesses
 for(i=0;i<n;++i)  evaluation(f, next[i], genes, bits, lb, ub);
 return next;
};

Program 1 applies it to the univariate fitness function
\[ f(x) = x \]
for \(x\) between zero and one, printing at each step the number of chromosomes in the current generation that belong to a randomly chosen schema's set, together with the lower bound that the schema theorem places upon the expected number in the next generation.

Program 1: The Schema Theorem In Action

If you run it several times you'll find that the expected lower bound on the number of chromosomes from the schema's set is consistently exceeded in the following generation.
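Under the same illustrative assumptions as before, the check that program 1 makes at each step can be sketched, using the helpers given above, as something like

//count the members of the population whose chromosomes belong to the schema's set
function schemaCount(population, schema) {
 var count = 0;
 var i;
 for(i=0;i<population.length;++i) if(inSchema(population[i].bits, schema)) ++count;
 return count;
}

//compare the schema theorem's lower bound, computed from one generation, with
//the number of members of the schema's set observed in the next
function checkSchemaTheorem(population, next, schema, rc, rm) {
 var fP = 0;
 var i;
 for(i=0;i<population.length;++i) fP += population[i].fitness;
 fP /= population.length;

 var k = schemaCount(population, schema);
 var fS = observedFitness(population, schema);
 var bound = schemaBound(k, fS, fP, definingLength(schema), order(schema), schema.length, rc, rm);

 return {bound: bound, observed: schemaCount(next, schema)};
}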

The Building Block Hypothesis

Whilst the schema theorem is undoubtedly true, it's not entirely obvious how it might account for the behaviour of the algorithm. One conclusion that we can draw is that schemata with low defining lengths and orders, known as building blocks, are unlikely to be affected by the crossover and mutation operators. We should therefore expect those whose observed fitness is sufficiently greater than the population's average to have an ever increasing number of representative individuals from generation to generation, as confirmed by program 2.

Program 2: A Building Block

The building block hypothesis proposes that the algorithm operates by combining, testing and selecting for the fittest building blocks, from which an optimal or near optimal individual's chromosome will ultimately be constructed.
Each individual's chromosome is a representative of \(2^l\) schemata since, at each locus, it matches those that have either the same bit or a wildcard. Similarly, it is a representative of \((l-\delta) \times 2^{\delta-1}\) schemata of any given defining length \(\delta\) of at least one, since their first non-wildcard elements can lie at \(l-\delta\) loci and each of the \(\delta-1\) elements between their first and last non-wildcard elements can be either the matching bit or a wildcard.
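For example, a chromosome of length eight is a representative of
\[ 2^8 = 256 \]
schemata in total, of which
\[ (8-3) \times 2^{3-1} = 5 \times 4 = 20 \]
have a defining length of three.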
The upshot of this is that the genetic algorithm can process many more building blocks per generation than the number of chromosomes at a worst-case cost of evaluating each of those chromosomes just once, a feature known as its implicit parallelism.

The Problem With Crossover

One problem highlighted by the schema theorem is that low order, high defining length schemata can have a smaller lower bound on their expected number of chromosomes in the next generation than low order, low defining length schemata even if they have a much larger observed fitness. This is rather unfortunate given that the mapping from chromosomes to parameters of the fitness function is essentially arbitrary.
Fortunately, there are some simple variants of the crossover operator that can fix this problem.

The first thing that we can do is choose two crossover points, rather than one, swapping bits between a pair of chromosomes from the first up to and including the second. This means that both ends of the chromosomes can be exchanged intact and so we can have building blocks that wrap around them, such as
\[ 11\#\#\#\#\#0 \]
which now has an effective defining length of two instead of seven.
Now it takes no great leap of imagination to think that if two crossover points are better than one then perhaps even more will be better still, leading to \(n\) point crossover which switches between not swapping and swapping bits as each of \(n\) randomly chosen loci is reached.
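A sketch of \(n\) point crossover applied to a pair of chromosomes represented as arrays of bits, purely illustrative rather than the implementation that we shall adopt, might look like

//n point crossover: choose n random loci, sort them and toggle between not
//swapping and swapping the chromosomes' bits as each one is reached; here the
//bits from one chosen locus are swapped up to, but not including, the next
function nPointCrossover(lhs, rhs, n) {
 var l = lhs.length;
 var loci = new Array(n);
 var swapping = false;
 var i, j, t;

 for(i=0;i<n;++i) loci[i] = Math.floor(Math.random()*l);
 loci.sort(function(a, b){return a-b;});

 for(i=0,j=0;i<l;++i) {
  //a locus chosen more than once toggles the swapping mode more than once
  while(j<n && loci[j]===i) {swapping = !swapping; ++j;}
  if(swapping) {t = lhs[i]; lhs[i] = rhs[i]; rhs[i] = t;}
 }
}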
Furthermore, if we switch the swapping mode multiple times at loci that were chosen more than once, as the number of crossover points increases well beyond the length of the chromosome this tends towards uniform crossover, which swaps the bits at each locus with a probability of one half. Since this will disrupt a schema if it swaps bits corresponding to at least one, but not all, of its non-wildcard elements, the probability that it will do so is given by
\[ \frac{2^{o(S)}-2}{2^{o(S)}} = 1 - \left(\tfrac12\right)^{o(S)-1} \]
since there are \(2^{o(S)}\) possible outcomes, of which one is that no bits are swapped and a second is that they all are. This means that the schema theorem for uniform crossover is
\[ \mathrm{E}\left[k^\prime\right] \geqslant k \times \frac{f_P(S)}{f(P)} \times \left(1 - r_c \times \left(1 - \left(\tfrac12\right)^{o(S)-1}\right) - r_m \times o(S)\right) \]
and its building blocks are therefore simply schemata with low orders.
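A sketch of uniform crossover itself, again operating upon chromosomes represented as arrays of bits, is simpler still.

//uniform crossover: swap the chromosomes' bits at each locus with a
//probability of one half
function uniformCrossover(lhs, rhs) {
 var i, t;
 for(i=0;i<lhs.length;++i) {
  if(Math.random()<0.5) {t = lhs[i]; lhs[i] = rhs[i]; rhs[i] = t;}
 }
}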

The Problem With Selection

The problem with the selection operator is that it is extremely brittle; not only can it not handle negative fitnesses but it will also be seriously impacted by a constant shift in the fitness function. For example, if an individual has a fitness of two and the population has an average fitness of one, then we should expect it to be selected
\[ \frac{f(c)}{f(P)} = \frac{2}{1} = 2 \]
times for the next generation. However, if we were to instead use the fitness function
\[ f^\prime(c) = f(c)+99 \]
then we should expect it to have
\[ \frac{f^\prime(c)}{f^\prime(P)} = \frac{101}{100} = 1.01 \]
offspring.
This is clearly unacceptable but, once again, there are variants of the operator that fix the problem.

We can compensate for constant offsets and multipliers in the fitness function by linearly transforming it with
\[ f^\prime(c) = 1 + \frac{f(c)-f^-}{f^+-f^-} \times \left(f^\ast - 1\right) \]
where \(f^-\) is the fitness of the worst individual in the population and \(f^+\) is that of the best. This transforms the former to one and the latter to \(f^\ast\), so that we should expect the best individual to have \(f^\ast\) times as many offspring as the worst, enabling us to explicitly control how strongly the selection operator favours fitter individuals. Note that this also fixes the problem that the selection operator has with negative fitnesses since they will be positive after the transformation.
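A sketch of this rescaling, in which fMin and fMax are the fitnesses of the worst and best individuals in the current population and fStar is the desired ratio between the expected numbers of offspring of the best and the worst, might be

//linearly transform a fitness so that the worst individual in the population
//maps to one and the best to fStar
function scaledFitness(f, fMin, fMax, fStar) {
 if(fMax===fMin) return 1; //every individual is equally fit
 return 1 + (f-fMin)/(fMax-fMin)*(fStar-1);
}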
A simpler approach is tournament selection which picks two individuals at random and copies the better of them into the next generation. This has the distinct advantage of not depending upon the specific values of the fitnesses of those individuals but only upon how they rank against each other.
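A sketch of tournament selection, assuming once again that each member of the population has a fitness property and leaving it to the caller to copy the winner into the next generation, might be

//tournament selection: pick two members of the population at random and
//return whichever has the greater fitness
function tournamentSelection(population) {
 var n = population.length;
 var first  = population[Math.floor(Math.random()*n)];
 var second = population[Math.floor(Math.random()*n)];
 return first.fitness>=second.fitness ? first : second;
}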

The Problem With Mutation

The fitness function typically splits the chromosome into a number of binary integer genes \(g_i\) having \(n_i\) bits that are converted into the arguments of the function being optimised with
\[ x_i = x^\mathrm{lo}_i \times \left(1-\frac{g_i}{2^{n_i}}\right) + x^\mathrm{hi}_i \times \frac{g_i}{2^{n_i}} \]
where \(x^\mathrm{lo}_i\) and \(x^\mathrm{hi}_i\) are the lower and upper bounds of those arguments.
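As a sketch, and using plain JavaScript rather than the ak library, this mapping is simply

//map an n bit integer gene onto an argument lying between the bounds lo and hi
function decodeGene(g, n, lo, hi) {
 var x = g/Math.pow(2, n);
 return lo*(1-x) + hi*x;
}

decodeGene(7, 4, 0, 1); //0.4375
decodeGene(8, 4, 0, 1); //0.5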
The biggest problem facing the mutation operator is the fact that it may be necessary to flip several bits to make a small change to one of the arguments, a phenomenon known as the Hamming cliff. For example, changing the value of a four bit gene from seven to eight requires all four of its bits to be flipped.
\[ \begin{align*} 0111 &\to 7\\ 1000 &\to 8 \end{align*} \]
This means that it can be extremely unlikely that the mutation operator will transform arguments that are close to an optimum into ones that are closer.
We can fix this by using an alternative encoding for the binary genes, known as the Gray code[3]. This shuffles the order in which sequences of bits are translated to integers so that successive values differ at only one position. For example
\[ \begin{align*} 000 &\to 0\\ 001 &\to 1\\ 011 &\to 2\\ 010 &\to 3\\ 110 &\to 4\\ 111 &\to 5\\ 101 &\to 6\\ 100 &\to 7 \end{align*} \]
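Note that we can compute the Gray code of a non-negative integer by exclusive-oring it with itself shifted one bit to the right, as sketched below, although we shan't actually need to do so here.

//convert a non-negative integer to its Gray code by exclusive-oring it with
//itself shifted one bit to the right, undoing JavaScript's conversion of the
//result to a 32 bit signed integer if necessary
function toGray(b) {
 var g = b ^ (b>>>1);
 if(g<0) g += 0x100000000;
 return g;
}

toGray(7); //4, whose bits 100 are the Gray code of seven in the table above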
To convert from Gray code to binary we simply exclusive-or each bit with all of those to its left and, if we treat a positive integer u as purely a sequence of bits, we can do this with

var mask;
for(mask=u>>>1;mask!==0;mask>>>=1) u ^= mask;
if(u<0) u += 0x100000000;

taking care to undo JavaScript's conversion of the results of bitwise operators to 32 bit signed integers, as demonstrated by program 3.

Program 3: The Gray Code
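For reference, the loop above wrapped up as a function and checked against the table looks something like

//convert a Gray code to the non-negative integer that it represents
function fromGray(u) {
 var mask;
 for(mask=u>>>1;mask!==0;mask>>>=1) u ^= mask;
 if(u<0) u += 0x100000000;
 return u;
}

fromGray(parseInt('110', 2)); //4
fromGray(parseInt('100', 2)); //7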

Now if the length of a Gray code is a power of two then we can more efficiently convert it to binary by first setting it to the exclusive-or of itself and the result of shifting it one bit to the right, then setting that to the result of doing the same with it shifted two bits to the right, then four and so on up to a shift of half of its bits, as demonstrated by program 4 for the 32 bit integers that JavaScript uses for bitwise operations.

Program 4: An Efficient 32 Bit Implementation
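This isn't necessarily word for word what program 4 does, but a sketch of the scheme that it demonstrates for 32 bit Gray codes is

//convert a 32 bit Gray code to the non-negative integer that it represents by
//exclusive-oring it with itself shifted right by one, two, four, eight and
//sixteen bits in turn
function fromGray32(u) {
 u ^= u>>>1;
 u ^= u>>>2;
 u ^= u>>>4;
 u ^= u>>>8;
 u ^= u>>>16;
 if(u<0) u += 0x100000000;
 return u;
}

fromGray32(parseInt('110', 2)); //4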

Derivation 2 shows why this is equivalent to the original implementation for Gray codes of such lengths.

Derivation 2: Why It Works
Firstly, let us assume that after the \(k^\mathrm{th}\) exclusive-or the \(i^\mathrm{th}\) least significant bit is given by
\[ b_i^k = \bigoplus_{j=0}^{2^k-1} b_{i+j}^0 \]
where the large encircled plus sign means the exclusive-or of the term that follows it for every index from that beneath to that above it.
Secondly, treating bits beyond the code's length as equalling zero, we have
\[ b_i^{k+1} = b_i^k \oplus b_{i+2^k}^k = \left(\bigoplus_{j=0}^{2^k-1} b_{i+j}^0\right) \oplus \left(\bigoplus_{j=0}^{2^k-1} b_{i+2^k+j}^0\right) = \bigoplus_{j=0}^{2^{k+1}-1} b_{i+j}^0 \]
and so if it holds for \(k\) then it must hold for \(k+1\).
Next, it trivially holds for \(k\) equal to zero since
\[ b_i^0 = \bigoplus_{j=0}^{2^0-1} b_{i+j}^0 = \bigoplus_{j=0}^0 b_{i+j}^0 = b_i^0 \]
Finally, if the length of the Gray code is
\[ n = 2^k \]
we have
\[ b_i^k = \bigoplus_{j=0}^{n-1} b_{i+j}^0 \]
as required.

In the next post we shall implement a genetic algorithm using these improvements.

In Conclusion

Finally, it should be noted that the building block hypothesis is by no means universally accepted as accounting for the behaviour of the genetic algorithm; I for one am a sceptic, having tried and failed to find any evidence to support it[4].
The problem is that it assumes that combinations of the fittest building blocks will be relatively likely to yield some of the fittest individuals and, unfortunately, there's no reason to believe that they will do so in general. Nevertheless, it was the inspiration for the variants of the crossover operator and so I have used it as such here.

References

[1] It's All In The Genes, www.thusspakeak.com, 2017.

[2] Holland, J., Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence, University of Michigan, 1975.

[3] Gray, F., Pulse Code Communication, US2632058A, United States Patent and Trademark Office, 1953.

[4] Harris, R., An Analysis of the Genetic Algorithm and Abstract Search Space Visualisation, Ph.D. thesis, University of Plymouth, 1996.
