Alex's Notes

Mitchell: Chapter 05: Implementing a Genetic Algorithm

Metadata

Core Ideas

Introduces implementation techniques around genetic algorithms.

When to use a GA? A rule of thumb is that they may be competitive if: the search space is large, known not to be perfectly smooth and unimodal, or not well understood; the fitness function is noisy; and the task doesn’t require finding the global optimum.

Varieties of Encoding

Bit strings are the most common encoding. One reason is historical: much of the theory is built on early work which relied on fixed-length binary encodings, and extensions to non-binary encodings aren’t as well developed theoretically. However, bit strings are unwieldy for many problems.

For many applications it is more natural to encode chromosomes with an alphabet of real values or many characters. Some empirical results suggest these can work better than binary encodings depending on the problem, but there is no robust theory yet for predicting which encoding will work best.

Tree encodings have some advantages: they can be open-ended, since the tree can grow arbitrarily large. But this also has pitfalls: trees can grow so big as to be hard to understand or simplify.

At the time of writing, choosing an encoding was still a matter of trial and error rather than a theoretically driven affair.

Adaptive Encoding

In some ways choosing a fixed encoding at the start of the project constrains the evolutionary process too much. We don’t know how best to order bit strings ahead of time to maximize the potential of the crossover process. This is known as the “linkage problem” in the literature. We want functionally related loci to be more likely to stay together on the string under crossover, but it is not clear how this is to be done without knowing ahead of time which loci are important in useful schemas.

We can instead have the GA adapt the encoding as it searches. Another advantage of this is that it does not arbitrarily fix the length of the encoding, and so the complexity of the candidate solutions.

A more open-ended process might include “gene doubling” and “deleting” operations to allow the chromosome length to vary over time. A few methods are mentioned in more detail:

Inversion.

In real genetics, unlike GAs, a gene’s function is often independent of its position in the chromosome. Inverting part of the chromosome will retain much of the ‘semantics’ of the original.

To use inversion in GAs we have to find some way for the functional interpretation of an allele to be the same no matter where it appears in the string. This can be done by encoding each bit as a tuple with an index identifying how it should be evaluated. However, this causes issues with crossover (discussed on p. 160), and hasn’t been shown to produce dramatic improvements in practice.
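A minimal sketch of this position-independent encoding (the function names and details here are my own illustration, not the book's): each allele is stored as a (locus, bit) pair, so reversing a segment rearranges the string without changing its decoded meaning.

```python
def invert(chromosome, i, j):
    """Reverse the segment [i:j] of a chromosome encoded as
    (locus, allele) pairs. Because each allele carries its own
    locus index, the decoded bit string is unchanged."""
    return chromosome[:i] + chromosome[i:j][::-1] + chromosome[j:]

def decode(chromosome):
    """Recover the plain bit string by sorting alleles on their locus."""
    return [allele for _, allele in sorted(chromosome)]

chrom = [(0, 1), (1, 0), (2, 1), (3, 1), (4, 0)]
inverted = invert(chrom, 1, 4)            # layout changes...
assert decode(inverted) == decode(chrom)  # ...semantics do not
```

The crossover difficulty mentioned above shows up here: two parents may carry the same locus index at different string positions, so naive crossover can yield offspring with missing or duplicated loci.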

Crossover Hot Spots

A different approach was to evolve the positions at which crossover was allowed to occur. Attached to each candidate solution in the population was a second string, a “crossover template” that had a 1 at each locus at which crossover was to take place, and 0 where crossover was not to take place. Multi-point crossover takes place at those locations, and the crossover points get inherited. The crossover points are also subject to mutation, with the hope that the combination will evolve good crossover points as well as good chromosomes.

The originators reported improved performance, but this had not been validated at time of writing.
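The mechanism might be sketched as follows. This is a simplified illustration of the template idea, not a faithful reproduction of the original operator; the toggle rule and function name are my own assumptions.

```python
def template_crossover(parent1, parent2, template1, template2):
    """Multi-point crossover whose points come from per-individual
    'crossover templates' (a 1 marks an allowed crossover point).
    Offspring inherit the template bits along with the genes, so
    good crossover points can themselves evolve."""
    child1, child2, tmpl1, tmpl2 = [], [], [], []
    swapped = False
    for g1, g2, t1, t2 in zip(parent1, parent2, template1, template2):
        if swapped:                      # currently copying from the other parent
            g1, g2, t1, t2 = g2, g1, t2, t1
        child1.append(g1); tmpl1.append(t1)
        child2.append(g2); tmpl2.append(t2)
        if t1 == 1 or t2 == 1:           # either template marks this locus
            swapped = not swapped
    return (child1, tmpl1), (child2, tmpl2)

p1, p2 = [0, 0, 0, 0], [1, 1, 1, 1]
t1, t2 = [0, 1, 0, 0], [0, 0, 0, 0]     # one crossover point, at locus 1
(c1, _), (c2, _) = template_crossover(p1, p2, t1, t2)
# c1 == [0, 0, 1, 1], c2 == [1, 1, 0, 0]
```

Mutation of the template bits (not shown) is what lets the population discover new crossover points over time.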

Messy GAs

In messy GAs we build up increasingly longer, highly fit strings from well-tested shorter building blocks.

The messy GA proceeds in two phases. In a ‘primordial phase’ the population is enriched with small, promising candidate schemas. Then in a ‘juxtapositional phase’ the schemas are combined in useful ways.

There is a combinatorial explosion problem here, as outlined in the chapter, and messy GAs were not feasible for large problems at the time the book was written.

Selection Methods

After the encoding, the next big decision is how to perform selection. Selection pressure (survival of the fittest) has to be balanced against the variation introduced by crossover and mutation. Some common schemes are:

Roulette Wheel

Fitness-proportionate selection, used in the module. One issue is that with the relatively small population sizes used in GAs, the actual number of offspring allocated to an individual is often far from its expected value.

An alternative to spinning the wheel N times is Baker’s “stochastic universal sampling”. Here the wheel is spun once, with N equally spaced pointers, which are used to select the N parents.
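A sketch of stochastic universal sampling (my own implementation of the scheme as described):

```python
import random

def sus(fitnesses, n):
    """Baker's stochastic universal sampling: one spin of the wheel,
    n equally spaced pointers. Returns the indices of the n selected
    parents. Each individual is selected at least floor(ExpVal) and
    at most ceil(ExpVal) times."""
    total = sum(fitnesses)
    step = total / n
    start = random.uniform(0, step)          # the single "spin"
    pointers = [start + i * step for i in range(n)]
    selected, cumulative, i = [], 0.0, 0
    for p in pointers:
        # advance to the wheel slot containing this pointer
        while cumulative + fitnesses[i] < p:
            cumulative += fitnesses[i]
            i += 1
        selected.append(i)
    return selected
```

With equal fitnesses the n pointers land in n distinct slots, so every individual is selected exactly once, which is exactly the variance reduction the method is after.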

This doesn’t fix the major problems with the method, though. Early in the search the fitness variance in the population is high, and a small number of individuals are much fitter than the others. Under fitness-proportionate selection these individuals will multiply quickly, preventing the GA from doing further exploration. This is the “premature convergence” problem.

It puts too much emphasis on “exploitation” of highly fit strings at the expense of exploration of other regions of the search space.

Later in the search, when individuals in the population are similar in fitness, there are no real fitness differences for selection to exploit, and evolution grinds to a halt.

So the rate of evolution depends on the variance in the population.

Sigma Scaling

To address such problems, GA researchers have experimented with several “scaling” methods, which map ‘raw’ fitness values to expected values so as to make the GA less susceptible to premature convergence. One example is “sigma scaling”, which keeps the selection pressure relatively constant over the run rather than letting it depend on the fitness variance in the population.

Under sigma scaling an individual’s expected value is a function of its fitness, the population mean, and the population standard deviation. One example would be:

\[\mathrm{ExpVal}(i,t) = \begin{cases} 1 + \frac{f(i) - \bar{f}(t)}{2\sigma(t)}, & \text{if } \sigma(t) \neq 0, \\
1.0, & \text{if } \sigma(t) = 0. \end{cases}\]

where \(i\) is the individual, \(t\) the generation, \(f(i)\) the fitness of \(i\), \(\bar{f}(t)\) the mean fitness of the population at time \(t\), and \(\sigma(t)\) the standard deviation of the population fitnesses at time \(t\). See p. 168 for nuances.
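In code, computing these expected values for one generation might look like the following sketch. Clipping negative expected values to a small positive floor (0.1 here) is one common convention so very unfit individuals keep some chance of reproducing; treat that constant as an assumption.

```python
import statistics

def sigma_scaled_expvals(fitnesses):
    """Sigma scaling: ExpVal(i) = 1 + (f(i) - mean) / (2 * sigma),
    or 1.0 for everyone when sigma is 0. Negative values are clipped
    to 0.1 so weak individuals retain a small reproduction chance."""
    mean = statistics.mean(fitnesses)
    sigma = statistics.pstdev(fitnesses)   # population std deviation
    if sigma == 0:
        return [1.0] * len(fitnesses)
    return [max(0.1, 1 + (f - mean) / (2 * sigma)) for f in fitnesses]
```

Because the scaling is relative to the current mean and spread, a moderately better individual gets a similar edge whether fitness differences are huge (early on) or tiny (late in the run).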

Elitism

This approach retains some of the best individuals for the next generation. Many researchers have found that elitism significantly improves performance.

Boltzmann Selection

In this approach a continuously varying ‘temperature’ controls the rate of selection according to a preset schedule. The temperature starts high, meaning selection pressure is low and every individual has some reasonable chance of reproducing. Gradually the temperature is lowered and the selection pressure increased, so that the GA narrows in on the best part of the search space.
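A typical implementation assigns each individual an expected value proportional to \(e^{f(i)/T}\), normalized by the population average of that quantity (a sketch of the usual formulation; treat the details as my assumption):

```python
import math

def boltzmann_expvals(fitnesses, temperature):
    """Boltzmann selection: ExpVal(i) = exp(f(i)/T) / mean(exp(f/T)).
    High T flattens fitness differences (low pressure); lowering T on
    a schedule makes fitter individuals dominate."""
    exps = [math.exp(f / temperature) for f in fitnesses]
    mean_exp = sum(exps) / len(exps)
    return [e / mean_exp for e in exps]

# Same population, two temperatures: pressure rises as T falls.
hot = boltzmann_expvals([1.0, 2.0], temperature=100.0)   # nearly uniform
cold = boltzmann_expvals([1.0, 2.0], temperature=0.1)    # fitter dominates
```

Note the expected values always average to 1 (so they sum to the population size), matching the convention used by the other scaling methods.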

Because of the issues with fitness-proportionate selection methods and the fixes they require, alternatives rose in popularity, especially rank selection and tournament selection.

Rank Selection

The aim here is to prevent too quick convergence. Individuals are ranked by fitness and their expected value is determined by rank, not absolute fitness. SUS can then be used to choose parents.

The effect increases diversity, slowing the algorithm but ultimately leading to a more successful search.

A couple of specific means of calculating rank are mentioned on p. 170.
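One widely used linear ranking scheme (a sketch in the spirit of those on p. 170; the constants here are assumptions, not necessarily the book's exact formulas) interpolates expected values between a minimum and maximum, with min = 2 − max so the values sum to the population size:

```python
def linear_rank_expvals(fitnesses, max_expval=1.5):
    """Linear ranking: sort by fitness, then assign expected values
    interpolating from min_expval (worst) to max_expval (best), with
    min_expval = 2 - max_expval so the values sum to the population
    size. Only fitness ORDER matters, not absolute differences."""
    n = len(fitnesses)
    min_expval = 2 - max_expval
    order = sorted(range(n), key=lambda i: fitnesses[i])   # worst first
    expvals = [0.0] * n
    for rank, i in enumerate(order):     # rank 0 = worst, n-1 = best
        expvals[i] = min_expval + (max_expval - min_expval) * rank / (n - 1)
    return expvals
```

Because the fittest individual's expected value is capped at max_expval, a freak super-fit individual cannot flood the next generation, which is precisely how ranking prevents premature convergence.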

Tournament Selection

Rank selection requires sorting the whole population by fitness, which is expensive. Tournament selection is designed to be more efficient. Two individuals are chosen at random, and a random number r between 0 and 1 is chosen. If r is below a threshold (e.g. 0.75) the fitter individual is selected as a parent; otherwise the less fit individual is selected. The two are then returned to the population and can be selected again.
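The procedure above is short enough to write out directly (a sketch; the function name is mine):

```python
import random

def tournament_select(population, fitness, k=0.75):
    """Binary tournament: pick two individuals at random; with
    probability k the fitter one wins, otherwise the less fit one
    does. Both stay in the population and may be picked again."""
    a, b = random.sample(population, 2)
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if random.random() < k else weaker
```

Each selection costs two fitness lookups and no sorting, which is why it scales better than rank selection; the threshold k tunes the selection pressure.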

Steady State Selection

In this process only a few individuals are replaced in each generation. Usually a small number of the least fit.

Genetic Operators

The third big decision is the genetic operators to use, notably crossover and mutation.

Crossover

Single-point crossover is the simplest form. It has shortcomings: it cannot combine all possible schemas, and schemas with long defining lengths are likely to be destroyed (this is “positional bias”).

To reduce this problem, many practitioners use two-point crossover. Another alternative is parametrized uniform crossover, where exchange happens at each bit with a probability p (often 0.7 to 0.8). But this can be highly disruptive to any schema.

There is no general best approach, though two-point or uniform crossover is the norm nowadays.
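The two usual operators, sketched (my own implementations of the standard definitions):

```python
import random

def two_point_crossover(p1, p2):
    """Exchange the segment between two random cut points. Less
    positional bias than single-point crossover, since a schema
    spanning the ends of the string can survive."""
    i, j = sorted(random.sample(range(len(p1) + 1), 2))
    return (p1[:i] + p2[i:j] + p1[j:],
            p2[:i] + p1[i:j] + p2[j:])

def uniform_crossover(p1, p2, p=0.75):
    """Parametrized uniform crossover: swap each bit independently
    with probability p. No positional bias at all, but highly
    disruptive of schemas."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        if random.random() < p:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2
```

Both operators conserve the alleles at each locus across the two offspring; they differ only in how they partition them.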

Mutation

Again, there is no generally best approach. One of the most promising ideas is to allow the GA to adapt its own mutation and crossover rates during the search.
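One way to picture the self-adaptive idea: let each individual carry its own mutation rate, jitter the rate before applying it, and let offspring inherit the new rate so selection can favour well-tuned rates. This is entirely my own illustration (in the spirit of evolution strategies); the function name and meta_rate parameter are invented.

```python
import math
import random

def self_adaptive_mutate(bits, rate, meta_rate=0.1):
    """Sketch of self-adaptation: perturb the individual's own
    mutation rate (log-normal jitter), then apply standard bit-flip
    mutation with the new rate. Returns (new_bits, new_rate), so the
    rate rides along with the chromosome."""
    new_rate = rate * math.exp(meta_rate * random.gauss(0, 1))
    new_rate = min(1.0, max(1e-4, new_rate))    # keep the rate sane
    new_bits = [b ^ 1 if random.random() < new_rate else b for b in bits]
    return new_bits, new_rate
```

Individuals whose rates suit the current stage of the search tend to produce fitter offspring, so good rates spread without any external schedule.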

Alternatives

Many alternative operators have been tried, including ‘crowding’, in which a newly formed offspring replaces the existing individual most similar to it, and related approaches that preserve several distinct peaks rather than converging on a single one.

Another approach is to restrict mating, only allowing similar individuals to mate; this again encourages the formation of species. The opposite approach disallows ‘incest’, to preserve variation.

Hyper Parameters

The final decision concerns hyperparameters such as population size, crossover rate, and mutation rate. These typically interact nonlinearly, and so cannot be optimized one at a time. There are no established “best” hyperparameter values, and a huge literature explores this.

How do we evaluate them? We might use “on-line” performance: the average fitness of all individuals evaluated by time t (i.e. over t evaluation steps). Or “off-line” performance: the average, over t evaluation steps, of the best fitness seen up to each evaluation step.
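The two measures differ only in what is averaged. Given the sequence of fitnesses evaluated so far, both can be computed directly (a sketch; the function names are mine):

```python
def online_fitness(history):
    """On-line performance at time t: the mean of ALL fitnesses
    evaluated so far, so every evaluation counts, good or bad."""
    return sum(history) / len(history)

def offline_fitness(history):
    """Off-line performance at time t: the mean, over the t steps,
    of the best fitness seen up to each step."""
    best_so_far, running = float("-inf"), []
    for f in history:
        best_so_far = max(best_so_far, f)
        running.append(best_so_far)
    return sum(running) / len(running)

history = [1.0, 3.0, 2.0, 5.0]
# online averages 1, 3, 2, 5; offline averages the running bests 1, 3, 3, 5
```

On-line performance punishes wasted evaluations, which is why (as noted below) smaller populations can look better under it.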

One early benchmark was De Jong’s, which found the best population size was 50–100, the best single-point crossover rate 0.6 per pair of parents, and the best mutation rate 0.001 per bit.

Later a meta-GA was developed to evolve the best parameters over population size, crossover rate, mutation rate, generation gap, scaling window, and selection strategy.

This improved on-line fitness but not off-line fitness (see p. 176).

Note that smaller populations may do better in on-line evaluation since every individual counts towards the mean.

Many people think the best approach is to have parameter values adapt in real time to the ongoing search. But the extent to which the rates of adaptation correlate with progress remained an open question, needing much more research at the time of publication.