A linear genetic programming system for symbolic regression and classification.

ellenGP is a genetic programming tool for symbolic regression and multi-class classification that incorporates epigenetic learning and uses a stack-based, linear representation.

This code formed the basis of research during my dissertation.

Please note that most of the current development for ellen is happening in the ellyn repo, which is a Python-wrapped version of this code base.

There are some library dependencies, including eigen.

The files have been built in Visual Studio C++ 2010 and in Linux with gcc and the Intel C++ compiler.

ellenGP uses a stack-based, syntax-free, linear genome for constructing candidate equations.
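
To illustrate what a stack-based, syntax-free linear genome can look like in practice, here is a minimal Python sketch. It is not ellenGP's actual C++ interpreter. It assumes that instructions run left to right on a stack and that an operator without enough arguments is simply skipped, which is one common way to keep every genome executable. The function name `evaluate` and the token format are hypothetical.

```python
import operator

# Binary operators supported by this toy interpreter (a small assumed subset).
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(genome, variables):
    """Run a linear genome on a stack and return the top of the stack.

    genome    -- list of string tokens: operators, variable names, or constants
    variables -- dict mapping variable names to float values
    Returns None if the stack is empty at the end.
    """
    stack = []
    for token in genome:
        if token in BINARY_OPS:
            if len(stack) < 2:
                # Assumption: an instruction with too few arguments is skipped,
                # so no genome is ever syntactically invalid.
                continue
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY_OPS[token](a, b))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))  # treat anything else as a constant
    return stack[-1] if stack else None
```

For instance, the genome `["x", "y", "+", "2", "*"]` encodes (x + y) * 2, and a genome that starts with an operator, such as `["+", "x"]`, still evaluates because the leading `+` is ignored.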

It is built to include different evolutionary methods for system identification adapted from the literature. The options include normal tournament selection, deterministic crowding, and age-fitness Pareto selection. All algorithm choices are managed by one parameter file.
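
As a rough illustration of the simplest of these methods, tournament selection can be sketched in a few lines of Python. This is a generic textbook version, not ellenGP's code. It assumes that lower fitness is better (as with an error measure), and the function name `tournament_select` is hypothetical. The `tourn_size` argument mirrors the parameter of the same name in the options table.

```python
import random


def tournament_select(fitness, tourn_size=2, rng=random):
    """Return the index of the winner of one tournament.

    fitness    -- list of fitness values, one per individual (lower is better)
    tourn_size -- number of individuals drawn for the tournament
    rng        -- random module or random.Random instance
    """
    # Draw distinct entrants uniformly at random, then keep the fittest one.
    entrants = rng.sample(range(len(fitness)), tourn_size)
    return min(entrants, key=lambda i: fitness[i])
```

Larger tournaments raise selection pressure, because the best individuals in the population are more likely to appear in any given tournament.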

I've built the project in Visual Studio 2010 Professional as well as C++ Express (which is free from Microsoft), and in Linux with g++ and the Intel C++ compiler using the makefiles. If you use VS 2010 Express, the OpenMP files (which were removed from VS 2010) need to be added to the VS path.

There are two external library dependencies:

- boost libraries - a set of multi-purpose C++ libraries, *needed for RunTrialsMPI only*
- eigen - a C++ template library for linear algebra

In addition to downloading those packages, the paths to them need to be modified in the Makefiles for ellenGP and RunTrials.

Run ellenGP like this:

```
ellenGP sampleparams.txt sampledata.txt
```

As you can see, ellenGP takes two arguments: a parameter file and a data file. The parameter file includes all of the run-time settings for your search. The data file includes all your experimental data. See the sampleparams.txt and sampledata.txt files to see how formatting works.
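
If you want to launch a run from a script, the two-argument call above can be wrapped in Python. This is only a sketch: it assumes the `ellenGP` executable is on your `PATH` (otherwise pass its location via `exe`), and the helper names `ellengp_command` and `run_ellengp` are hypothetical.

```python
import subprocess
from pathlib import Path


def ellengp_command(param_file, data_file, exe="ellenGP"):
    """Build the argument list: executable, parameter file, data file."""
    return [exe, str(param_file), str(data_file)]


def run_ellengp(param_file, data_file, exe="ellenGP"):
    """Check that both input files exist, then run ellenGP and wait for it."""
    for path in (param_file, data_file):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    # check=True raises CalledProcessError if ellenGP exits with an error code.
    return subprocess.run(ellengp_command(param_file, data_file, exe), check=True)
```

Calling `run_ellengp("sampleparams.txt", "sampledata.txt")` is then equivalent to the command line shown above.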

RunTrials will run ellenGP for many trials. It uses OpenMP to parallelize the trials. Here is the syntax:

```
RunTrials sampletrials.txt
```

RunTrials takes one input file (sampletrials.txt). The trials input file contains three columns:

```
[#trials] [parameterfile] [datafile]
```

For example, … 100 ../in/sampleparams.txt ../in/sampledata.txt

There is also an MPI version, RunTrialsMPI, which uses the same syntax but parallelizes the trials over a cluster rather than over the cores of a single node (computer).
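
When many parameter and data files are involved, a trials file in the three-column format described above can be generated by a script instead of by hand. The sketch below assumes the columns are separated by single spaces, as in the example line; the helper name `format_trials` and the output file name are hypothetical.

```python
def format_trials(rows):
    """Return the text of a trials file.

    rows -- iterable of (n_trials, parameter_file, data_file) tuples;
            each tuple becomes one line of the file.
    """
    return "".join(f"{n} {params} {data}\n" for n, params, data in rows)


if __name__ == "__main__":
    text = format_trials([(100, "../in/sampleparams.txt", "../in/sampledata.txt")])
    # Hypothetical file name; pass this file to RunTrials afterwards.
    with open("mytrials.txt", "w") as handle:
        handle.write(text)
```

The resulting file can then be passed to RunTrials in place of sampletrials.txt.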

These are the simple instructions for running ellenGP.

RunTrialsMPI is the same as RunTrials except that it is written to be compiled on clusters (the TACC cluster Stampede as well as the UMass HPCC cluster). MakefileTACC and MakefileUMG have the compilation notes. It has been built using Intel icpc and the MPI compiler mvapich2 from OSU, as well as g++ with mpicxx.

Here is a comprehensive list of all of the options that you can include in the parameter file.

Setting | Default | Description
---|---|---
g | 100 | number of generations
popsize | 500 | population size
limit_evals | 0 | limit the run by point evaluations instead of by number of generations
max_evals | 0 | maximum number of point evaluations
**Generation Settings** | |
sel | 1 | 1: tournament; 2: deterministic crowding; 3: lexicase selection; 4: age-fitness Pareto algorithm
PS_sel | 1 | objectives for Pareto survival. 1: age + fitness; 2: age + fitness + generality; 3: age + fitness + complexity; 4: class fitnesses (classification ONLY); 5: class fitnesses + age (classification ONLY)
tourn_size | 2 | number of individuals in each tournament
rt_rep | 0 | rate of reproduction
rt_cross | 0.8 | rate of crossover
rt_mut | 0.2 | rate of mutation
cross | 3 | 1: ultra; 2: one point; 3: sub-tree
mutate | 2 | 1: point mutation; 2: subtree mutation
cross_ar | 0.025 | crossover alternation rate (ultra only)
mut_ar | 0.025 | mutation alternation rate
align_dev | 0 | on or off; adds Gaussian alignment deviation to crossover
elitism | 0 | save best individual each generation
stop_condition | 1 | if on, the run terminates when a fitness < 1e-6 is reached
init_validate_on | 0 | initial fitness validation of starting population
**Data Options** | |
train | 0 | split data into training and validation sets
train_pct | 0.5 | fraction of data to be used in training
shuffle_data | 0 | shuffle the data before splitting into training and validation
pop_restart | 0 | restart run from previous population specified by pop_restart_path
pop_restart_path | "" | filename of restart population with path
**Results and Printing Options** | |
resultspath | "" | path where results are saved
print_every_pop | 0 | save printout of population at every generation
print_genome | 0 | prints genome for visualization in ParaView
print_novelty | 0 | print number of unique output vectors
print_homology | 0 | print genetic homology in programs
num_log_pts | 0 | number of log points to print (0 means print each generation)
**Classification Options** | |
classification | 0 | defines a classification, rather than regression, problem
class_bool | 0 | interpret class labels as bit-string conversion of boolean stack output
class_m3gp | 0 | use Mahalanobis distance classification fitness
class_prune | 0 | prunes the dimensions of the best individual each generation
**Problem information** | |
intvars | none | variables in data file to use in programs
cvals | none | seed the initial population with certain constant values
seeds | none | seed partial solutions, e.g. (x+y)
AR | 0 | include auto-regressive output variables
AR_n | 1 | order of auto-regression (number of time-steps back)
AR_lookahead | 0 | just predict one output ahead
ERC | 1 | ephemeral random constants
ERCints | 0 | make the ERCs integer valued rather than floats
maxERC | 1 |
minERC | -1 |
numERC | 1 |
**Fitness Settings** | |
fit_type | 1 | 1: mean absolute error; 2: corr; 3: combo; 4: VAF
norm_error | 0 | normalize error by the standard deviation of the target data being used
max_fit | 1.00E+20 | maximum fitness possible
min_fit | 1.00E-20 | minimum fitness possible
estimate_fitness | 0 | coevolve fitness estimators
FE_pop_size | 0 | fitness estimator population size
FE_ind_size | 0 | number of fitness cases for FE to use
FE_train_size | 0 | trainer population size
FE_train_gens | 0 | number of generations between trainer selections
FE_rank | 0 | use rank for FE fitness rather than error
estimate_generality | 0 | estimate how well the solutions generalize using the validation portion of the fitness estimator
G_sel | 0 | which fit_type to use to test generality
G_shuffle | 0 | shuffles data each generation
op_list | `n v + - * /` | available operators: `n v + - * / sin cos log exp sqrt = ! < <= > >= if-then if-then-else &`
weight_ops_on | 0 | weight the operators differently
op_weight | empty | weights of the operators specified in op_list
min_len | 3 | minimum program length
max_len | 20 | maximum length a program is allowed to be
max_len_init | max_len | option to specify different max length for initial population
init_trees | 0 | initialize genotypes as syntactically valid trees rather than randomized stacks
complex_measure | 2 | 1: genotype size; 2: symbolic size; 3: effective genotype size
**Hill Climbing Settings** | |
*parameters* | |
pHC_on | 0 | parameter hill climbing each generation
pHC_its | 1 | number of iterations
*epigenetics* | |
eHC_on | 0 | epigenetic hill climbing
eHC_its | 1 | number of iterations
eHC_prob | 0.1 | probability of a gene being switched
eHC_init | 0.5 | fraction of expressed genes in initial genotypes
eHC_slim | 0 | minimize point evaluations as much as possible
eHC_mut | 0 | do mutation rather than hill climbing
**Pareto Archive Settings** | |
prto_arch_on | 0 |
prto_arch_size | 20 |
**Island model** | |
islands | 0 | use multiple island populations, one for each core
island_gens | 100 | number of generations between shuffling of the island populations
**Lexicase Options** | |
lexpool | 1 | fraction of population to use in lexicase selection events
lex_class | 0 | for a classification problem, use separate class fitnesses as cases
lex_metacases | none | specify extra cases for selection. Options: age, complexity
lex_eps_std | 0 | use epsilon lexicase with eps = standard deviation of error
lex_eps_error | 0 | use epsilon lexicase with error-based epsilons
lex_eps_target | 0 | use epsilon lexicase with target-based epsilons
lex_eps_target_mad | 0 | use epsilon lexicase with median absolute deviation, target-based epsilons
lex_eps_error_mad | 0 | use epsilon lexicase with median absolute deviation, error-based epsilons
lex_epsilon | 0.1 | value of epsilon (ignored for mad and std versions)
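
The lexicase options above are easier to follow with a concrete picture of one selection event. The Python sketch below shows a generic epsilon-lexicase selection step; it is not taken from the ellenGP source. It assumes an error matrix with one row per individual and one column per training case, uses a fixed epsilon by default (in the spirit of `lex_epsilon`), and optionally sets epsilon per case to the median absolute deviation of the population's errors (in the spirit of `lex_eps_error_mad`). The function names are hypothetical.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def epsilon_lexicase_select(errors, epsilon=0.1, use_mad=False, rng=random):
    """Return the index of one selected parent.

    errors[i][j] -- error of individual i on training case j (lower is better)
    epsilon      -- fixed tolerance, used when use_mad is False
    use_mad      -- if True, epsilon for each case is the MAD of all errors on it
    """
    n_individuals = len(errors)
    pool = list(range(n_individuals))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # each selection event sees the cases in a new order
    for j in cases:
        if len(pool) == 1:
            break
        if use_mad:
            eps = mad([errors[i][j] for i in range(n_individuals)])
        else:
            eps = epsilon
        best = min(errors[i][j] for i in pool)
        # Keep only individuals within eps of the best error on this case.
        pool = [i for i in pool if errors[i][j] <= best + eps]
    return rng.choice(pool)  # break any remaining tie at random
```

Because the case order is reshuffled for every selection event, individuals that do well on different subsets of the data each get a chance to be selected, rather than only those with the best aggregate error.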

ellenGP Copyright (C) 2014 William La Cava

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License (License.txt) for more details.