A model’s quality depends heavily on the representation of the data used to train it. Methods that transform the input data so that it better suits a given machine learning method can therefore improve that method’s predictive capacity. We have shown that symbolic regression approaches are competitive in the task of learning better data representations for standard ML tools [1-3]. These approaches have attractive properties, including 1) the ability to represent arbitrary nonlinear relations in the data, 2) scaling that is independent of the number of features in the raw data, and 3) the ability to produce readable transformations. On a suite of 20 classification problems, an ensemble technique built on this idea outperformed 7 state-of-the-art ML methods trained on the raw data [2].
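As a toy illustration of why a learned representation can matter (this is a hand-picked transform standing in for one a GP-based wrapper might evolve, not the method of [1-3]): on XOR-like data, no threshold on a single raw feature separates the classes, but the single symbolic feature z = x0 * x1 does.

```python
import numpy as np

# XOR-like labels: class depends on the *interaction* of x0 and x1,
# so neither raw feature is individually predictive.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Hand-picked symbolic transform (stand-in for a GP-evolved feature):
# in the transformed space, a single threshold separates the classes.
z = X[:, 0] * X[:, 1]
acc = np.mean((z > 0).astype(int) == y)
print(acc)  # 1.0 by construction, since y was defined from the sign of z
```

The point is only that a readable nonlinear transformation can turn a problem that defeats a linear model on raw features into a trivial one.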

A particular instance of symbolic representation learning is our development of M4GP, a multi-class classification strategy that uses GP to learn multi-dimensional representations for a nearest centroid classifier. M4GP has shown promise on several biomedical informatics problems [3-4] and outperformed state-of-the-art methods in identifying epistasis in noisy genetics datasets [4].
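To make the classifier half of this pairing concrete, here is a minimal nearest centroid classifier of the kind M4GP wraps (a Euclidean-distance sketch; the feature space below is fixed by hand, whereas in M4GP it would be evolved by GP):

```python
import numpy as np

def fit_centroids(Z, y):
    """Compute one mean vector (centroid) per class in feature space Z."""
    classes = np.unique(y)
    centroids = np.array([Z[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(Z, classes, centroids):
    """Assign each sample to the class of its nearest centroid."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Three well-separated clusters in a 2-D (transformed) feature space.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(loc, 0.1, size=(30, 2))
               for loc in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

classes, centroids = fit_centroids(Z, y)
acc = np.mean(predict(Z, classes, centroids) == y)
print(acc)
```

The classifier itself is deliberately simple; the representational burden falls on the GP-evolved features, which is what keeps the resulting models compact and readable.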

  1. La Cava, W., & Moore, J. H. (2017). A General Feature Engineering Wrapper for Machine Learning Using epsilon-Lexicase Survival. In European Conference on Genetic Programming (pp. 80–95). Springer, Cham.

  2. La Cava, W., & Moore, J. H. (2017). Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods. In GECCO ’17 (pp. 961–968). Berlin, Germany: ACM.

  3. La Cava, W., Silva, S., Vanneschi, L., Spector, L., & Moore, J. H. (2017). Genetic Programming Representations for Multi-dimensional Feature Learning in Biomedical Classification. In Applications of Evolutionary Computation (pp. 158–173). Springer, Cham.

  4. La Cava, W., Silva, S., Vanneschi, L., Spector, L., & Moore, J. H. (2017). Multi-dimensional Genetic Programming for Multi-class Classification. In review.