Augmentative Topology Agents For Open-ended Learning Muhammad Umair Nasir Supervisor(s): Prof. Christopher W. Cleghorn Dr. Steven D. James A research report proposal submitted in partial fulfillment of the requirements for the degree of Master of Science in the field of Artificial Intelligence in the School of Computer Science and Applied Mathematics University of the Witwatersrand, Johannesburg 10 June 2023 i Declaration I, Muhammad Umair Nasir, declare that this proposal is my own, unaided work. It is being submitted for the degree of Master of Science in the field of Artificial Intelligence at the University of the Witwatersrand, Johannesburg. It has not been submitted for any degree or examination at any other university. Muhammad Umair Nasir 10 June 2023 ii Abstract We tackle the problem of open-ended learning by improving a method that simultaneously evolves agents and increasingly challenging envi- ronments. Unlike previous open-ended approaches that optimize agents using a fixed neural network topology, we hypothesize that the open- endedness and generalization can be improved by allowing agents’ con- trollers to become more complex as they encounter more difficult envi- ronments. Our method, Augmentative Topology EPOET (ATEP), ex- tends the Enhanced Paired Open-ended Trailblazer (EPOET) algorithm by allowing agents to evolve their own neural network structures over time, adding complexity and capacity as necessary. Empirical results demonstrate that ATEP results in open-ended and general agents capa- ble of solving more environments than a fixed-topology baseline. We also investigate mechanisms for transferring agents between environ- ments and find that a species-based approach further improves the open- endedness and generalization of agents. iii Acknowledgements First and foremost I would like to acknowledge the prompt and tireless help of my supervisor, Prof. Christopher Cleghorn, who has guided me, in a way that was very easy for me to grasp, and have calmly listened to many of my impractical ideas to produce a very practical one. Every single meeting was a progress towards a more focused and cutting-edge idea. These few lines do not comprehend my acknowledgment for Prof. Cleghorn as he does not only guide me with respect to thesis but also keeps the discussion in a way which would lead to advices about future career directions. Secondly I would like to acknowledge the efforts of my co-supervisor, Dr. Steve James. In a very short time of interaction, he has proved to have helped a lot by navigating me towards the right direction with respect to this thesis, and research mentality as well. I am looking forward to having a great experience from both of the supervisors this year. I would like to extend my special thanks to my parents and my wife, who have provided me with unconditional support during the ups and downs from the start of the degree. Lastly, I will extend my acknowledgements to my son, He always comes by me when I am working to cheer me up and to discuss ideas as well. iv Contents Declaration i Abstract ii Acknowledgements iii List of Figures vi List of Tables ix 1 Introduction 1 1.1 Introduction to Open-ended Learning . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Research Questions, Aims, and Objectives . . . . . . . . . . . . . . . . 4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . 4 Research Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . 
4 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 6 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Neuroevolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6 Neuroevolution of Augmenting Topologies . . . . . . . . . . . . . . . 14 2.7 Open-ended Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.8 POET and Enhanced POET . . . . . . . . . . . . . . . . . . . . . . . . 19 2.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 v 3 Research Methodology 27 3.1 Open-endedly evolving the topology of agents . . . . . . . . . . . . . 27 3.1.1 Augmentative Topology Enhanced POET . . . . . . . . . . . . 27 Fitness-Based Transfer . . . . . . . . . . . . . . . . . . . . . . . 29 Species-Based Transfer . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 31 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 EPOET Framework . . . . . . . . . . . . . . . . . . . . . . . . . 33 ATEP Variants and Baselines . . . . . . . . . . . . . . . . . . . 33 Hyperparameter Selection . . . . . . . . . . . . . . . . . . . . . 34 4 Results and Discussion 35 4.1 Open-endedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Nodes complexity exploration . . . . . . . . . . . . . . . . . . . . . . . 37 4.3 Generalization Ability . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5 Conclusion And Future Directions 45 Bibliography 47 A Hyperparameter settings 58 vi List of Figures 2.1 This diagram illustrates a simple NN, also called a perceptron. Yel- low nodes represent inputs that are multiplied by the weights of the connection it corresponds to. These are then summed together and passed to an activation function f which produces the output. . . . . 8 2.2 This diagram illustrates a fully connected Deep NN with two hidden layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 A general Reinforcement Learning algorithm flow. An agent takes an action, for which it gets a reward and a different environment state. It repeats till the episode is over. . . . . . . . . . . . . . . . . . . . . . 11 2.4 Different types of operators that can be used in a GA [47] . . . . . . . 13 2.5 Representation of Genotype and Phenotype of NEAT individual [102]. Node genes display all nodes in the network. Connection genes store all information about the connection, such as in and out nodes, the weight of the connection, if it is enabled or disabled, and the innova- tion number. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 The diagram illustrates structural mutation with the addition of con- nection and nodes [102]. Connection genes represent the mutations. The genotype, which is above the phenotype (or structure of the net- work), starts with an innovation number. An innovation number is a historical marker that is used to identify the ancestors of the node. The genotype then contains a connection between nodes. 
In add con- nection mutation, we add a connection which then gets an incremen- tally new innovation number. For add nodes mutation, we first disable the connection 3− > 5 then add a new node 6, which will add two new genes 8 and 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 vii 2.7 Figure displays crossover between two parent genomes and explains how similar, disjoint and excess genes work [102]. From similar genes, we will always select from the fitter parent. The disjoint and excess genes are always inherited as they are. Innovation number defines excess and disjoint genes. Disjoint means gene present within the range of maximum innovation number of the other parent and ex- cess means out of the range of maximum innovation number present in the other parent genotype. . . . . . . . . . . . . . . . . . . . . . . . 17 2.8 A view of the 2D Bipedal Walker Hardcore Environment [115]. Possible obstacles are stumps, gaps, steps and roughness of the surface. . . . . 20 2.9 Figure illustrates the process to calculate PATA-EC score[114]. First, all active and archived agents are evaluated on the created environ- ment E. The scores of the evaluations are clipped between a min-max range. The scores are ranked and then normalized. Thus, for one en- vironment we get a vector of the length of the number of active and archived agents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.10 A CPPN generator (left) generating a level (right). [114] . . . . . . . . 22 2.11 Example environment produced by EPOET. [114] . . . . . . . . . . . 23 3.1 A flowchart demonstrating the flow of the ATEP framework, with blocks in green being where ATEP differs from POET. For both EPOET and ATEP, each environment is associated with an agent, represented by an ES population for EPOET and a NEAT population for ATEP. The environment images used in the chart were created by ATEP. Refer to Appendix A for pseudocode describing the transfer mecha- nisms used in ATEP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 A two-legged agent with a hull from OpenAI Gym’s 2D Bipedal Walker environments [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1 Accumulated Number of Novel Environments Created and Solved (ANNECS). To save the compute, we stop the NT- and RT-ATEP ex- periments early, as it is clear that they perform poorly. The error bars are standard deviation, the line represents the mean and both are cal- culated over 5 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 viii 4.2 Cumulative sum of the number of function evaluations, with the Y-axis, converted to a log-scale. While ATEP requires significantly more function evaluations, we find that its total wall-time is only 3 times more than EPOET, as the neural networks are generally smaller and each evaluation does not take as long. . . . . . . . . . . . . . . . . 37 4.3 Mapping of fitness to nodes. We plot values every 1000 iterations, starting at iteration number 150. Each dot represents a specific itera- tion, as well as the mean fitness overall environments and the mean number of nodes in the population. The distribution on top repre- sents the distribution of nodes while the distribution on the left rep- resents fitness distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.4 Analysis with respect to the number of nodes. Figure (a) shows FNR along iterations, (b) shows ANR along iterations, (c) shows the addi- tion of nodes along iterations . . . . . . . . . . . . . . . . . . . . . . 
. 39 4.5 From left to right: NNs at iterations 0 and 2000 of when it was created. 40 4.6 From left to right: NNs at iteration 4000 and 6000 of when it was created. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.7 Figure (A) shows each algorithm being tested on the 20 latest envi- ronments created by all other algorithms, i.e., each algorithm is eval- uated on 60 environments. The Y-axis shows the percentage of envi- ronments in each category. Each test is conducted for 30 runs and the mean scores are taken. Figure (B) shows the maximum score of the 30 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.8 Figures showing generalization capabilities. Figures (a), (b) and (c) show agents of 80 solved environments being tested on all 80 envi- ronments, for EPOET40x40, FBT-ATEP, and SBT-ATEP respectively. Note that EPOET20x20 does not take part in this test as it failed to produce 80 environments in the run. . . . . . . . . . . . . . . . . . . . 43 4.9 Action Distributions for (top row) SBT-ATEP, (middle row) FBT-ATEP, and (bottom row) EPOET40x40. Each column represents one specific dimension of the action array. . . . . . . . . . . . . . . . . . . . . . . . 44 ix List of Tables 2.1 Environmental parameters for POET [115]. . . . . . . . . . . . . . . . 19 A.1 EPOET general hyperparameter settings for ATEP . . . . . . . . . . . 58 A.2 ES hyperparameter settings . . . . . . . . . . . . . . . . . . . . . . . . 58 A.3 NEAT hyperparameter settings . . . . . . . . . . . . . . . . . . . . . . 59 A.4 CPPN hyperparameter settings . . . . . . . . . . . . . . . . . . . . . . 60 1 Chapter 1 Introduction 1.1 Introduction to Open-ended Learning Evolution has taken place for billions of years and is taking place as we speak. It has always been open-ended toward the goals it reached. Human evolution, from cavemen to trying to land on Mars, is open-ended as we did not think of landing on Mars when we were cavemen. We were not thinking of creating computers when we made vacuum tubes. There are many examples in human history alone, and not considering the evolution of the universe itself, which directs us to believe that we are goal-oriented beings but innovations happen open-endedly. We also witness nature’s spectacular ability to invent through the diversity of every single living being. We see the diversity in each species separately, from plants to different animals to human beings as well, all of us have diversity within ourselves. This leads us to believe that diversity and innovation comes from an ongoing process that was not something in past but remains the same for 4 billion years, thus this ongoing creative process is called open-ended evolution [75]. The concept of open-ended evolution has been a part of the artificial life worlds (or ALife worlds) research field for approximately two decades. These worlds, like Tierra [84], Avida [71], Division Blocks [95], Polyworld [119] and Chromaria [94], have ALife creatures in them, and have goals like running around and being a predator or prey. Induction of the concept of open-ended evolution, in ALife, was necessary as this is how humans reached their peak of intelligence and nature, at the peak of diversity. ALife, creating worlds through open-ended evolution, has led researchers to think about how open-ended evolution or open-endedness can bene- fit the AI community. 
Stanley, Lehman, and Soros [101] have urged researchers to bring open-endedness in algorithms to achieve human-level intelligence by pro- viding guidance on what must be the ingredients of these open-ended algorithms and 2 how should we expect them to behave. Stanley [99] points out the fact that open- endedness deserves a place side-by-side with Artificial Intelligence (AI) as one of the great computational challenges of the time. From AI’s perspective, open-ended algorithms mean that the algorithms that are theoretically never-ending and the solution that needs to be solved complexify over time. Open-ended evolution truly came into the field of AI when one of the Topology and Weight Evolving Artificial Neural Networks (TWEANNs) and NeuroEvolution (NE) methods [36, 39, 66, 116], named NeuroEvolution of Augmenting Topologies (NEAT), by Stanley and Miikkulainen [102] surfaced and gained popularity as it solved complex Reinforcement Learning (RL) [107] tasks very efficiently, through evolving topologies. This opened up ideas that we can indefinitely evolve Neural Networks (NNs). It was early in this research space and much was yet to be discovered and so Lehman, Stanley, et al. [56] introduced Novelty Search (NS) algorithm with the intent to drive the research community one step closer towards open-ended evolu- tion. The intent behind the NS algorithm was that the fitness-based Evolutionary Algorithms (EAs) usually ignored paths that were promising but were not yielding good results at first. Novelty search rewarded agents for finding novel paths which were not explored before. This led to the emergence of new types of algorithms called Quality Diversity (QD) algorithms which include diversity with an objective sense of progress. While QD has successfully been used in numerous domains, such as robotic locomotion [16, 68, 108], video game playing [26] and procedural content gener- ation [50, 25, 5], it still is not completely open-ended. One reason for this is that the search space for phenotypical behaviour descriptors, i.e. defining a space of be- haviours to search within, remains fixed [68]. This descriptor space is vast but lim- ited, and at some point in time, the full spectrum of search space is sampled. This significantly depreciates the pressure towards searching more diverse solutions. A way to circumvent the fixed behavioural descriptors is with a coevolutionary [76] approach which is the focus of our research. The coevolution brings the prop- erties of multi-agent interactions as two or more populations to evolve and interact with each other in a cooperative [117] or competitive [31] dynamics. Still, the co- evolution will not bring open-endedness if the goals remain static. Thus Minimal Criterion Coevolution (MCC) [7] opened up a new path by coevolving problems and solutions. Here, MCC evolved mazes as problems and agents as solutions. This enabled new environments to emerge from previous ones and new agents 3 that would solve them. This ongoing coupled interaction gave the possibility of open-endedness. MCC introduced a new kind of coevolution where the popula- tions can evolve only when meeting a minimum criterion with respect to the other population. The maze solver (a neural network) has the minimum criteria to solve the maze and a maze has the minimum criteria to be at least solved by one maze solver. This leads to the continuous evolution of mazes and maze solvers without behaviour characterization, which is a need for QD algorithms. 
However, MCC had some limits; for instance, it only allows new problems if they are solvable by individuals in the current population. This leads to only slight increases in difficulty, and complexity which only arise randomly. Taking this into account, Paired Open-ended Trailblazer (POET) [115] builds upon MCC, but instead allows the existence of unsolvable environments, if it was likely that some individ- uals could quickly learn to solve these environments. POET further innovates by transferring agents between different environments, to increase the likelihood of solving hard problems. While POET obtained state-of-the-art results, its diversity slows down as it evolves for longer. Enhanced POET [114] adds improved algorith- mic components to the base POET method, resulting in superior performance and less stagnation. Enhanced POET, however, uses agents with fixed topology neu- ral network controllers. While this approach works well for simple environments, it has an eventual limit on the complexity of tasks it can solve: at some point of complexity, the fixed topology agents may not have sufficient capacity to solve the environments. To address this issue, we propose Augmentative Topology Enhanced POET (ATEP), which uses NEAT to evolve agents with variable, and potentially unbounded, net- work topologies. We argue that fixed-topology agents will cease to solve environ- ments after a certain level of complexity and empirically show that ATEP outper- forms Enhanced POET (EPOET) in a standard benchmark domain. Finally, we find that using NEAT results in improved exploration and better generalization com- pared to Enhanced POET. 1.2 Problem Statement Existing methods [44, 110, 114] have been implemented on a fixed topology with optimizers like Evolutionary Strategies (ES) [89], Maximum a Posteriori Optimiza- tion (MPO) [1], Proximal Policy Optimization (PPO) [91] etc. We see a problem here 4 that eventually, at some point of complexity, these fixed topology agents’ will cre- ate delays to solve the environment. For a true open-ended, we need to have agents that continuously augment their topologies, which will complexify as the environ- ments complexify. Agents with fixed topology tend to stagnate as the environment complexifies, which is observed in EPOET. 1.3 Research Questions, Aims, and Objectives Research Questions By extending our problem statement defined in section 1.2, we ask the question: can agents that augment their topologies learn endlessly through the framework of EPOET? Learning endlessly is the essence of Open-ended Learning algorithms, which implies that it should not stop learning when environments are highly com- plex. EPOET has introduced a measure to quantify open-endedness which we will discuss in later sections. The second question that we ask is can we use a different transfer mechanism that aids more open-endedness and produces better-generalized agents? By gen- eralization, we mean agents that can solve some environments with high rewards while they can solve some environments with achieving minimum reward criteria to solve environments and are not catastrophically failing in any environment. Research Aims We aim to investigate how augmentative topology neural networks improve agents’ behaviour toward increasingly complex environments. This research also aims to provide better generalization capabilities through augmentative topology agents. 
Research objectives The objectives of the research are to improve EPOET through an algorithm that aug- ments the topology of the agent’s neural network, such as Neuroevolution of Aug- menting Topologies (NEAT). Another objective is to improve open-endedness and generalization capabilities through transfers that are specifically derived through the qualities of an algorithm that augments the topologies of agents. 5 1.4 Overview In chapter 2, we will discuss all underlying algorithms and concepts that are needed to understand our research in depth. Our research methodology will be discussed in chapter 3, where we will discuss all the components of our proposed algorithm and the experimental setup for it. It will be followed by chapter 4 which will be about results and discussion on results. In the last chapter, chapter 5, we will end our research by concluding our work and by giving some possible directions for future works. 6 Chapter 2 Background In this chapter, we will discuss the underlying algorithms that are necessary for our research. We will go into the depth of the original variant of the algorithm but we will leave the other variants for the readers to explore. We will briefly look into the concept of Machine Learning, Neural Networks, and Reinforcement Learning. We will then explain the Genetic Algorithms that make the basis of NEAT. We will then move forward to NEAT as we use it as the optimizing algorithm. Before looking into POET and Enhanced POET (EPOET), we will briefly go through the concepts of Open-ended Learning. 2.1 Machine Learning Machine Learning (ML) [64] is a field of Artificial Intelligence (AI) [118] that uses data and experience to learn a solution to a problem. ML is not programmed to find an explicit solution but to find a solution based on trial and error. This results in diverse solutions from one ML algorithm. ML has been applied to various do- mains, such as predicting stocks [79], classifying images [58], translating from one language to another [96], learning how to play games [40], folding proteins [46], and many more. All application domains may be vastly diverse, but the mechanism of learning for an ML model remains exceedingly similar. For the ML algorithm, one iteration will mean one look at the data available. This available data is called training data. Over the course of iterations, the algorithm improves its performance by tweaking the proposed solution to build a model that reaches the desired goal. The tweaking is done through a loss function that calculates how far the proposed solution is from the desired solution. Traditionally, we separate testing data that the model has not seen and examine how well the trained model performs on it. 7 ML approaches are divided into three main categories: Supervised learning, un- supervised learning, and reinforcement learning. The difference depends on how the model is being trained. Supervised Learning is a technique where the training data available has desired outputs as well [10]. These outputs are known as labels. The model trains to predict the label, thus improving upon it. Unsupervised Learn- ing is a technique where we have data but no output labels [11]. The model has to learn patterns in the data to separate groups of data that should belong under a single label. Reinforcement Learning is a technique where the algorithm learns through interaction with the environment [106]. We will look into it in more detail in an upcoming section as it is one of the basic concepts needed for our research. 
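To make the idea described in this section concrete — iteratively tweaking a model so that a loss function on training data decreases, and then checking the trained model on unseen test data — the following is a toy Python sketch. It is purely illustrative and is not part of the methods used later in this report; the data, the one-parameter model, and the learning rate are arbitrary assumptions.

import random

# Toy supervised-learning loop: tweak a single parameter w so that the mean
# squared error (the loss function) on training data decreases, then check
# generalization on held-out test data.

random.seed(0)
true_w = 3.0

def make_point():
    x = random.uniform(-1, 1)
    return x, true_w * x + random.gauss(0, 0.05)

def loss(w, data):
    # Mean squared error between the model's predictions w*x and the labels y.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

train_data = [make_point() for _ in range(100)]
test_data = [make_point() for _ in range(20)]

w = 0.0                       # initial guess for the model parameter
learning_rate = 0.1
for iteration in range(200):  # each iteration is one "look" at the training data
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in train_data) / len(train_data)
    w -= learning_rate * grad

print(f"learned w = {w:.3f}, train loss = {loss(w, train_data):.5f}, "
      f"test loss = {loss(w, test_data):.5f}")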
2.2 Neural Networks

Neural Networks (NNs), also known as Artificial Neural Networks, are computing systems inspired by the human brain [43]. Like the brain, an NN has neurons, which are called nodes in computational sciences. These nodes are connected through weighted connections that resemble the synapses of the brain. In the brain, an electric signal passes from one neuron to another to perform a function; to replicate this process, NNs pass numerical values through their connections.

Figure 2.1 shows a simple perceptron. Perceptrons are the building blocks of NNs. The output value of a perceptron is calculated by summing the weighted input vector and passing the result to an activation function. This simple perceptron can act as a linear classifier, deciding whether an input vector belongs to one category or another. A collection of these perceptrons is called a Multi-Layer Perceptron (MLP), or simply an NN. Figure 2.2 shows a fully connected NN with two hidden layers. The number of hidden layers can be scaled arbitrarily. Similar to the perceptron, each node receives an input vector, which is summed and passed to an activation function; the resulting value is then inserted into the input vector of the next layer, and the same process repeats until the output node, which produces the network's output. Such NNs are often called Deep Neural Networks because they can have multiple hidden layers. Deep Neural Networks are used in a sub-field of Machine Learning called Deep Learning (DL). DL is used for all kinds of ML problems, such as supervised learning, unsupervised learning, and reinforcement learning. Different types of architectures, such as convolutional neural networks [73] and transformers [112], are successfully used for tasks like object detection in images or video [45], translation from one language to another [96], protein folding [46], and beating world champions in complex board games like Go [37].

FIGURE 2.1: This diagram illustrates a simple NN, also called a perceptron. Yellow nodes represent inputs that are multiplied by the weights of their corresponding connections. These are then summed together and passed to an activation function f which produces the output.

FIGURE 2.2: This diagram illustrates a fully connected Deep NN with two hidden layers.

Conventionally, DL uses gradient descent [86] and backpropagation [85] to improve the predictions of the NN. Gradient descent optimizes the weights of an NN by incrementally moving towards a better weight combination, while backpropagation is used to compute the gradient of the loss function. The loss function, as discussed in the Machine Learning section, determines how incorrect the model is; over iterations, the model is moved towards the least incorrect solution. Loss functions must be differentiable for backpropagation to work.

NNs use activation functions to keep values in a desired range and to produce the desired output. One such activation function is the hyperbolic tangent (tanh). The tanh function takes a real-valued number and squashes it to the range between -1 and 1; in other words, it maps the input to a value between -1 and 1. It is an S-shaped curve that is symmetric around the origin (0,0). We explain tanh because we use it in this research. Mathematically, the function is defined as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))    (2.1)

Similar to other activation functions, tanh is non-linear.
This property allows neural networks to learn and represent complex patterns. The tanh function is also zero-centred, which makes it easier for models to learn in some cases: it can output both positive and negative values, which can be more informative for subsequent layers. The derivative of tanh (used during backpropagation) is easier to compute than that of many other functions. The derivative is 1 − tanh²(x), which is always positive and varies from 0 (for inputs far from 0, where the function saturates at -1 or 1) to 1 (at x = 0). With the above-mentioned properties, tanh is a suitable activation function for our research.

2.3 Reinforcement Learning

Reinforcement Learning (RL) is an ML technique in which an algorithm learns through experience in, and interaction with, an environment [106]. In RL, the ML model is known as an agent that interacts with the environment in which it is placed, with the goal of maximizing total reward. Rewards are calculated through a reward function that decides which actions produce a positive reward and which produce a negative reward (also known as regret). An agent interacting with an environment is generally modelled as a Markov Decision Process (MDP) [80]. Mathematically, an MDP is a control process with discrete time steps. An MDP consists of the following:

• A set of actions, A, taken by the agent.
• A set of environment and agent states, S.
• A transition probability Pa(s, s′) = P(st+1 = s′ | st = s, at = a), the probability of transitioning from state s to state s′ at time t when action a is taken.
• A reward Ra(s, s′), received after action a causes the transition from s to s′.

Essentially, at each time step t the agent is in a current state st with reward rt. The agent then chooses an action at from the possible actions it can take, which is sent to the environment. The environment progresses to a new state st+1 with the reward rt+1. Figure 2.3 shows a general RL algorithm flow.

FIGURE 2.3: A general Reinforcement Learning algorithm flow. An agent takes an action, for which it gets a reward and a different environment state. It repeats till the episode is over.

RL algorithms that use Neural Networks are called Deep RL [32]. Deep RL has been used for many complex applications, such as autonomous driving [52], beating world champions in the game of Go [37], creating video game bots that can play games in a general way [40], and generating algorithms for faster matrix multiplication [30].

2.4 Genetic Algorithms

In recent years, metaheuristic algorithms [20] have been used to solve complex real-life problems in various domains of engineering [53]. A metaheuristic algorithm attempts to find the most feasible solution out of all possible solutions to an optimization problem [20]. Most metaheuristic algorithms are inspired by biological evolutionary processes and the swarm behaviours of different species, usually birds, flies, and insects [6]. Metaheuristic algorithms fall into two main categories: single-solution-based and population-based. Well-known single-solution-based metaheuristics include Tabu Search [35] and guided local search [113]. However, single-solution-based algorithms are prone to getting stuck in local optima. Population-based metaheuristic algorithms instead utilize multiple candidate solutions to search for global optima.
Well-known population-based algorithms include Genetic Algorithms (GAs) [42], Particle Swarm Optimization [48] and Ant Colony Optimization [21]. We go into more depth on GAs as they are important for our research.

GAs are inspired by natural selection [27] and emulate evolution, and thus belong to a field of study called Evolutionary Algorithms or Evolutionary Computation [3]. Each chromosome (individual) passes through evolutionary operators, such as selection, crossover, and mutation, in order to improve a fitness function [63], where the strongest survive and the weakest are eliminated. A fitness score derived from the fitness function decides how good an individual chromosome is. In selection, chromosomes are selected on the basis of their fitness scores for further processing. In crossover, a set of chromosomes is chosen and subsequences are exchanged between them to create offspring. In mutation, some bits of a chromosome are randomly flipped (or altered by other techniques, such as displacement) on the basis of a probability [3, 41, 120]. Another important factor when setting up a GA is the representation of the chromosome, which is called the chromosome encoding.

Algorithm 1: Genetic Algorithm
Input: Population size N, the maximum number of iterations MaxIter, and a fitness function f(.)
Generate a random initial population Pk (k = 1, 2, 3, ..., N);
Compute fitness f(Pk);
foreach i < MaxIter do
    foreach n in N do
        C1, C2 = Selection();
        O1 = Crossover(C1, C2, CrossoverProbability);
        O1 = Mutation(O1, MutationProbability);
        NewPopulation = Add(O1);
    end
    Replace the old population with NewPopulation;
end

An example of the procedure of a GA is as follows. A population of N chromosomes is initialized randomly and the fitness of each chromosome is computed. Two chromosomes, say C1 and C2, are then selected from the population according to their fitness values. A crossover operator, such as single-point crossover (or any other variant), is applied to the selected chromosomes with some crossover probability to produce a new chromosome, called the offspring O1. A mutation operator, such as the uniform mutation operator [83], is then applied to O1 with some mutation probability. The mutated offspring is placed in the new population. These selection, crossover, and mutation operators are repeated on the current population until the new population is complete, which then replaces the old population. Algorithm 1 summarizes the flow of a GA. There are many types of chromosome encodings and of selection, mutation, and crossover operators; Figure 2.4 shows different types of operators a GA can use, although this is not an exhaustive list.

FIGURE 2.4: Different types of operators that can be used in a GA [47]

2.5 Neuroevolution

NeuroEvolution (NE) is a field within Evolutionary Algorithms (EAs) that applies EAs to Neural Networks (NNs) [33]. The goal of NE is to optimize the connection weights and the topology of NNs. Similar to EAs, NE uses evolutionary operators like selection, mutation and crossover to produce better individuals in the population, and the performance of these individuals is decided through a fitness score calculated by a fitness function. For example, if we use a Genetic Algorithm to optimize the weights of an NN, the flow of the algorithm remains the same as Algorithm 1; only the encoding differs.
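As an illustration of this idea, the following Python sketch applies the loop of Algorithm 1 to weight optimization: each chromosome is simply the flat list of weights of a small fixed-topology network. It is illustrative only; the fitness function is a placeholder stand-in for a task reward, and the population size, operator probabilities and mutation scale are arbitrary assumptions rather than settings used in this research.

import random

N_WEIGHTS = 8          # e.g. a tiny fixed-topology network, flattened into a chromosome
POP_SIZE = 20
MAX_ITER = 50
CROSSOVER_PROB = 0.7
MUTATION_PROB = 0.1

def fitness(chromosome):
    # Placeholder fitness: reward weights that are close to an arbitrary target vector.
    target = [0.5] * N_WEIGHTS
    return -sum((w - t) ** 2 for w, t in zip(chromosome, target))

def selection(population):
    # Tournament selection of two parents.
    def pick():
        a, b = random.sample(population, 2)
        return a if fitness(a) > fitness(b) else b
    return pick(), pick()

def crossover(c1, c2):
    # Single-point crossover, applied with some probability.
    if random.random() < CROSSOVER_PROB:
        point = random.randrange(1, N_WEIGHTS)
        return c1[:point] + c2[point:]
    return list(c1)

def mutate(chromosome):
    # Uniform mutation: perturb each weight with some probability.
    return [w + random.gauss(0, 0.1) if random.random() < MUTATION_PROB else w
            for w in chromosome]

population = [[random.uniform(-1, 1) for _ in range(N_WEIGHTS)]
              for _ in range(POP_SIZE)]

for _ in range(MAX_ITER):
    new_population = []
    while len(new_population) < POP_SIZE:
        c1, c2 = selection(population)
        new_population.append(mutate(crossover(c1, c2)))
    population = new_population          # replace the old population

best = max(population, key=fitness)
print("best fitness:", round(fitness(best), 4))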
Genetic Algorithms [105], Evolutionary Strategies [89], Genetic Programming [111], and Differential Evolution [103] are some of the many EAs that have been used to optimize weights in NE, where the topologies remain fixed. Topology and Weight Evolving Artificial Neural Networks (TWEANNs) [67, 93, 98] use these EAs with different genetic encodings. Some TWEANNs have used GAs with a bit-string representation, others with a graphical encoding structure, and some have used only mutation operators, arguing that crossover of topologies has a negative impact and leads to a loss of functionality, which we discuss in the upcoming section. Others have used indirect encodings [24], in which the genome is represented by another program, allowing a more expressive encoding. NE has been successfully used in many fields such as reinforcement learning [89], evolutionary robotics [15], and procedural content generation for video games [60].

2.6 Neuroevolution of Augmenting Topologies

We now explain NeuroEvolution of Augmenting Topologies (NEAT), as it is the underlying optimization algorithm for our research. NEAT takes inspiration from the GA's evolutionary operators and applies them to neural networks, and it revolutionized the field of NeuroEvolution (NE) by introducing a few distinctive features. NEAT starts off with a population of basic NN architectures, i.e., input neurons connected directly to output neurons without any hidden layers. The individual (chromosome), also referred to as the genome (or genotype), consists of all information regarding nodes and connections. Figure 2.5 illustrates what a genome looks like and how the corresponding neural network (the phenotype) is eventually constructed.

FIGURE 2.5: Representation of the genotype and phenotype of a NEAT individual [102]. Node genes display all nodes in the network. Connection genes store all information about a connection, such as its in and out nodes, the weight of the connection, whether it is enabled or disabled, and its innovation number.

There are two types of mutation that can happen: adding a node, or mutating existing connections. If a node is added between two nodes (a start node and an end node), the previous connection is disabled but remains in the genome, and two new connections are added: one from the start node to the newly added node with the weight of the now-disabled connection, and a second connection from the newly added node to the end node with a weight of 1. If a connection is added between a start and an end node, its weight is initialized randomly. Weights are perturbed in every generation. In this way, the neural network is complexified gradually.

FIGURE 2.6: The diagram illustrates structural mutation with the addition of connections and nodes [102]. Connection genes represent the mutations. The genotype, which is shown above the phenotype (the structure of the network), starts with an innovation number. An innovation number is a historical marker that is used to identify the ancestors of the node. The genotype then contains a connection between nodes. In the add-connection mutation, we add a connection which then gets an incrementally new innovation number. For the add-node mutation, we first disable the connection 3 -> 5, then add a new node 6, which adds two new genes 8 and 9.

One of the major problems in NE was the Permutations Problem [82] or, as it is more commonly known, the Competing Conventions Problem [65].
Competing conventions refers to there being more than one way for an NN to express the same solution to an optimization problem. When genomes that solve the same problem but do not share the same structure perform crossover, the resulting child is likely to be damaged and non-functional. To address this, NEAT introduced historical markings, as can be seen in Figure 2.7. Every gene in a genome is assigned a unique number. When two individuals are about to cross over, the genes with matching numbers are aligned and the crossover happens between them. The remaining genes are called either disjoint or excess, depending on whether they occur within the range of the markings of the matching genes or outside it, respectively. These excess and disjoint genes are inherited from the more fit parent.

FIGURE 2.7: Figure displays crossover between two parent genomes and explains how similar, disjoint and excess genes work [102]. From similar genes, we always select from the fitter parent. The disjoint and excess genes are always inherited as they are. The innovation number defines excess and disjoint genes: disjoint genes occur within the range of the other parent's maximum innovation number, while excess genes fall outside that range.

Speciation [2] was studied in Genetic Algorithms before NEAT applied it to NE for the first time. Speciation creates species in the population by measuring a delta coefficient that quantifies how similar individuals are to each other. Explicit fitness sharing then allows individuals with similar genomes to share fitness, thus preserving innovation and creating diversity.

δ = c1E/N + c2D/N + c3·W    (2.2)

In Equation 2.2, δ is the delta coefficient; c1, c2 and c3 are coefficients that weight the importance of each factor; N is the number of genes in the larger genome; E and D are the numbers of excess and disjoint genes, respectively; and W is the average weight difference of the matching genes. δ allows us to speciate using a compatibility threshold δt: δ < δt means the two genomes belong to the same species.

NEAT beat tough benchmarks such as double pole balancing without velocity information against state-of-the-art NE methods of the time, like Cellular Encoding and Enforced Subpopulations. The authors also performed a comprehensive ablation study showing that all parts of NEAT are necessary to achieve the best results. NEAT has also demonstrated superior performance compared to fixed-topology approaches, and has been used in numerous subsequent research works to great success [14, 56, 90, 97, 100].

2.7 Open-ended Learning

Open-ended Learning is a field of Machine Learning in which two populations, of agents and of environments, co-evolve endlessly [99]. Theoretically, such an algorithm should not stagnate and should produce artifacts endlessly. Though any Machine Learning algorithm can be used in an Open-ended Learning setting, Evolutionary Algorithms are generally used to evolve both populations.

Co-evolution is a necessary part of an Open-ended Learning system. Co-evolution is an Evolutionary Algorithm that evolves more than one population together, set up in either a collaborative or a competitive way. In collaboration, the populations help each other to achieve a common goal, while in a competitive setting they compete with each other to achieve the goal. For the Open-ended Learning algorithm, a cooperative co-evolution named Minimal Criterion Co-evolution (MCC) is used [7].
MCC is a co-evolutionary algorithm that states that the problem may only evolve when the solution meets some minimum criterion. For instance, the work in which it was introduced co-evolved agents and environments: the environment (problem) could only evolve when one of the agents (solutions) from the population could solve it, and any newly created environment had to be solvable by at least one of the agents. This is also known as Problem-Solution Co-Evolution (PSCE) [22].

MCC led to the creation of theoretically true Open-ended Learning algorithms. By definition, an Open-ended Learning algorithm is one which can run endlessly while discovering stepping stones to novel artifacts without stagnating. The co-evolution of environments and agents is well suited to this definition, as environments can be evolved endlessly as long as agents keep solving them. Although any Machine Learning algorithm can be used in the Open-ended Learning paradigm, evolutionary algorithms are more commonly used and are best suited for it, as they inherently have a high degree of exploration [3].

2.8 POET and Enhanced POET

To explain our research, we first explore POET and Enhanced POET (EPOET) to make the baseline clear. POET focuses on evolving pairs of agents and environments in an attempt to create specialist agents for each particular environment. POET experiments on a variant of the 2D Bipedal Walker Hardcore environment from OpenAI Gym [8], where it uses a subset of obstacles within the environment to create child environments. The first environment is a basic flat surface and, as generations of evolution progress, the environments complexify. Table 2.1 lists all obstacles and how much each is able to mutate in a single generation. Figure 2.8 shows a possible terrain for the agent to solve.

TABLE 2.1: Environmental parameters for POET [115].

Obstacle Type    Stump Height   Gap Width      Step Number   Roughness
Initial Value    (0.0, 0.4)     (0.0, 0.8)     1             Uniform(0, 0.6)
Mutation Step    0.2            0.4            1             Uniform(0, 0.6)
Max Value        (5.0, 5.0)     (10.0, 10.0)   9             10.0

POET also transfers agents from one environment to another. Transfer of agents across environments is a crucial feature of POET, as it not only helps an agent that might be stagnating in a local optimum but also helps in creating agents that can perform better, to some extent, on other environments, which leads towards generalization of agents in POET. A transfer is attempted by first fine-tuning the candidate agent on the target environment and then comparing its fitness with that of the incumbent agent. If the candidate's fitness is higher, POET replaces the incumbent agent with the candidate agent.

FIGURE 2.8: A view of the 2D Bipedal Walker Hardcore Environment [115]. Possible obstacles are stumps, gaps, steps and roughness of the surface.

There is a preset reproduction eligibility condition for an environment to be selected into the list of possible parents from which child environments are created for the next generation. This reproduction eligibility ensures that the agent paired with the environment has solved it. Once the eligible environments are selected, parents are uniformly sampled from them and mutated, by the mutation steps in Table 2.1, on some or all of the available obstacles. The Minimal Criterion (MC) is then applied to filter out child environments that are too easy or too difficult. The environment which passes the MC and has the lowest difficulty score is selected.
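As a concrete illustration of this filtering step, the sketch below applies a minimal criterion to a set of candidate child environments and then picks the least difficult survivor. The score bounds, environment names and difficulty values are hypothetical placeholders for the purposes of the example, not the values used by POET or EPOET.

def minimal_criterion_filter(child_envs, mc_lower=50.0, mc_upper=300.0):
    # Keep child environments that are neither too easy nor too hard for the
    # agents that were evaluated on them; `child_envs` is a list of dicts with
    # a best-agent "score" and a "difficulty" value.
    eligible = [env for env in child_envs
                if mc_lower <= env["score"] <= mc_upper]
    if not eligible:
        return None  # no environment is created this iteration
    # Among the eligible children, prefer the least difficult one.
    return min(eligible, key=lambda env: env["difficulty"])

children = [
    {"name": "child_a", "score": 320.0, "difficulty": 0.2},  # too easy: fails the MC
    {"name": "child_b", "score": 120.0, "difficulty": 0.9},
    {"name": "child_c", "score": 75.0,  "difficulty": 0.6},
]
print(minimal_criterion_filter(children))  # -> the entry for child_c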
In summary, the three main goals of POET are to:

• generate new environments from active ones;
• optimize the paired agents in their respective environments; and
• attempt to transfer agents from one environment to another.

EPOET brings two algorithmic innovations, one innovation for producing more expressive environments, and one for generically measuring the extent to which the system continues to show open-endedness. The first algorithmic innovation concerns the environmental encoding (EE), which is the mapping of the various parameters of the environment that creates the environmental search space, and the environmental characterization (EC), which is the key attribute of the environment and helps in measuring the distance between environments. In POET, EC and EE are derived from the same set of static, hand-coded features, which makes them domain-dependent, limiting the search space and thus limiting the open-ended process. The first enhancement is therefore to generalize the EC. The authors introduced the Performance of All Transferred Agents EC (PATA-EC), which is motivated by novel distinctions among the agents in the system. The four steps for computing PATA-EC are: (1) evaluate all active and archived agents on the environment, which requires no extra computation as these evaluations are already calculated for the transfer mechanism; (2) clip the evaluations to a minimum and maximum range, since a very high evaluation means the agent is already competent and a very low one means it is an outright failure; (3) rank these clipped evaluations; and (4) normalize the rankings to the range -0.5 to 0.5, which allows the direct use of a Euclidean-distance metric between PATA-ECs. PATA-EC revolutionizes POET as it calculates environmental novelty independently of the domain. Figure 2.9 illustrates the procedure to calculate the PATA-EC score.

FIGURE 2.9: Figure illustrates the process to calculate the PATA-EC score [114]. First, all active and archived agents are evaluated on the created environment E. The scores of the evaluations are clipped between a min-max range. The scores are ranked and then normalized. Thus, for one environment we get a vector of the length of the number of active and archived agents.

The second algorithmic innovation was in the transfer mechanism, as POET's transfer mechanism had two issues: it was computationally expensive, because all transfers needed to be fine-tuned, and it was prone to false positives due to stochasticity in RL optimization. To counter these issues, the authors introduced a stricter threshold: the candidate agent's (the agent being transferred) score should exceed the incumbent agent's (the agent being replaced) five previous scores, which include both direct and fine-tuning scores. This smooths out noise in ES optimization and reduces the computational cost, as only those agents that have passed the transfer test are fine-tuned.

FIGURE 2.10: A CPPN generator (left) generating a level (right). [114]

The innovation for producing more expressive environments counters the issue that POET produced obstacles only from a fixed subset of the 2D Bipedal Walker Hardcore Environment, which limits the expressiveness of the environment generation algorithm. POET might produce many environments, but at some point of complexity the pressure towards novelty will stagnate, as there is no expressive algorithm generating the environments.
To tackle this, EPOET uses Compositional Pattern Producing Networks (CPPN) [97], a type of NN that takes in x coordinates as inputs and produces y coordinates as output, where x is the width of the terrain and y is the height. A CPPN uses many different types of activation functions that produce patterns. CPPNs are evolved through NEAT to make patterns that com- plexify procedurally as at first CPPN are initialized with basic architecture, i.e, only input and output neurons and gradually NEAT evolves and complexifies CPPN and thus complexifying the environment. Figure 2.10 shows a CPPN generator taking the x coordinate from a 2D environment, applying different activation func- tions, and producing the y coordinates. PATA-EC score then makes it possible to 23 measure diversity in the produced environments, thus the addition magnifies the open-endedness of POET. FIGURE 2.11: Example environment produced by EPOET. [114] Lastly, the authors introduced accumulated number of novel environments created and solved (ANNECS), a metric for an Open-ended Learning system that takes into account two factors to evaluate an environment: (1) must pass the minimal crite- rion against all agents, active and archived, generated over the entire run so far, and, (2) must eventually be solved by the system, to prevent the system from receiving credit for generating harder environments. If ANNECS gets higher scores as the run proceeds indicate that the system is consistently generating meaningful new environments. Algorithm 2 explains the overview of all steps required by EPOET to produce environments open-endedly. Figure 2.11 shows the difficulty and ex- pressiveness of an environment by having different micro and macro features. 24 High-level Algorithm 2: Overview of EPOET Algortithm 1. Create initial environment_agent (EA) pair with the flat environment and randomly initialized agent. 2. Optimize agent till it passes reproduction_threshold. Check if the EA_score lies in between minimal_criterion (mc). 3. Create environment through CPPN. Check novelty PATA-EC score. 4. Check if a transfer of agent is possible. 5. Archive oldest EA pair 6. Compute ANNECS. 7. Reiterate from step 2. 2.9 Related Work POET has led to an explosion of new use cases such as PINSKY [19] was intro- duced which extends the POET algorithm for game level generation and agents that solved those levels for 2D Atari games. Quessy and Richardson [81] use un- supervised skill discovery [9, 28, 92] in the context of POET to discover a large repertoire of useful skills. Meier and Mujika [62] also investigated unsupervised skill discovery through reward functions learned by neural networks. Other uses of POET include the work by Zhou and Vanschoren [121], who obtained diverse skills in a 3D locomotion task. POET has also been shown to aid in evolving robot morphologies [104] and avoiding premature convergence which is often the result when using handcrafted curricula. Norstein, Ellefsen, and Glette [70] use MAP- Elites [68] to open-endedly create a structured repertoire of various terrains and virtual creatures. Ølberg [72] looks into Enhanced POET and tries to find difficul- ties in the generated environment to reduce the computational cost. Adversarial approaches are commonly adopted when developing open-ended algorithms. Dennis et al. [18] propose PAIRED, a learning algorithm where an ad- versary would produce an environment based on the difference between the per- formance of an antagonist and a protagonist agent. 
Domain randomization [88], prioritized level replay [44] and Adversarially Compounding Complexity by Edit- ing Levels (ACCEL) [78] adopt a similar adversarial approach, where teacher agents 25 produce environments and student agents solve them. Several domains and benchmarks have been proposed with the aim of encour- aging research into open-ended, general agents. Team et al. [110] introduce the XLand environment, where a single agent is trained on 700k 3D games, includ- ing single and multi-agent games, resulting in zero-shot generalization on hold- out test environments. Later on, Team et al. [109] used an attention-based mem- ory architecture [17] to show an agent’s emergent capabilities on held-out tasks while training open-endedly. Barthet, Liapis, and Yannakakis [4] introduced an au- toencoder [12, 54] and CPPN-NEAT based open-ended evolutionary algorithm to evolve Minecraft [13, 55] buildings. They showed how differences in the training of the autoencoders can affect the evolution and generated structures. Fan et al. [29] create a Minecraft-based environment, MineDojo, which has numerous open- ended tasks. They also introduced MineCLIP as an effective language-conditioned reward function that plays the role of an automatic metric for generation tasks. Kepes, Guttenberg, and Soros [49] introduces ChemGrid, a benchmark for general game-playing agents based on artificial chemistry. The agents in ChemGrid can discover new molecular recipes by making or breaking existing recipes. Gan et al. [34] introduce the Open-ended Physics Environment (OPEn) to test learning repre- sentations and tested many RL-based agents. Their results indicate that agents that make use of unsupervised contrastive representation learning, and impact-driven learning for exploration, achieve the best result. Another related field is that of lifelong, or continual learning. Here, there is often only one agent, which is in an environment with multiple tasks [51, 57]. The agent can then continuously improve as it experiences new settings. A challenge here is how to transfer knowledge between different tasks while preventing catastrophic forgetting, i.e. where an agent performs worse on previously learned tasks after fine-tuning on new ones [74, 77]. In particular, Rusu et al. [87] introduce Progres- sive Neural Networks where, for each new task, the existing network is frozen and extra capacity is added, which can learn in this new task. This allows the model to leverage lateral connections and transfer information from previous tasks, while not diminishing the performance on previously learned tasks. Progressive NNs do not maintain a population of networks thus they only focus on increasing the ca- pacity of the network and not optimizing it. Other works attempt to keep the agent actions unchanged on these old tasks. Oswald et al. [74] do this by learning a model 26 that transforms a task identifier into a neural network. For new tasks, the loss in- centivizes the model to output similar weights for already learned task identifiers. Li and Hoiem [59] use a similar technique, where prior outputs from the model should not change significantly when learning a new task. We observe that related works have been implemented through a fixed topology Neural Network (NN). We hypothesize that agents that can augment their NNs will have more capacity to solve complex environments that are created in the stage of an Open-ended Learning system. 
We also hypothesize that these agents will have greater generalization capabilities through different transfer mechanisms. Therefore, we propose to extend Enhanced POET agents through NEAT, as explained in chapter 3.

Chapter 3

Research Methodology

This research aims to improve the Enhanced Paired Open-Ended Trailblazer (EPOET) algorithm by introducing the NeuroEvolution of Augmenting Topologies (NEAT) algorithm to continuously evolve the agents' Neural Network (NN) topology, so that we may have a true Open-ended Learning algorithm. This is motivated by EPOET's fixed-topology agents. This chapter explains, in detail, the methods we used to achieve our goals. We start off by explaining the existing algorithms used in our experimentation, followed by an explanation of our own methods.

3.1 Open-endedly evolving the topology of agents

Many of the approaches introduced in prior work use fixed-topology agents, which motivates us to explore NEAT and the benefits it brings to the Open-ended Learning framework. We first describe the whole process of our proposed method, Augmentative Topology Enhanced POET (ATEP), in subsection 3.1.1, followed by our experimental setup in subsection 3.1.2, where we describe how each component of our algorithm is set up.

3.1.1 Augmentative Topology Enhanced POET

In this section, we discuss all the building blocks of our algorithm and the different variants we experimented with. ATEP combines EPOET with NEAT to allow the agents' network topologies to evolve. This means that the algorithmic steps are very similar to those of EPOET, and the main differences are (1) the optimizer used: we use NEAT to optimize variable-topology agents, whereas EPOET used Evolution Strategies to optimize fixed-topology agents; and (2) the transfer mechanism, which is discussed later in this section. The detailed flow of ATEP is illustrated in Figure 3.1.

FIGURE 3.1: A flowchart demonstrating the flow of the ATEP framework, with blocks in green being where ATEP differs from POET. For both EPOET and ATEP, each environment is associated with an agent, represented by an ES population for EPOET and a NEAT population for ATEP. The environment images used in the chart were created by ATEP. Refer to Appendix A for pseudocode describing the transfer mechanisms used in ATEP.

We start off with a simple environment and initialize a NEAT population of agents with it; this is our first Environment-Agent (EA) pair. We then optimize the NEAT population to solve the environment. Once a preset reward threshold is crossed, we check the minimum criterion to reproduce a child environment. This minimum criterion checks whether the environment was too easy or too hard for the best agent in the population, as measured by the reward it attains. As we get more active environments, this minimal criterion keeps only those environments that are likely to produce solvable children. If the minimal criterion is not passed, we keep optimizing the NEAT population on the environment until it meets the criterion. We then use CPPNs to produce a set of child environments. The CPPNs are themselves evolved by NEAT, so the generated environments become progressively more difficult. The minimal criterion is also checked on these child environments, and if none of them pass it, no environment is created in that iteration.
The child environments that pass are then ranked for novelty: we compute their PATA-EC scores and apply the Euclidean distance to these scores to find the most novel environment. This environment is selected as the new environment, a new NEAT population is created for it, and the pair enters the optimization pipeline. If the number of active environments exceeds a set limit, we archive the oldest pair and update the ANNECS score: if the archived pair's best genome exceeded the reproduction threshold, we consider the environment solved and add one to the ANNECS score. Every N iterations, we attempt to transfer populations from one environment to another. We have created two types of transfer mechanisms: Fitness-Based Transfer and Species-Based Transfer.

Fitness-Based Transfer

Fitness-Based Transfer ATEP (FBT-ATEP) is motivated by the original EPOET transfer mechanism, adjusted to integrate NEAT. We first pick a candidate population (the one being transferred) and take its best genome GC, which is already computed as part of NEAT's evolution. We then take the target population's 5 previous best genomes and compare them with GC. If GC is better than all 5, we fine-tune GC on the target environment for M generations and again compare it against the best of the target population's 5 previous best genomes. If it is still better, we transfer the whole candidate population, replacing the target population. Algorithm 3 shows the pseudocode for FBT-ATEP.

Algorithm 3: Fitness-Based Transfer
Input: Candidate population's best individual I, and a function Score(.) that returns the maximum of the target agent's 5 most recent fitness scores.
    Let M = All environments − {Candidate environment}
    foreach m in M do
        Compute direct transfer score I_D
        if I_D > Score(m) then
            Compute fine-tuning transfer score I_P
            if I_P > Score(m) then
                Add m to T_candidates
            else
                Transfer not possible
        else
            Transfer not possible
    end
    Delete the whole population of each environment in T_candidates
    Transfer the whole candidate population to each environment in T_candidates

Species-Based Transfer

For the second transfer mechanism, we use the speciation inherent in NEAT to drive transfer. Specifically, we check whether the best genome in the candidate population is within a threshold δ (using the speciation calculation in Equation 2.2) of any target environment's best genome, where the best genome is already computed during NEAT evolution. If this is the case, we transfer the candidate species and replace the corresponding target species with it. This approach, called Species-Based Transfer ATEP (SBT-ATEP), skips the step of comparing fitness scores and has its own advantages, which we discuss in the next section. Algorithm 4 shows the pseudocode for SBT-ATEP. Finally, we also consider random transfer (RT-ATEP) and no transfer (NT-ATEP) to investigate whether the transfer mechanisms have a large impact on the results.

Algorithm 4: Species-Based Transfer
Input: Candidate population's best individual I_c, a function find_delta(.) that calculates the compatibility distance, and a threshold δ_threshold.
    Let M = All environments − {Candidate environment}
    foreach m ∈ M do
        I_m = best individual of environment m
        δ_ct = find_delta(I_c, I_m) using Equation 2.2
        if δ_ct ≤ δ_threshold then
            Delete the target species
            Transfer the candidate species to the target population
        else
            Transfer not possible
    end
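To complement the pseudocode, the following is a minimal Python sketch of the two eligibility checks. The object attributes and helper names (recent_best_scores, environment.evaluate, fine_tune, find_delta) are illustrative assumptions used only to make the logic concrete; they are not the names used in our implementation.

    # Illustrative sketches of the two transfer-eligibility checks; names are placeholders.

    def fbt_eligible(candidate_best, target, fine_tune, generations):
        # Fitness-Based Transfer (Algorithm 3): the candidate's best genome must beat
        # the maximum of the target population's 5 most recent best scores, both
        # directly and after fine-tuning on the target environment for a few generations.
        threshold = max(target.recent_best_scores[-5:])
        if target.environment.evaluate(candidate_best) <= threshold:
            return False
        tuned = fine_tune(candidate_best, target.environment, generations)
        return target.environment.evaluate(tuned) > threshold

    def sbt_eligible(candidate_best, target_best, find_delta, delta_threshold):
        # Species-Based Transfer (Algorithm 4): transfer the candidate species if its
        # best genome lies within the NEAT compatibility distance (Equation 2.2)
        # of the target population's best genome.
        return find_delta(candidate_best, target_best) <= delta_threshold

When the check succeeds, FBT-ATEP replaces the entire target population with the candidate population, whereas SBT-ATEP replaces only the matching target species, leaving the rest of the target population intact.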
3.1.2 Experimental setup

Environments

We now describe the experimental setup for ATEP, its variants, and our baselines. POET and EPOET have exhaustively shown that they are capable of producing extremely challenging environments [114, 115]; our experimental setup is therefore aimed at solving complex environments with the least possible stagnation. We use a variant of OpenAI Gym's [8] 2D Bipedal Walker Hardcore environment, as used by EPOET. The reason given by Enhanced POET's authors for using this environment is its simplicity as a 2D walking domain combined with the possibility of creating many terrains, which also motivates our choice. As in EPOET, we consider ground roughness as the only parameter to mutate, as it creates much diversity and, again, extremely challenging environments. CPPNs are used to mutate the ground roughness parameters and are themselves evolved through NEAT. A CPPN node may have one of the following activation functions: identity, sin, sigmoid, square, and tanh; the default activation function is identity. All our variants and baselines share the same CPPN hyperparameter settings. Refer to Table A.4 for detailed hyperparameter settings.

FIGURE 3.2: A two-legged agent with a hull from OpenAI Gym's 2D Bipedal Walker environments [8].

Agent

The agent's body has a hull supported by two legs, as illustrated in Figure 3.2. Two motor joints control the hips and knees of each leg, which creates a 4-dimensional action space. The agent has 10 LIDAR rangefinders for perceiving the terrain, and their measurements are included in the state space. Another 14 state variables, including joint positions and joint angular velocities, horizontal and vertical speeds, hull angle, hull angular velocity, and whether the legs touch the ground, are included to make a 24-dimensional observation space [8].

The agent is required to navigate from left to right, within a time limit and without falling over, on generated terrain with different ground roughness parameters. As described in Equation 3.1, the reward is given per step for moving forward. We place no constraints on the agent's behaviour other than that it needs to keep its hull straight and minimize the applied torque.

    Reward =
        −100,                                                         if the agent falls
        130 · ∆x − 5 · ∆hull_angle − 0.00035 · applied_torque,        otherwise        (3.1)

The time limit for an episode is 2000 time steps. The episode terminates if this time limit is reached, the agent falls, or the agent completes the course. In our experiments, an environment is considered solved when the agent reaches the far end of the environment and obtains a minimum score of 200. In our observations, backed by the EPOET experiments, a reward of 200 is enough to ensure that the agent is efficient.
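To make the reward bookkeeping explicit, the following minimal sketch mirrors Equation 3.1 and the solved criterion described above. Variable names are illustrative; the environment computes these quantities internally.

    # Per-step reward implied by Equation 3.1 and the solved criterion (illustrative sketch).

    def step_reward(delta_x, delta_hull_angle, applied_torque, fell):
        # delta_x:          forward progress made during the step
        # delta_hull_angle: change in hull angle (penalizes not keeping the hull straight)
        # applied_torque:   torque applied by the motors during the step
        # fell:             whether the agent's hull touched the ground
        if fell:
            return -100.0
        return 130.0 * delta_x - 5.0 * delta_hull_angle - 0.00035 * applied_torque

    def is_solved(episode_return, reached_far_end):
        # An environment counts as solved when the agent reaches the far end
        # of the terrain and accumulates a return of at least 200.
        return reached_far_end and episode_return >= 200.0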
We use NEAT as the algorithm to augment the topology of our agents, with a population size of 1000, which our experiments showed is sufficient to produce good results. Each individual starts with a simple structure of 24 inputs (the observation space) and 4 outputs (the action space). All nodes use tanh as the activation function. We perform crossover between individuals with a probability of 0.3, mutate connections with a probability of 0.85, and add nodes with a probability of 0.15. For detailed NEAT hyperparameter settings, refer to Table A.3. These settings are the same for all our variants.

EPOET Framework

For the EPOET framework, we check whether a transfer is possible every 25 iterations and try to create a new environment every 150 iterations. The reproduction threshold is set to 200 rewards and the minimal criterion is set to the range of 50 to 340 rewards. We archive environment-agent pairs only once we have 20 active environments. All these settings are fixed across all ATEP variants and EPOET baselines. Refer to Table A.2 for detailed hyperparameter settings for the general EPOET framework.

ATEP Variants and Baselines

For our baselines, we first take the original controller, which has 2 hidden layers of 40 nodes each, and call it EPOET40x40. For the second baseline, we shrink the original controller to 20 nodes in each layer and call it EPOET20x20. EPOET20x20 is created to investigate our hypothesis at different sizes of fixed-topology agents: we state that a fixed-topology agent will cease to learn at some level of complexity, so comparing different fixed-topology controller sizes helps us test this hypothesis. Refer to Table A.2 for detailed hyperparameter settings of the Evolution Strategies used for our baseline controllers. For our variants, we have Species-Based Transfer ATEP and Fitness-Based Transfer ATEP (SBT- and FBT-ATEP), discussed earlier, and we add two more transfer-based variants, Random Transfer ATEP (RT-ATEP) and No Transfer ATEP (NT-ATEP), to show that the transfer mechanism is an important part of an Open-ended Learning algorithm. For RT-ATEP, if we have two EA pairs we transfer between both of them; if we have more, we randomly take 3 EA pairs, transfer them to 3 random EA pairs, and replace the populations.

Hyperparameter Selection

All hyperparameters belonging to EPOET were inherited from the EPOET research, except the number of active environments, which was reduced to 5 because of computational restrictions. All NEAT-related hyperparameters were selected after extensive experimentation, keeping the best-performing values.
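For reference, the main settings described in this chapter can be summarized in a single configuration object. The dictionary below is an illustrative sketch for readability only; the key names are ours and the authoritative values are those in the tables of Appendix A.

    # Illustrative summary of the settings described above (not our actual configuration files).

    ATEP_SETTINGS = {
        "epoet": {
            "transfer_interval": 25,        # iterations between transfer attempts
            "reproduction_interval": 150,   # iterations between attempts to create a new environment
            "reproduction_threshold": 200,  # reward needed before a pair may reproduce
            "minimal_criterion": (50, 340), # reward band for the minimal criterion
        },
        "neat_agent": {
            "population_size": 1000,
            "num_inputs": 24,
            "num_outputs": 4,
            "activation": "tanh",
            "crossover_prob": 0.3,
            "conn_mutation_prob": 0.85,
            "add_node_prob": 0.15,
        },
        "cppn": {
            "activations": ["identity", "sin", "sigmoid", "square", "tanh"],
            "default_activation": "identity",
        },
        "baselines": {
            "EPOET40x40": {"hidden_layers": (40, 40)},
            "EPOET20x20": {"hidden_layers": (20, 20)},
        },
    }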
Chapter 4

Results and Discussion

In this chapter, we discuss and analyze our results. We break the results into 3 categories: open-endedness, nodes complexity exploration, and generalization ability. All results are gathered over 5 seeds due to the expensive computational load, with each algorithm requiring approximately 50,000 to 200,000 CPU hours for a single run. EPOET20x20 required the least amount of computation, while SBT-ATEP required the most. Each algorithm was run in parallel on a cluster consisting of 264 Intel Xeon cores, with the runtime ranging between 10 and 30 days. Each algorithm had a budget of at most 10^12 function evaluations per run, where one function evaluation corresponds to a single episode of at most 2000 timesteps.

FIGURE 4.1: Accumulated Number of Novel Environments Created and Solved (ANNECS). To save compute, we stop the NT- and RT-ATEP experiments early, as it is clear that they perform poorly. The error bars are the standard deviation, the line represents the mean, and both are calculated over 5 runs.

4.1 Open-endedness

As mentioned earlier, Wang et al. [114] introduced the ANNECS metric to capture the open-endedness of an algorithm; we take it as our most important score for judging which algorithm performs better on complex environments. Figure 4.1 shows the ANNECS score as a function of training time. We see that there is a significant difference between EPOET20x20 and FBT- and SBT-ATEP, indicating that the small network results in solving fewer environments. EPOET40x40 performs substantially better than EPOET20x20 and is competitive with ATEP early on during training. The rate of increase in ANNECS, however, decreases after about 13k iterations, whereas ATEP keeps increasing at a consistent rate. This substantiates our hypothesis that fixed-topology agents start stagnating at some level of environmental complexity, due to capacity issues. While we can improve the results by increasing the size of the network, that merely delays the onset of stagnation.

FBT-ATEP outperforms EPOET40x40, although it also slows down slightly as time progresses. This is due to replacing the entire target population with the transferred population, which may eliminate all useful skills learned by the target population. SBT-ATEP, on the other hand, only replaces a single species that is close to the candidate species, leaving the rest of the population intact. We also find that SBT-ATEP has negligible delays in solving environments and, even though it performed similarly to FBT-ATEP and EPOET40x40 early on during training, it starts to outperform these in the second half of the experiment. This, as we show later, is partly due to SBT-ATEP exploring more actions. We further note that the variants using no transfer (NT-ATEP) or random transfer (RT-ATEP) perform poorly, indicating that intelligent transfer mechanisms are necessary.

Although ATEP outperforms EPOET, it is more computationally expensive, as measured by the number of function evaluations, where one function evaluation means one individual being evaluated on one environment. SBT-ATEP has the most function evaluations: once a species transfers from one population to another, it becomes highly probable that it can transfer in the opposite direction as well, because it may now be within the δ_threshold range. This increases the population size, resulting in more function evaluations. The tradeoff here is function evaluations for performance, which is justified as the performance confirms our hypothesis. Figure 4.2 displays the total number of function evaluations.

FIGURE 4.2: Cumulative sum of the number of function evaluations, with the Y-axis converted to a log scale. While ATEP requires significantly more function evaluations, we find that its total wall-time is only 3 times more than EPOET, as the neural networks are generally smaller and each evaluation does not take as long.
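Since ANNECS drives the comparisons in this chapter, we note that the scores reported in Figure 4.1 follow directly from the archiving rule described in subsection 3.1.1. The following minimal sketch shows that bookkeeping; the class and attribute names are illustrative, not part of our implementation.

    # Minimal sketch of ANNECS bookkeeping, following subsection 3.1.1; names are placeholders.

    class AnnecsCounter:
        def __init__(self, reproduction_threshold=200.0):
            self.reproduction_threshold = reproduction_threshold
            self.annecs = 0

        def on_archive(self, best_fitness_of_archived_pair):
            # Called when the oldest environment-agent pair is archived. The environment
            # counts as created-and-solved if its paired population's best genome
            # exceeded the reproduction threshold.
            if best_fitness_of_archived_pair > self.reproduction_threshold:
                self.annecs += 1
            return self.annecs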
4.2 Nodes complexity exploration

We have now shown that SBT-ATEP outperforms all of the other tested methods based on the ANNECS score. We also find that it generally uses a smaller neural network, with fewer nodes, than the other algorithms. Figure 4.3 shows the number of nodes and the corresponding fitness value for each algorithm. We can see that SBT-ATEP generally has high fitness but fewer nodes than the other approaches. This is echoed in Figure 4.4c, where SBT-ATEP has the lowest number of nodes for most of the experiment, although it gradually adds nodes and complexity. FBT-ATEP, on the other hand, adds nodes very rapidly. This again indicates that the transfer mechanism is critical.

FIGURE 4.3: Mapping of fitness to nodes. We plot values every 1000 iterations, starting at iteration 150. Each dot represents a specific iteration, showing the mean fitness over all environments and the mean number of nodes in the population. The distribution on top represents the distribution of nodes, while the distribution on the left represents the fitness distribution.

Inspired by this, we further look into a simple Fitness to Nodes Ratio (FNR) metric, shown in Figure 4.4a, and find that SBT-ATEP outperforms all other algorithms on this metric for the majority of the run. This indicates that SBT-ATEP outperforms all other algorithms on a per-environment basis while using fewer nodes, and leads us to believe that a better-curated transfer mechanism, based on SBT-ATEP, would sustain the FNR over longer runs. Furthermore, in Figure 4.4b, we calculate an ANNECS to Nodes Ratio (ANR) metric with the intent of observing the role of nodes in the open-endedness of the agents, i.e. their ability to complexify over time. We observe that SBT-ATEP performs significantly better than the other models. FBT-ATEP has the lowest ANR, as it adds nodes much faster than the rate of increase in ANNECS.

FIGURE 4.4: Analysis with respect to the number of nodes. Figure (a) shows FNR over iterations, (b) shows ANR over iterations, and (c) shows the addition of nodes over iterations.

Lastly, we explore how the evolved network's complexity emerges by looking at the network itself. Figures 4.5 and 4.6 illustrate the evolution of the best individual from a later-stage environment. Yellow nodes represent inputs, aqua represents hidden nodes, and white represents outputs. Figure 4.5a shows the initial NN before any evolution operator is applied. Figure 4.5b is the state after 2000 iterations: it has a few nodes in the first hidden layer and a couple of nodes in the second, and complexity has not yet emerged. Figure 4.6a is the NN after 4000 iterations, where we can observe emerging complexity. Some residual connections can be seen; for example, node 5, an input node, connects to hidden node 30 but also connects directly to output node 26. Figure 4.6b is the evolved network that became the best individual in the population 6000 iterations after it was created. Here complexity is observed on many levels. Firstly, input node 2 connects to input node 0, while node 46 appears to feed forward into inputs 2 and 0. Secondly, we observe an interesting small block of 6 nodes that is fully connected in a conventional layered fashion: inputs 22 and 23 connect to nodes 40 and 42, which further connect to nodes 41 and 43, which in turn connect to the outputs. The overall depth of the network is variable.

FIGURE 4.5: From left to right: NNs at iterations 0 and 2000 after the population was created.

FIGURE 4.6: From left to right: NNs at iterations 4000 and 6000 after the population was created.
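The ratio metrics plotted in Figure 4.4 are straightforward to compute from per-iteration logs. The sketch below assumes a log containing the mean fitness, mean node count, and ANNECS value at each logged iteration; the field names are illustrative.

    # Illustrative computation of the FNR and ANR curves from per-iteration logs.

    def fitness_to_nodes_ratio(mean_fitness, mean_nodes):
        return mean_fitness / mean_nodes

    def annecs_to_nodes_ratio(annecs, mean_nodes):
        return annecs / mean_nodes

    def ratio_curves(log):
        # log: list of dicts with keys "mean_fitness", "mean_nodes", "annecs" (assumed names)
        fnr = [fitness_to_nodes_ratio(e["mean_fitness"], e["mean_nodes"]) for e in log]
        anr = [annecs_to_nodes_ratio(e["annecs"], e["mean_nodes"]) for e in log]
        return fnr, anr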
4.3 Generalization Ability

We next evaluate the generalization ability of our open-ended agents, as Team et al. [110] have shown that such agents have the potential to generalize to new, unseen environments. To test this concretely, we first take the 20 latest environments from each method. For each environment, we take the latest agent from the method under consideration that could solve it. Each of these agents is then evaluated on the selected environments from the other methods (60 in total). We perform 30 runs per environment-agent pair and calculate the mean and maximum of the rewards. We split the results into three categories: environments with fitness scores above 300, between 200 and 300, and below 200, where scores below 200 indicate that the environment has not been solved by the agent.
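A minimal sketch of this cross-evaluation protocol is given below. The helper rollout and the per-environment bookkeeping are illustrative assumptions used only to make the procedure explicit.

    # Illustrative sketch of the cross-evaluation protocol: each late-stage agent is
    # evaluated on the 20 latest environments of every other method (60 in total),
    # for 30 episodes per environment-agent pair.

    import statistics

    def categorize(score):
        if score > 300:
            return "above 300"
        if score >= 200:
            return "200 to 300"
        return "below 200"   # counts as unsolved

    def cross_evaluate(agent, environments, rollout, episodes=30):
        # rollout(agent, env) -> episodic return for a single run (assumed helper)
        results = []
        for env in environments:
            returns = [rollout(agent, env) for _ in range(episodes)]
            results.append({
                "mean_category": categorize(statistics.mean(returns)),
                "max_category": categorize(max(returns)),
            })
        return results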
Figure 4.7b shows the performance of each method when evaluated on the 60 other environments. We observe that SBT-ATEP outperforms all other models, with only 10% of the environments remaining unsolved when we take the mean score. When we look at the maximum score, SBT-ATEP solves every environment at least once in the 30 runs. This may be partly due to stochastic behaviour, but it also points towards powerful generalization capabilities given the right transfer mechanism.

FIGURE 4.7: Figure (A) shows each algorithm being tested on the 20 latest environments created by all other algorithms, i.e., each algorithm is evaluated on 60 environments. The Y-axis shows the percentage of environments in each category. Each test is conducted for 30 runs and the mean scores are taken. Figure (B) shows the maximum score of the 30 runs.

Secondly, we test the generalization capabilities of agents on all of the environments created by their own algorithms. We exclude EPOET20x20, as it fails to solve 80 environments in the whole run. We take the 80 environments solved by each model itself and observe how each agent performs on all of them. Figures 4.8a, 4.8b and 4.8c show the results. Early-stage agents perform worse, while late-stage agents show generalization abilities on previously unseen landscapes. The transfer mechanism plays a key role in this generalization, as it exposes agents to more environments. Despite not having seen all environments, late-stage agents generalize much better. SBT-ATEP generalizes best, with the lowest proportion of unsolved environments, in contrast to the lower-performing EPOET40x40 and FBT-ATEP.

FIGURE 4.8: Figures showing generalization capabilities. Figures (a), (b) and (c) show the agents of 80 solved environments being tested on all 80 environments, for EPOET40x40, FBT-ATEP, and SBT-ATEP respectively. Note that EPOET20x20 does not take part in this test as it failed to produce 80 environments in the run.

Finally, we briefly investigate potential reasons why SBT-ATEP outperforms FBT-ATEP and EPOET40x40. We find that SBT-ATEP explores more actions, as it only transfers a single species instead of replacing the entire target population as done by FBT-ATEP. This allows the new species to complement the actions already explored by the existing population. Figure 4.9 shows the distribution of each action for SBT-ATEP, FBT-ATEP, and EPOET40x40.

FIGURE 4.9: Action distributions for (top row) SBT-ATEP, (middle row) FBT-ATEP, and (bottom row) EPOET40x40. Each column represents one specific dimension of the action array.

Chapter 5

Conclusion And Future Directions

Our research investigated the effect of an augmentative topology agent on the performance of an Open-ended Learning algorithm. Firstly, we asked: can augmentative topology agents learn endlessly through the framework of Enhanced POET (EPOET)? We showed that this is indeed the case by introducing Augmentative Topology Enhanced POET (ATEP); comparing it with EPOET, we found that our method outperforms EPOET on open-endedness as measured by the Accumulated Number of Novel Environments Created and Solved (ANNECS) score. Secondly, we asked: can we create a different transfer mechanism that aids open-endedness and produces more generalized agents? We answered this question by experimenting with two transfer mechanisms for ATEP: Fitness-Based Transfer ATEP (FBT-ATEP) and Species-Based Transfer ATEP (SBT-ATEP). FBT-ATEP, inspired by EPOET's transfer mechanism, slightly outperforms EPOET40x40 on open-endedness, which answers our first question, and performs as well as EPOET in the generalization tests. Our second question is answered by SBT-ATEP, a novel transfer mechanism inspired by the NeuroEvolution of Augmenting Topologies (NEAT) speciation mechanism, which outperforms FBT-ATEP and EPOET on both open-endedness and generalization tests while using fewer nodes in the neural networks, thus answering both of our questions and supporting our hypothesis.

Our approach, however, does require more function evaluations than competing approaches. Thus, a promising future direction would be to use NEAT with Novelty Search [56] or Surprise Search, which tends to converge faster than plain NEAT [38]. Quality-Diversity algorithms may also be worthwhile to explore in the context of Open-ended Learning, as they have the ability to produce a population of high-performing and diverse individuals [69]. Exploring Neurogenesis [23, 61], where neurons are added to a single neural network based on various external triggers, could also be a promising direction. To reduce the computational load, it would also be promising to look into developing single-population Open-ended Learning methods without losing the exploration abilities of EPOET.

Furthermore, we have opened up possible future research into transfer mechanisms. We compared simple approaches such as FBT and SBT, but more advanced approaches could yield further performance improvements. For instance, we could combine FBT and SBT in a weighted manner, or transfer only a certain percentage of a species or population. Finally, this work provides a starting point, like EPOET itself, for Open-ended Learning with augmentative topology agents. We therefore used the simple 2D BipedalWalker as our benchmark. Future work should compare ATEP with standard EPOET in different and more complex environments.
Ultimately, we hope that this new approach furthers research into open-ended algorithms that do not slow down over time and can keep up with an ever-changing environment.

Bibliography

[1] Abbas Abdolmaleki et al. “Relative entropy regularized policy iteration”. In: arXiv preprint arXiv:1812.02256 (2018).
[2] Thomas Back. Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, 1996.
[3] Thomas Bäck and Hans-Paul Schwefel. “An overview of evolutionary algorithms for parameter optimization”. In: Evolutionary Computation 1.1 (1993), pp. 1–23.
[4] Matthew Barthet, Antonios Liapis, and Georgios N Yannakakis. “Open-Ended Evolution for Minecraft Building Generation”. In: IEEE Transactions on Games (2022).
[5] Varun Bhatt et al. “Deep surrogate assisted generation of environments”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 37762–37777.
[6] Eric Bonabeau et al. Swarm intelligence: from natural to artificial systems. 1. Oxford University Press, 1999.
[7] Jonathan C Brant and Kenneth O Stanley. “Minimal criterion coevolution: a new approach to open-ended search”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2017, pp. 67–74.
[8] Greg Brockman et al. “OpenAI Gym”. In: arXiv preprint arXiv:1606.01540 (2016).
[9] Victor Campos et al. “Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills”. In: Proceedings of the 37th International Conference on Machine Learning. Vol. 119. 2020, pp. 1317–1327.
[10] Rich Caruana and Alexandru Niculescu-Mizil. “An empirical comparison of supervised learning algorithms”. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, pp. 161–168.
[11] M Emre Celebi and Kemal Aydin. Unsupervised learning algorithms. Vol. 9. Springer, 2016.
[12] Xi Chen et al. “Variational lossy autoencoder”. In: arXiv preprint arXiv:1611.02731 (2016).
[13] Maria Cipollone, Catherine C Schifter, and Rick A Moffat. “Minecraft as a creative tool: A case study”. In: International Journal of Game-Based Learning (IJGBL) 4.2 (2014), pp. 1–14.
[14] Maximilien Le Clei and Pierre Bellec. “Neuroevolution of recurrent architectures on control tasks”. In: Genetic and Evolutionary Computation Conference, Companion Volume. Ed. by Jonathan E. Fieldsend and Markus Wagner. 2022, pp. 651–654.
[15] Cédric Colas et al. “Scaling map-elites to deep neuroevolution”. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020, pp. 67–75.
[16] Antoine Cully et al. “Robots that can adapt like animals”. In: Nature 521.7553 (2015), pp. 503–507.
[17] Zihang Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”. In: arXiv preprint arXiv:1901.02860 (2019).
[18] Michael Dennis et al. “Emergent complexity and zero-shot transfer via unsupervised environment design”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 13049–13061.
[19] Aaron Dharna, Julian Togelius, and Lisa B Soros. “Co-generation of game levels and game-playing agents”. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. Vol. 16. 1. 2020, pp. 203–209.
[20] Tansel Dokeroglu et al. “A survey on new generation metaheuristic algorithms”. In: Computers & Industrial Engineering 137 (2019), p. 106040.
[21] Marco Dorigo, Mauro Birattari, and Thomas Stutzle. “Ant colony optimization”. In: IEEE Computational Intelligence Magazine 1.4 (2006), pp. 28–39.
[22] Kees Dorst and Nigel Cross. “Creativity in the design process: co-evolution of problem–solution”. In: Design Studies 22.5 (2001), pp. 425–437.
[23] Timothy J Draelos et al. “Neurogenesis deep learning: Extending deep networks to accommodate new classes”. In: 2017 International Joint Conference on Neural Networks (IJCNN). IEEE. 2017, pp. 526–533.
[24] David B D’Ambrosio, Jason Gauci, and Kenneth O Stanley. “HyperNEAT: The first five years”. In: Growing Adaptive Machines: Combining Development and Learning in Artificial Neural Networks (2014), pp. 159–185.
[25] Sam Earle et al. “Illuminating diverse neural cellular automata for level generation”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2022, pp. 68–76.
[26] Adrien Ecoffet et al. “First return, then explore”. In: Nature 590.7847 (2021), pp. 580–586.
[27] A W F Edwards. “The Genetical Theory of Natural Selection”. In: Genetics 154.4 (Apr. 2000), pp. 1419–1426. ISSN: 1943-2631. DOI: 10.1093/genetics/154.4.1419. eprint: https://academic.oup.com/genetics/article-pdf/154/4/1419/42025776/genetics1419.pdf. URL: https://doi.org/10.1093/genetics/154.4.1419.
[28] Benjamin Eysenbach et al. “Diversity is All You Need: Learning Skills without a Reward Function”. In: International Conference on Learning Representations. 2019.
[29] Linxi Fan et al. “MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge”. In: arXiv preprint arXiv:2206.08853 (2022).
[30] Alhussein Fawzi et al. “Discovering faster matrix multiplication algorithms with reinforcement learning”. In: Nature 610.7930 (2022), pp. 47–53.
[31] Sevan G Ficici and Jordan B Pollack. “Challenges in Coevolutionary Learning: Arms-Race Dynamics”. In: Artificial Life VI: Proceedings of the Sixth International Conference on Artificial Life. Vol. 6. MIT Press. 1998, p. 238.
[32] Vincent François-Lavet et al. “An introduction to deep reinforcement learning”. In: Foundations and Trends® in Machine Learning 11.3-4 (2018), pp. 219–354.
[33] Edgar Galván and Peter Mooney. “Neuroevolution in deep neural networks: Current trends and future challenges”. In: IEEE Transactions on Artificial Intelligence 2.6 (2021), pp. 476–493.
[34] Chuang Gan et al. “OPEn: An Open-ended Physics Environment for Learning Without a Task”. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2021, pp. 5878–5885.
[35] Fred Glover and Manuel Laguna. “Tabu search”. In: Handbook of Combinatorial Optimization. Springer, 1998, pp. 2093–2229.
[36] Faustino J Gomez, Risto Miikkulainen, et al. “Solving non-Markovian control tasks with neuroevolution”. In: IJCAI. Vol. 99. 1999, pp. 1356–1361.
[37] Scott R Granter, Andrew H Beck, and David J Papke Jr. “AlphaGo, deep learning, and the future of the human microscopist”. In: Archives of Pathology & Laboratory Medicine 141.5 (2017), pp. 619–621.
[38] Daniele Gravina, Antonios Liapis, and Georgios Yannakakis. “Surprise search: Beyond objectives and novelty”. In: Proceedings of the Genetic and Evolutionary Computation Conference 2016. 2016, pp. 677–684.
[39] Frederic Gruau, Darrell Whitley, and Larry Pyeatt. “A comparison between cellular encoding and direct encoding for genetic neural networks”. In: Proceedings of the 1st Annual Conference on Genetic Programming. 1996, pp. 81–89.
[40] Danijar Hafner et al. “Mastering Diverse Domains through World Models”. In: arXiv preprint arXiv:2301.04104 (2023).
[41] John H Holland. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, 1992.
[42] John H Holland. “Genetic algorithms”. In: Scientific American 267.1 (1992), pp. 66–73.
[43] John J Hopfield. “Artificial neural networks”. In: IEEE Circuits and Devices Magazine 4.5 (1988), pp. 3–10.
[44] Minqi Jiang, Edward Grefenstette, and Tim Rocktäschel. “Prioritized level replay”. In: International Conference on Machine Learning. PMLR. 2021, pp. 4940–4950.
[45] Peiyuan Jiang et al. “A Review of Yolo algorithm developments”. In: Procedia Computer Science 199 (2022), pp. 1066–1073.
[46] John Jumper et al. “AlphaFold 2”. In: Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book) (2020).
[47] Sourabh Katoch, Sumit Singh Chauhan, and Vijay Kumar. “A review on genetic algorithm: past, present, and future”. In: Multimedia Tools and Applications 80.5 (2021), pp. 8091–8126.
[48] James Kennedy and Russell Eberhart. “Particle swarm optimization”. In: Proceedings of ICNN’95 – International Conference on Neural Networks. Vol. 4. IEEE. 1995, pp. 1942–1948.
[49] Miklos Kepes, Nicholas Guttenberg, and LB Soros. “ChemGrid: An Open-Ended Benchmark Domain for an Open-Ended Learner”. In: 2022 IEEE Conference on Games (CoG). IEEE. 2022, pp. 221–228.
[50] Ahmed Khalifa et al. “Talakat: Bullet hell generation through constrained map-elites”. In: Proceedings of The Genetic and Evolutionary Computation Conference. 2018, pp. 1047–1054.
[51] Khimya Khetarpal et al. “Towards Continual Reinforcement Learning: A Review and Perspectives”. In: CoRR abs/2012.13490 (2020). arXiv: 2012.13490. URL: https://arxiv.org/abs/2012.13490.
[52] B Ravi Kiran et al. “Deep reinforcement learning for autonomous driving: A survey”. In: IEEE Transactions on Intelligent Transportation Systems 23.6 (2021), pp. 4909–4926.
[53] Vijay Kumar, Jitender Kumar Chhabra, and Dinesh Kumar. “Parameter adaptive harmony search algorithm for unimodal and multimodal optimization problems”. In: Journal of Computational Science 5.2 (2014), pp. 144–155.
[54] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. “Grammar variational autoencoder”. In: International Conference on Machine Learning. 2017, pp. 1945–1954.
[55] H Chad Lane and Sherry Yi. “Playing with virtual blocks: Minecraft as a learning environment for practice and research”. In: Cognitive Development in Digital Contexts. Elsevier, 2017, pp. 145–166.
[56] Joel Lehman, Kenneth O Stanley, et al. “Exploiting open-endedness to solve problems through the search for novelty”. In: ALIFE. Citeseer. 2008, pp. 329–336.
[57] Timothée Lesort et al. “Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges”. In: Information Fusion 58 (2020), pp. 52–68. DOI: 10.1016/j.inffus.2019.12.004. URL: https://doi.org/10.1016/j.inffus.2019.12.004.
[58] Shutao Li et al. “Deep learning for hyperspectral image classification: An overview”. In: IEEE Transactions on Geoscience and Remote Sensing 57.9 (2019), pp. 6690–6709.
[59] Zhizhong Li and Derek Hoiem. “Learning Without Forgetting”. In: Lecture Notes in Computer Science 9908 (2016). Ed. by Bastian Leibe et al., pp. 614–629. DOI: 10.1007/978-3-319-46493-0_37. URL: https://doi.org/10.1007/978-3-319-46493-0_37.
[60] Antonios Liapis, Georgios N Yannakakis, and Julian Togelius. “Neuroevolutionary constrained optimization for content creation”. In: 2011 IEEE Conference on Computational Intelligence and Games (CIG’11). IEEE. 2011, pp. 71–78.
[61] Kaitlin Maile et al. “When, where, and how to add new neurons to ANNs”. In: International Conference on Automated Machine Learning. PMLR. 2022, pp. 18–1.
[62] Robert Meier and Asier Mujika. “Open-Ended Reinforcement Learning with Neural Reward Functions”. In: arXiv preprint arXiv:2202.08266 (2022).
[63] Zbigniew Michalewicz and Marc Schoenauer. “Evolutionary algorithms for constrained parameter optimization problems”. In: Evolutionary Computation 4.1 (1996), pp. 1–32.
[64] Tom M Mitchell. Machine learning. Vol. 1. 9. McGraw-Hill, New York, 1997.
[65] David J Montana, Lawrence Davis, et al. “Training feedforward neural networks using genetic algorithms”. In: IJCAI. Vol. 89. 1989, pp. 762–767.
[66] David E Moriarty and Risto Miikkulainen. “Forming neural networks through efficient and adaptive coevolution”. In: Evolutionary Computation 5.4 (1997), pp. 373–399.
[67] Hirotaka Moriguchi and Shinichi Honiden. “CMA-TWEANN: efficient optimization of neural networks via self-adaptation and seamless augmentation”. In: Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation. 2012, pp. 903–910.
[68] Jean-Baptiste Mouret and Jeff Clune. “Illuminating search spaces by mapping elites”. In: arXiv preprint arXiv:1504.04909 (2015).
[69] Olle Nilsson and Antoine Cully. “Policy gradient assisted map-elites”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2021, pp. 866–875.
[70] Emma Stensby Norstein, Kai Olav Ellefsen, and Kyrre Glette. “Open-Ended Search for Environments and Adapted Agents Using MAP-Elites”. In: International Conference on the Applications of Evolutionary Computation (Part of EvoStar). Springer. 2022, pp. 651–666.
[71] Charles Ofria and Claus O Wilke. “Avida: A software platform for research in computational evolutionary biology”. In: Artificial Life 10.2 (2004), pp. 191–229.
[72] Eirik Solheim Ølberg. “CHILD: Predicting simulation outcome in open-ended co-evolution”. MA thesis. 2022.
[73] Keiron O’Shea and Ryan Nash. “An introduction to convolutional neural networks”. In: arXiv preprint arXiv:1511.08458 (2015).
[74] Johannes von Oswald et al. “Continual learning with hypernetworks”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL: https://openreview.net/forum?id=SJgwNerKvB.
[75] Norman Packard et al. “An overview of open-ended evolution: Editorial introduction to the open-ended evolution II special issue”. In: Artificial Life 25.2 (2019), pp. 93–103.
[76] Jan Paredis. “Coevolutionary computation”. In: Artificial Life 2.4 (1995), pp. 355–375.
[77] German Ignacio Parisi et al. “Continual lifelong learning with neural networks: A review”. In: Neural Networks 113 (2019), pp. 54–71. DOI: 10.1016/j.neunet.2019.01.012. URL: https://doi.org/10.1016/j.neunet.2019.01.012.
[78] Jack Parker-Holder et al. “Evolving curricula with regret-based environment design”. In: International Conference on Machine Learning. PMLR. 2022, pp. 17473–17498.
[79] Jigar Patel et al. “Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques”. In: Expert Systems with Applications 42.1 (2015), pp. 259–268.
[80] Martin L Puterman. “Markov decision processes”. In: Handbooks in Operations Research and Management Science 2 (1990), pp. 331–434.
[81] Alexander Quessy and Thomas Stuart Richardson. “Rewardless Open-Ended Learning (ROEL)”. In: OpenReview (2021).
[82] Nicholas J Radcliffe. “Genetic set recombination and its application to neural network topology optimisation”. In: Neural Computing & Applications 1.1 (1993), pp. 67–90.
[83] BR Rajakumar. “Static and adaptive mutation techniques for genetic algorithm: a systematic comparative analysis”. In: International Journal of Computational Science and Engineering 8.2 (2013), pp. 180–193.
[84] Thomas S Ray. “An approach to the synthesis of life”. In: Artificial Life II (1991), pp. 371–408.
[85] Raul Rojas and Raúl Rojas. “The backpropagation algorithm”. In: Neural Networks: A Systematic Introduction (1996), pp. 149–182.
[86] Sebastian Ruder. “An overview of gradient descent optimization algorithms”. In: arXiv preprint arXiv:1609.04747 (2016).
[87] Andrei A. Rusu et al. “Progressive Neural Networks”. In: CoRR abs/1606.04671 (2016). arXiv: 1606.04671. URL: ht