PROCEDURAL CONTENT GENERATION FOR VIDEO GAME LEVELS WITH HUMAN ADVICE

School of Computer Science & Applied Mathematics
University of the Witwatersrand

Nicholas Raal
793528

Supervised by Dr Steven James

July 20, 2023

Ethics Clearance Number: CSAM-2022-10

A research report submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Science

Abstract

Video gaming is an extremely popular form of entertainment around the world, and new video game releases are constantly being showcased. One issue in the video gaming industry is that game developers require a large amount of time to develop new content. A research field that can help with this is procedural content generation (PCG), which allows an effectively unlimited number of video game levels to be generated based on the parameters provided. Many of the methods found in the literature can reliably generate content that adheres to quantifiable characteristics such as playability, solvability and difficulty. These methods do not, however, take into account the aesthetics of the level, which is what makes a level look reasonable to a human player. To address this issue, we propose a method of incorporating high-level human advice into the PCG loop. The method uses pairwise comparisons as a way to assign a score to a level based on its aesthetics. Using these scores along with a feature vector describing each level, a support vector regression (SVR) model is trained that allows a score to be assigned to unseen video game levels. This predicted score is used as an additional fitness function of a multiobjective genetic algorithm (GA) and can be optimised as a standard fitness function would.
We test the proposed method on two 2D platformer video games, Maze and Super Mario Bros (SMB), and our results show that it can successfully be used to generate levels biased towards the human-preferred aesthetic features, whilst still adhering to standard video game characteristics such as solvability. We further investigate incorporating multiple inputs from a human at different stages of the PCG life cycle and find that this improves the proposed method, although further testing is still required. The findings of this research will hopefully assist in using PCG in the video game space to create levels that are more aesthetically pleasing to a human player.

Declaration

I, Nicholas Oliver Raal, hereby declare the contents of this research report to be my own, unaided work. This research report is being submitted for the Degree of Master of Science in Computer Science at the University of the Witwatersrand, Johannesburg. This work has not been submitted to any other university, or for any other degree.

Acknowledgements

This research would not have been possible without the invaluable input and assistance provided by Dr. Steven James.

Contents

Preface
  Abstract
  Declaration
  Acknowledgements
  Table of Contents
  List of Figures
  List of Tables
1 Introduction
2 Background and Related Work
  2.1 Genetic Algorithms
    2.1.1 Genetic Algorithms in PCG
  2.2 Machine Learning Approaches
  2.3 Research Framework
    2.3.1 TrueSkill
    2.3.2 Learning a Model
3 Research Approach
  3.1 Goals
  3.2 Hypothesis
  3.3 Experimental Design
    3.3.1 Genetic Algorithm
    3.3.2 Pairwise Comparisons
    3.3.3 TrueSkill
    3.3.4 Feature Detection and Extraction
    3.3.5 Support Vector Regression
    3.3.6 Fitness Functions
4 Results and Analysis
  4.1 Hyperparameter Optimisation
  4.2 Maze
    4.2.1 Baseline Human Fitness
    4.2.2 Baseline Solvability Fitness
    4.2.3 Maze Results
  4.3 Mario
    4.3.1 Baseline Human Fitness
    4.3.2 Baseline Solvability and Entropy
    4.3.3 Mario Results
5 Conclusion
  5.1 Future Work
References
List of Figures

2.1 PCGRL system architecture framework for content generation proposed by Khalifa et al. [2020]
2.2 TrueSkill Gaussian belief curve [Herbrich et al. 2007; Minka et al. 2018]
3.1 Flow diagram showing a high-level overview of the experimental structure for the PCG process
4.1 Example of a Super Mario Bros level of size (114 × 14)
4.2 Example of a Maze level of size (14 × 14)
4.3 A randomly generated Maze level, of size 14 × 14, followed by the same level overlaid with detected SIFT feature keypoints; circle size indicates the scale of the keypoint and the lines indicate the keypoint orientation
4.4 Comparison between low and high initial population sizes and their effect on the average human fitness and average number of black border pixels on the best generated level for 100 iterations; black lines indicate standard deviation
4.5 Comparison between low and high GA iteration counts and their effect on the average solvability fitness and average number of black border pixels for the best generated level; black lines indicate standard deviation
4.6 (a) Maze level generated using only the Solvability fitness function (25 black, 31 white); (b) Maze level generated using only the Human fitness function (40 black, 16 white); (c) Maze level generated using both Solvability and Human fitness functions (32 black, 24 white)
4.7 (a, c, e, g, i) SVR model prediction before optimisation; (b, d, f, h, j) SVR model prediction after optimisation
4.8 Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 20; baseline plots are indicated by dotted lines; shading indicates standard deviation
4.9 Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 100; baseline plots are indicated by dotted lines; shading indicates standard deviation
4.10 Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 200; baseline plots are indicated by dotted lines; shading indicates standard deviation
4.11 Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 500; baseline plots are indicated by dotted lines; shading indicates standard deviation
4.12 Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 1000; baseline plots are indicated by dotted lines; shading indicates standard deviation
4.13 How varying the number of iterations affects the solvability fitness and average number of black border pixels averaged over 5 different seeds for a population size of 200; shading indicates standard deviation
4.14 Flow diagram showing a high-level overview of the experimental structure for the multi-human-input PCG process
4.15 Three separate instances at which the virtual human inputs advice, shown as vertical dotted purple lines at x = 0, x = 80 and x = 160
4.16 A randomly generated Super Mario Bros level and the features detected on the level using SIFT
4.17 A zoomed-in section of the SIFT features detected in Figure 4.16; circle size indicates how prominent the keypoint is and the lines indicate the keypoint orientation
4.18 (a) SVR model prediction before optimisation for a population size of 500 trained using SIFT feature vectors; (b) SVR model prediction after optimisation for a population size of 500 trained using SIFT feature vectors; (c) SVR model prediction before optimisation for a population size of 500 trained using a flattened feature vector of the level; (d) SVR model prediction after optimisation for a population size of 500 trained using a flattened feature vector of the level
4.19 Comparison between varying initial population sizes when only the human fitness is being optimised by the GA for 100 iterations; standard deviation is indicated by the black lines
4.20 Comparison between varying GA iteration counts and their effect on average solvability, entropy and the average number of pipe blocks in the bottom row of the best generated level of the iteration; standard deviation is indicated by the black lines
4.21 Solvability, entropy, human fitness and number of pipe blocks in the ground row of a Mario level for 100 GA iterations with an initial population size of 20, averaged over 5 different seeds; baselines for number of pipe blocks, solvability and entropy fitness are indicated by dotted lines; shading indicates standard deviation
4.22 Solvability, entropy, human fitness and number of pipe blocks in the ground row of a Mario level for 100 GA iterations with an initial population size of 100, averaged over 5 different seeds; baselines for number of pipe blocks, solvability and entropy fitness are indicated by dotted lines; shading indicates standard deviation
4.23 Solvability, entropy, human fitness and number of pipe blocks in the ground row of a Mario level for 100 GA iterations with an initial population size of 200, averaged over 5 different seeds; baselines for number of pipe blocks, solvability and entropy fitness are indicated by dotted lines; shading indicates standard deviation
4.24 Solvability, entropy, human fitness and number of pipe blocks in the ground row of a Mario level for 100 GA iterations with an initial population size of 500, averaged over 5 different seeds; baselines for number of pipe blocks, solvability and entropy fitness are indicated by dotted lines; shading indicates standard deviation

List of Tables

4.1 Grid search ranges for each parameter
4.2 The binary version of the Maze level shown in Figure 4.3

Chapter 1

Introduction

Procedural content generation (PCG) is an algorithmic process that automates the design of video game components based on the requirements of the player or developer. These components include textures, levels, enemies and characters, among others [Barriga 2018; Shu et al. 2021a]. PCG has many other applications, including music and film [Barriga 2018]. A key difference between game-related PCG and PCG in other domains is that video game PCG has structural constraints that ensure the playability of the video game [Summerville et al. 2017].
The benefit of PCG for video games is that it allows content to be generated without a human game developer manually designing every element, increasing developer efficiency. The field of PCG for video game levels has grown in popularity in recent years. PCG has been used in video games since the early 1980s, in games such as Rogue and Elite, where it was used to generate each dungeon, allowing for replayability [Barriga 2018]. A more recent adoption of PCG is No Man's Sky, which uses online PCG to generate an effectively infinite number of planets in the game's expanding universe [Barriga 2018]. PCG for level generation has had particular success with 2D video game levels [Barriga 2018; Shu et al. 2021a]. PCG can be performed either offline or online, depending on what the game developers require. The main difference is that online PCG is more dynamic and interactive, but must run in real time so as not to degrade game performance [Shu et al. 2021a]; offline PCG, on the other hand, occurs during the game's development stage. As an example of online PCG, Shu et al. [2021a] proposed a way in which player experience can be used to dynamically change the difficulty of a level, in real time, based on the player's skill. Mawhorter and Mateas [2010] proposed a mixed method, using offline PCG, in which a level was made up of chunks of generated content and human-designed content. Linden et al. [2013] used gameplay mechanics as a gameplay grammar to generate action graphs, which were then used for the level layout. None of these approaches, however, tackles the aesthetics of the level being generated.
Using PCG in level design provides a very useful tool for game designers, as it frees them to focus on other aspects of game design whilst generating an unlimited number of levels. The issue at hand is that although PCG has been successful with gameplay mechanics, such as making a level solvable and appropriately difficult, the aesthetics of generated levels do not yet match what looks reasonable to humans. Although the generated levels can be played by a human, they do not look appealing, which can cause players to lose interest and undermines the feasibility of PCG for video game level generation. By improving the aesthetics of generated levels so that they look more like those designed by a human game developer, the interest of a human player would be increased. This is one way in which PCG can be made much more useful to the game design field than it currently is. For the research conducted in this report, two 2D platformer video games, Super Mario Bros (SMB) and Maze, are used to test the proposed PCG method of bringing level aesthetics into the loop. Both games are explored in order to give a more generalised view, as they look completely different, have different level sizes and follow different gameplay mechanics. Additionally, both have been used throughout the literature, making them ideal candidates [Khalifa et al. 2020; Yannakakis and Togelius 2011; Shu et al. 2021a; Beukman et al. 2022a]. Previous work on PCG in level design has not taken human preference for the generated levels into account. This is because the aesthetics of a video game level is not a standard metric that can be easily measured; it is a human visual preference. How, then, would one go about assigning a metric to something like aesthetics? This research takes inspiration from the work of Raskar et al.
[2015], where a ranking is assigned to Google Streetview images based on whether they look safe or not. Using a similar approach, we propose a way in which human advice can be incorporated into a PCG loop in order to generate levels that are more aesthetically pleasing to a human player. The approach assigns a score to a video game level based on the aesthetic preference of the human viewing it. Multiple levels are required so that the scores have structure: higher scores should correspond to levels that are more aesthetically pleasing to the human. This is done using a tournament structure in which levels are treated as contestants and the outcome of each match is determined by a pairwise comparison. In this way, levels with a preferred aesthetic are assigned higher scores than those that are not aesthetically pleasing. Using these scores together with feature vectors describing the levels, a regression model can be trained that accurately assigns a score to unseen generated video game levels. This score is then used within a fitness function of a genetic algorithm (GA), which can be optimised like any other fitness function in order to generate new levels containing features the human finds aesthetically pleasing. For the pairwise comparisons, we made use of TrueSkill, a probabilistic skill rating system developed for online gaming in order to match players of similar skill against one another. We found that by using TrueSkill to assign a score to a video game level by means of pairwise comparisons, and combining the TrueSkill data with the level feature data, a regression model can be trained that successfully predicts the TrueSkill scores of unseen levels. The regression model is used within a fitness function for a GA, which can be optimised like any standard fitness function in order to bias the GA towards generating levels whose features resemble those of a high TrueSkill score.
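To illustrate the scoring idea, the sketch below runs a small round-robin tournament of pairwise comparisons and updates a per-level score after each match. It is a minimal, self-contained stand-in, not the report's implementation: an Elo-style update replaces TrueSkill's Gaussian belief update, and the `prefers` function stands in for the human judge (here it simply prefers levels with more black tiles). All names and the toy data are illustrative.

```python
import itertools

def elo_update(r_win, r_lose, k=32.0):
    """One pairwise-comparison update (Elo-style).

    A simplified stand-in for TrueSkill's Gaussian update: the winner
    gains and the loser loses, scaled by how surprising the outcome was."""
    expected = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
    delta = k * (1.0 - expected)
    return r_win + delta, r_lose - delta

def score_levels(levels, prefers, base=1000.0):
    """Round-robin tournament; `prefers(a, b)` stands in for the human
    judge and returns True if level `a` looks better than level `b`."""
    scores = {name: base for name in levels}
    for a, b in itertools.combinations(levels, 2):
        winner, loser = (a, b) if prefers(levels[a], levels[b]) else (b, a)
        scores[winner], scores[loser] = elo_update(scores[winner], scores[loser])
    return scores

# Toy "levels": flat binary grids. The virtual judge prefers levels with
# more black (1) tiles -- an illustrative aesthetic rule only.
levels = {
    "A": [1] * 4 + [0] * 12,
    "B": [1] * 8 + [0] * 8,
    "C": [1] * 12 + [0] * 4,
    "D": [1] * 16,
}
scores = score_levels(levels, prefers=lambda a, b: sum(a) > sum(b))
```

Each (feature vector, score) pair then becomes one training example for the regression model; in the report itself the scores come from TrueSkill rather than this Elo-style stand-in.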
Thus, the method used in this research can successfully be used to generate 2D video game levels that contain features aesthetically preferred by a human. The results also show that standard video game characteristics, such as solvability, are maintained in the newly generated levels. This finding will hopefully assist in using PCG to create video game levels that are more aesthetically pleasing to a human player. Previous work related to PCG and genetic algorithms, as well as background information for each of the steps taken in developing the implementation, is discussed in Chapter 2. The goals of the research are presented, and the implementation steps and design choices described, in Chapter 3. Chapter 4 provides the results and analysis for each of the experiments conducted. The conclusion to the findings of this report is stated in Chapter 5.

Chapter 2

Background and Related Work

The field of PCG in video games can be broken down into many different elements, ranging from rules, gameplay mechanics, puzzle solving and map design to level design, to name a few. The research conducted here focuses on level generation in video games, and as such the other elements will not be discussed in detail. The term level design, in the context of this research, refers to levels in 2D video games such as Super Mario Bros (SMB). The next sections of this chapter discuss what genetic algorithms are and how they have been used in a PCG setting, along with machine learning based approaches to PCG. This is followed by background information on each of the components that make up the research methodology.

2.1 Genetic Algorithms

One of the most common approaches to PCG is to use genetic algorithms (GAs) [Katoch et al. 2021; Togelius et al. 2011; Pedersen et al. 2009; Yannakakis and Togelius 2011].
A genetic algorithm is a search-based optimisation algorithm inspired by the theory of natural evolution. Following natural evolution, the algorithm selects the fittest individuals from the population as the parents of future generations, thus implementing survival of the fittest. A typical genetic algorithm is described in the pseudocode shown in Algorithm 1.

Algorithm 1 Standard Genetic Algorithm, adapted from Katoch et al. [2021]
Require: n ← population size
Require: m ← maximum number of iterations
Ensure: Ybt ← global best solution
  Randomly generate initial population of size n
  t ← 0
  Compute fitness function for each chromosome
  while t ≠ m do
    Select a pair of chromosomes based on fitness, c1 and c2
    Apply crossover operator with crossover probability Cp
    Apply mutation on offspring O with probability Mp to get O′
    Replace the old population with the new population
    t ← t + 1
  end while
  Return Ybt ← best solution

The algorithm begins with a set of individuals, commonly called a population. Each individual is made up of multiple genes, which are encoded, most commonly using a binary encoding scheme, to form a chromosome: a string of genes. The purpose of the encoding is to speed up the crossover and mutation steps. Other encoding schemes include hexadecimal, value-based and tree encodings [Katoch et al. 2021]. A fitness function is used to calculate a fitness score for each chromosome. This score is the metric used to determine which chromosomes will be selected for reproduction in the future generation. The next step is the selection phase, where a pair of chromosomes, c1 and c2, is selected from the population based on their fitness scores. Chromosomes with a high fitness score have a higher chance of being selected, as they are the better picks for producing the next generation. The convergence rate of the genetic algorithm is directly impacted by the selection step.
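Algorithm 1 can be sketched in a few dozen lines. The snippet below is a minimal, illustrative implementation, not the report's own code: binary chromosomes, fitness-proportionate selection, single-point crossover with probability Cp and per-gene bit-flip mutation with probability Mp. The all-ones ("OneMax") fitness function is a placeholder for a real level-quality measure.

```python
import random

def genetic_algorithm(fitness, n=30, m=100, length=16, cp=0.9, mp=0.02, seed=1):
    """Minimal GA following Algorithm 1: binary chromosomes,
    fitness-proportionate selection, single-point crossover (prob. cp),
    per-gene bit-flip mutation (prob. mp), with elitism on the best."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    best = max(pop, key=fitness)
    for _ in range(m):
        new_pop = []
        while len(new_pop) < n:
            # Selection: fitter chromosomes are more likely to be chosen.
            weights = [fitness(c) + 1e-9 for c in pop]
            c1, c2 = rng.choices(pop, weights=weights, k=2)
            # Crossover: swap the tails after a random point with prob. cp.
            if rng.random() < cp:
                point = rng.randrange(1, length)
                offspring = c1[:point] + c2[point:]
            else:
                offspring = c1[:]
            # Mutation: flip each gene with probability mp.
            offspring = [g ^ 1 if rng.random() < mp else g for g in offspring]
            new_pop.append(offspring)
        pop = new_pop
        best = max(pop + [best], key=fitness)
    return best

# Placeholder fitness: number of 1-bits in the chromosome.
best = genetic_algorithm(fitness=sum)
```

Swapping `fitness=sum` for a solvability or aesthetic score turns this toy into the kind of level-generating GA discussed throughout this chapter.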
From the chromosome pair, a crossover operation is performed with crossover probability Cp in order to produce an offspring, O, which is added to the population. Typically a single-point crossover is used, where the genes after the crossover point are swapped [Katoch et al. 2021]. The crossover point is chosen randomly within the genes of the chromosomes. The offspring is a chromosome made up of a mixture of the genes from each parent. As an example, consider individuals c1 and c2 as the parents and individuals c3 and c4 as their corresponding offspring. Note that c1 and c3 share the same four initial genes, and the same relationship holds between c2 and c4, up to the crossover point, |. After that point, the last four genes of the parents are switched in the offspring: the last four genes of c1 and c2 become the last four genes of c4 and c3 respectively. In this way each offspring is made up of genes from both parents.

c1 = 0110|0011    c3 = 0110|1110
c2 = 1100|1110    c4 = 1100|0011

The new offspring has a small probability, Mp, of its genes mutating, which flips genes in order to prevent premature convergence within the population. One of the simplest forms of mutation for a binary encoding scheme is to simply flip some of the bits of the offspring [Katoch et al. 2021]. For example, consider a mutation occurring on the offspring c3. Individual c′3 shows what a mutated c3 would look like if the 4th, 5th and 6th bits were mutated. The mutated offspring, O′, is then added to the population. These steps are repeated until a new population has been generated [Katoch et al. 2021].

c′3 = 01110010

A different crossover operation is two-point crossover, where there are two crossover points at which the parent individuals swap their genes [Katoch et al. 2021]. As an example, consider individuals b1 and b2, where the crossover points are shown as |.
When a crossover point is reached, the genes of the parents swap. With two crossover points the parents' genes swap twice, which ultimately means that the middle segment of each parent is swapped in order to produce the offspring, b3 and b4.

b1 = 011|000|11    b3 = 011|011|11
b2 = 110|011|10    b4 = 110|000|10

2.1.1 Genetic Algorithms in PCG

Pedersen et al. [2009] proposed a way to personalise level generation for Infinite Super Mario Bros, a modified version of the original game. The important features that must be added to a Super Mario Bros level for it to be deemed complete are blocks, gaps and enemies. By modifying the pre-existing level generator, Pedersen et al. [2009] used four controllable parameters to generate the new levels: the number of gaps, the average gap width, the spatial diversity of the gaps and the number of direction switches. In addition to the controllable parameters, five statistical features were taken into account for the level generation: time, death, jump, item and kill. All of these parameters were combined into a single vector, which is then converted into a level in a stochastic way. The vector is used as the input to an artificial neural network (ANN), which maps it to a potential emotional preference of a human player. The primary emotional preferences considered by Pedersen et al. [2009] were fun, challenge and frustration. The data used to train the network was collected over the internet using a questionnaire. Sorenson and Pasquier [2010] proposed a different approach from most of the literature, in which the exact details of how a level should be generated are not required; rather, the desired properties of the level to be generated are used. To achieve this, a genetic algorithm is used, more specifically a Feasible-Infeasible Two-Population (FI-2Pop) genetic algorithm.
The algorithm starts with an infeasible population, which consists of levels that do not adhere to the basic level requirements. These infeasible individuals are evolved by the genetic algorithm into individuals that satisfy all of the level constraints, at which point they join the feasible population. The feasible population is evaluated using a fitness function, which is ultimately used to generate levels that satisfy the requirements of a playable level and are fun for the user. To determine whether or not a generated level is fun, Sorenson and Pasquier [2010] used a method proposed in their earlier work, where a fitness function is used to identify whether a given level is fun. The approach was tested on both SMB and a Zelda-like adventure game. Yannakakis and Togelius [2011] proposed a framework for PCG that computationally models the user experience in order to provide the player with an optimised experience. The proposed framework, called experience-driven procedural content generation (EDPCG), is built up of four key components: content representation, player experience modelling, content generator and content quality. To test the framework, Yannakakis and Togelius [2011] used Super Mario Bros as a means to provide personalised level generation. The level itself was represented as a short parameter vector containing the number, size and location of gaps, along with a boolean for the switching mechanic. Yannakakis and Togelius [2011] used data collected from hundreds of human players who ranked various levels in terms of fun, challenge, predictability, frustration, boredom and anxiety. The data was used to train a neural network that predicts the player states using evolutionary preference learning. Although the framework was tested on Super Mario Bros, the concept behind it can be applied to various other 2D video games. Togelius et al.
[2010] proposed an approach to search-based PCG that makes use of multiobjective evolution, whereas most of the literature focuses on single-objective evolution, to generate complete maps in a video game. Togelius et al. [2010] chose the video game StarCraft to test their procedural content generation, where entire maps were generated. A total of eight fitness functions were used to reflect various game characteristics, such as playability, fairness, skill differentiation and interestingness. A state-of-the-art multiobjective evolutionary algorithm (MOEA) called SMS-EMOA [Hochstrate et al. 2007] was used by Togelius et al. [2010] to keep the individuals of the population within the feasible region and ultimately achieve a sufficient representation of a Pareto front.

2.2 Machine Learning Approaches

Machine learning (ML) PCG has brought a new understanding to the PCG field and has the potential to expand it in the near future. Summerville et al. [2017] surveyed methods of using machine learning in the PCG field. Machine learning improves autonomous generation compared to search-based methods, as it removes the validation step requiring an evaluation function to validate the content, since the inputs to the network can be specified for the content required [Summerville et al. 2017]. The downside to machine learning PCG is that it requires a lot of training data, which is often not available for video games [Summerville et al. 2017]. The MLPCG approaches surveyed by Summerville et al. [2017] focus content generation on functionality, whereas this research focuses on both functionality and the aesthetics of the generated content. A further downside to the ML approach to PCG is that its performance depends on the training data. A reinforcement learning (RL) approach can perform similarly to ML, but does not have the same dependency on training data.
One of the first frameworks for using reinforcement learning in a PCG setting was proposed by Khalifa et al. [2020], and is shown in Figure 2.1. Because RL is used, the process of PCG is treated as an iterative task, rather than generating all of the content at once as is done in search-based methods. To do this, Khalifa et al. [2020] modelled content generation as a Markov decision process (MDP). The iterative approach begins by populating the level randomly with a set of tiles. At each state, St, the agent can change a single tile, At, within the level, which is then judged with respect to the goal of the level, producing a reward, r, for the agent. Khalifa et al. [2020] generalised the framework so that it could be applied to other games. This was achieved by splitting the framework into three parts: a problem module, a representation module and a change percentage. The problem module stores all of the information about the video game level, such as the reward function and the goal. The representation module transforms the current level into an observation state that is viable for the agent. In the representation module, three different formulations of the actions were used: narrow, turtle and wide. In the narrow formulation, the agent is provided with the current state and a location where it may make a change. In turtle, the agent can move around at each time step and change the tiles along its path. In wide, at each time step the agent has complete control over the location and tile type [Khalifa et al. 2020]. The final component, the change percentage, limits the number of changes the agent can make per episode, preventing the agent from making too many content changes; it works similarly to a discount factor in determining how greedy the agent should be. Khalifa et al.
[2020] designed the agent with limited actions in order to force it to take only a few actions in the environment. This was necessary because an optimal agent would always converge to a single optimal solution, which is not the desired outcome for PCG. Once the agents completed their training, they were used as generators for new levels, altering randomly initialized levels for a given number of iterations. Khalifa et al. [2020] tested their framework on three simple video games: Maze, The Legend of Zelda and Sokoban. Good results were found for Maze; however, the agent was unable to effectively generate harder levels for Zelda and Sokoban. Shu et al. [2021b] proposed a new framework that combines the frameworks proposed by Khalifa et al. [2020] and Yannakakis and Togelius [2011] into the ED(PCG)RL framework. The proposed framework allows for the generation of personalized content using reinforcement learning by making use of experience-driven reward functions. By building upon PCGRL, Shu et al. [2021b] managed to generate endless online levels that adhere to aesthetic and functional properties.

Figure 2.1: PCGRL system architecture framework for content generation proposed by Khalifa et al. [2020]

A latent vector is used to represent the state and action of the reinforcement learning agent, which allows the framework to be applied to any game that features levels which can be represented by segmentation and compression, for example 2D Atari games. A content quality component is used as a way to guarantee the content quality of the experience model; this quality is tested in gameplay simulations. The agent can then take an action corresponding to a generative act which alters the state of the content, and then receives a reward via the experience model function.
The state-action and experience representations are traversed iteratively in order to find a design policy that optimises the experience model. Shu et al. [2021b] tested their framework on Super Mario Bros, where it can be broken down into three components: a repairer and generator of non-faulty segments, an artificial SMB player to test segment playability, and a reinforcement learning agent that plays the Super Mario Bros endless-platform generation game. For the generator, MarioGAN, proposed by Volz et al. [2018], is used along with a CNet-assisted Evolutionary Repairer to successfully generate the level segments. The approach differs from PCGRL in that the latent vector of MarioGAN is used to represent the agent's states and actions for each level segment, which is faster than the approach by Khalifa et al. [2020]. Jennings-Teats et al. [2010] proposed an online level generator which changes its difficulty based on the player's skill level in real time. The game of choice was Polymorph, a 2D platformer which adjusts its levels during playtime using dynamic difficulty adjustment (DDA). Segments of the level are generated as the player traverses further and further, with the difficulty of each segment adjusted to suit the skill level of the player. The reasoning behind creating a DDA is to avoid the player becoming frustrated or bored with the game. This approach has elements of search-based algorithms; however, it is described as a machine learning-based solution to dynamic video game level generation. Using quality diversity, Gravina et al. [2019] proposed a method of PCG based on a subset of the work by Togelius et al. [2011]. Gravina et al. [2019] used a quality diversity approach because, in many PCG problems, there is a tradeoff between the quality and the diversity of the generated content.
A quality-diversity (QD) algorithm is one in which the quality and diversity of the solution are simultaneously maintained by rewarding divergence and using constraints to control the solution's quality. Gravina et al. [2019] ran experiments with multiple use cases and QD algorithms to generate a range of different video game content; only the approach used to generate level content is focused on in this paper. A MAP-Elites algorithm was used by Gravina et al. [2019] to generate levels in Super Mario Bros. A feature map of 256 cells was generated, and the algorithm could generate levels in an average of 100 cells, where levels with and without game mechanics were generated. One aspect of PCG which has not been investigated is structuring the content so that it is pleasing to humans. This is because assigning a score to a video game level based on a human preference and feeding it back into PCG is a challenging task. Similar methods have been used in non-PCG settings. One such method, used for predicting the safety of streets from Google Streetview images, was proposed by Raskar et al. [2015], where a TrueSkill score is assigned to each image based on the results of 208,738 pairwise comparisons. These scores were then used along with features detected from the Streetview images to train a v-SVR model to predict the safety of unseen streets.

2.3 Research Framework

This section provides an overview and background information relating to each of the steps in the proposed implementation.

2.3.1 TrueSkill

In order to provide a numeric value for the aesthetics of a video game level, imagine that each video game level is a contestant in a competitive tournament. Within this tournament, the winner is decided based on a human's visual preference when comparing one level to another.
A way to reliably provide a numerical value to each contestant, so that a ranking can be determined, is to use TrueSkill. TrueSkill is a probabilistic skill rating system developed for online gaming which assumes that players in a game have respective skill levels, w1 and w2, and that game outcomes can be predicted. This is achieved by using the performance difference between skills, which is subject to Gaussian noise [Herbrich et al. 2007; Minka et al. 2018]. TrueSkill is widely used by Xbox Live for competitive games, such as Halo and Forza Motorsport, as it allows players with similar skill levels to be matched against one another for a level playing field. The rating system uses two numbers to characterize a player's skill: the average skill of a player, µ, and the degree of uncertainty in the player's skill, σ. Using a Gaussian distribution, a player's skill can be estimated to lie within µ ± 3σ with 95% confidence [Herbrich et al. 2007; Minka et al. 2018]. The prior distribution is N(25, (25/3)²), which gives a new player a mean skill of µ = 25 and an uncertainty of σ = 8.333 [Herbrich et al. 2007; Minka et al. 2018]. Figure 2.2 shows how the belief of the TrueSkill rating system is drawn.

Figure 2.2: TrueSkill Gaussian belief curve [Herbrich et al. 2007; Minka et al. 2018]

In situations where the uncertainty is high, the system does not know exactly what the player's skill should be, whilst a low uncertainty means that the ranking system strongly believes the player is close to the estimated average. By using the degree of uncertainty, the system can make large changes to the skill estimates during a new player's early calibration matches; after multiple games, only small changes in skill are made. This allows the ranking system to identify the skill levels of players from a small number of games.
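These updates can be sketched directly in Python. The following is a simplified stand-in for the trueskill library, not the library itself: it assumes the standard no-draw weighting factors v(t, ε) = N(t − ε)/Φ(t − ε) and w(t, ε) = v · (v + t − ε), the default prior N(25, (25/3)²), and a performance noise of β = 25/6; the formal update equations are given in equation 2.1 below.

```python
import math

# Default prior N(25, (25/3)^2) and performance noise beta = 25/6.
MU0 = 25.0
SIGMA0 = 25.0 / 3.0
BETA = SIGMA0 / 2.0

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _v(t, eps):
    """Mean-update weighting factor for the win/loss (no-draw) case."""
    return _pdf(t - eps) / _cdf(t - eps)

def _w(t, eps):
    """Variance-update weighting factor: v * (v + t - eps)."""
    v = _v(t, eps)
    return v * (v + t - eps)

def rate_1vs1(winner, loser, eps=0.0):
    """Update two (mu, sigma) rating tuples after a one-versus-one game."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2.0 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = mu_w - mu_l
    v = _v(t / c, eps / c)
    w = _w(t / c, eps / c)
    new_winner = (mu_w + (s_w ** 2 / c) * v,
                  math.sqrt(s_w ** 2 * (1.0 - (s_w ** 2 / c ** 2) * w)))
    new_loser = (mu_l - (s_l ** 2 / c) * v,
                 math.sqrt(s_l ** 2 * (1.0 - (s_l ** 2 / c ** 2) * w)))
    return new_winner, new_loser

# One game between two new levels: the winner's mu rises, the loser's
# mu falls, and both uncertainties shrink.
a, b = rate_1vs1((MU0, SIGMA0), (MU0, SIGMA0))
```

Note how the size of the mean shift is governed by each player's own σ², so new, uncertain players move quickly while established players barely move, matching the calibration behaviour described above.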
Depending on the game mode and the number of players, the ranking system can take more or less time to determine a player's skill. TrueSkill 2, an extension of TrueSkill, can determine the skill of a new player much faster, with an improved overall accuracy from 52% to 68% [Minka et al. 2018]. TrueSkill 2 was not used in this research because the trueskill Python library does not currently support it; implementing TrueSkill 2 from scratch would have been beyond the scope of this research. A question that comes to mind is why use TrueSkill over the other rating systems available? Although TrueSkill was developed for online gaming, it is directly applicable to ranking images in a competitive system, such as pairwise comparisons, which create one-versus-one match-ups [Herbrich et al. 2007; Minka et al. 2018; Raskar et al. 2015; Burke 2016]. A 2D video game level can easily be transformed into an image, making TrueSkill directly applicable to generated 2D video game content. In a two-player contest, TrueSkill uses the updates shown in equation 2.1 to update the scores and uncertainties of players p1 and p2, under the assumption that player p1 is the winner. In equation 2.1, $t = \mu_{p_1} - \mu_{p_2}$ and $c^2 = 2\beta^2 + \sigma_{p_1}^2 + \sigma_{p_2}^2$. The probability of a draw is given by the parameter ε, $v(t, \varepsilon)$ is a weighting factor for the player's skill level and $w(t, \varepsilon)$ is a weighting factor for the uncertainty [Liu et al. 2013].

$$
\begin{aligned}
\mu_{p_1} &= \mu_{p_1} + \frac{\sigma_{p_1}^2}{c} \cdot v\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) &
\mu_{p_2} &= \mu_{p_2} - \frac{\sigma_{p_2}^2}{c} \cdot v\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right) \\
\sigma_{p_1}^2 &= \sigma_{p_1}^2 \left[1 - \frac{\sigma_{p_1}^2}{c^2} \cdot w\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right] &
\sigma_{p_2}^2 &= \sigma_{p_2}^2 \left[1 - \frac{\sigma_{p_2}^2}{c^2} \cdot w\!\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right]
\end{aligned}
\tag{2.1}
$$

2.3.2 Learning a Model

A model needs to be learnt that maps video game levels to a predicted aesthetic score. One method of doing this is to first compute image features from the levels and then use the feature data to train a predictor model.
2D video game levels can be portrayed as 2D images, which allows computer vision feature detection techniques to be used. Variations of the feature detection techniques listed below have been used in past research for scene recognition and for passing feature data into a predictor. Although past research made use of natural images, the techniques should perform similarly on video game levels, which makes them ideal candidates for the research being conducted [Raskar et al. 2015; Xiao et al. 2010]:

• Histogram of local binary patterns (LBP)
• Speeded-Up Robust Features (SURF)
• Scale-invariant feature transform (SIFT)
• Histogram of Oriented Gradients (HOG)

LBP compares the current pixel's value to each of its neighbouring pixels. If a neighbouring pixel's value is greater than or equal to that of the current pixel, the neighbouring pixel is assigned a 1; any neighbouring pixel with a value less than the current pixel's is assigned a 0. A histogram of the frequency of each resulting value is then created and used to form a feature vector [Ojala et al. 2002]. SURF approximates second-order Gaussian derivatives using integral images created from a set of 2D box filters. This step is followed by a scale-invariant blob detector which is used to find the keypoints in the image, after which the image descriptors are described by the distribution of Haar wavelet responses near each keypoint [Mai Thanh Nhat Truong 2016]. A popular feature detection technique is SIFT, in which keypoints are detected by finding areas with large gradients in more than one direction, based on the Difference-of-Gaussian (DoG) operator. For each keypoint detected, a feature vector is created from a histogram of the gradient orientations in a local area around the keypoint [Mai Thanh Nhat Truong 2016]. HOG divides the image into small cells and computes the magnitude and gradient for each pixel in the cell.
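The LBP computation just outlined can be made concrete. The following is a minimal 8-neighbour sketch in numpy; the published descriptor additionally uses circular neighbourhoods and uniform patterns, which are omitted here for brevity.

```python
import numpy as np

def lbp_histogram(img):
    """Minimal 8-neighbour LBP: each neighbour that is >= the centre pixel
    contributes a 1-bit, giving an 8-bit code per interior pixel; the
    feature vector is the normalized 256-bin histogram of those codes."""
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]  # interior (centre) pixels
    # 8 neighbour offsets, ordered clockwise from the top-left corner.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offs):
        nb = img[1 + dy:img.shape[0] - 1 + dy,
                 1 + dx:img.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# On a constant image every neighbour equals its centre, so every
# comparison yields a 1-bit and every code is 255.
flat = lbp_histogram(np.ones((8, 8)))
```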
This information is then aggregated into a histogram of oriented gradients for each cell [Bertozzi et al. 2007]. A potential second method for extracting features from the input video game levels is to use a neural network in the form of an autoencoder. This consists of encoding the input image into a vector and then decoding the vector back into the original image, whilst minimizing the reconstruction error. An issue with this method is that it requires a large amount of data to train the encoder, and it focuses on the quantity of data in the video game level rather than the important data within it. Video game levels contain a relatively low amount of data but a high number of key features which directly impact the aesthetics of the level. Due to how video game levels are comprised, and the large amounts of training data required by autoencoders, extracting feature data using more traditional techniques is the preferred method in this research. Once the various features for the images have been computed, a regression model can be trained which maps the features to the TrueSkill score. This model can then be used to predict the TrueSkill score for any new unseen video game level image, as it takes an image feature vector as input and outputs a predicted TrueSkill score. The predicted score can then be used in the fitness function of a GA. In supervised machine learning there are several types of regression models, each of which has its own advantages and disadvantages. This research uses support vector regression (SVR) due to its ability to generalize well on smaller datasets, such as the feature vectors, whilst providing high prediction accuracy and allowing the model variables to be easily configured for each application.
Additionally, SVR allows for the maximum number of datapoints to be used within the margin of the hyperplane, which ensures the majority of a small dataset is accounted for and allows subtle patterns within the dataset to be identified [Najwa Mohd Rizal et al. 2022]. SVR is an extension of the popular supervised machine learning tool, the support vector machine (SVM), first introduced by Vapnik [1995]. SVM is commonly used in classification-based solutions, whilst SVR is more commonly used in regression-based solutions [Smola and Schölkopf 2004; Hsu et al. 2003]. In order to predict a TrueSkill score for an unseen level, a regression model takes in the unseen level's feature data and uses it as the basis for predicting a TrueSkill score. There are several parameters in SVR which can be optimised to improve the accuracy of the model, some of which are discussed in detail. These parameters are:

• C: the regularization parameter
• γ: the kernel coefficient
• Kernel: the kernel type

The regularization parameter, C, controls the tradeoff between minimizing the training error and keeping the model simple, thus determining how well the model generalizes to unseen data [Smola and Schölkopf 2004; Hsu et al. 2003]. As this research requires accurate prediction on unseen feature data, selecting the correct C can make or break the model. One method of ensuring C is tuned correctly is to use cross-validation, to prevent overfitting, or a grid search [Smola and Schölkopf 2004; Hsu et al. 2003]. The γ parameter directly affects the kernel being used and primarily has two options: scale, which uses a value of 1/(num_features · Var(X)), and auto, which uses a value of 1/num_features, where num_features is the number of input data features and Var(X) is the variance of the input data features. Similarly to C, cross-validation and grid search are commonly used to tune γ [Smola and Schölkopf 2004; Hsu et al. 2003].
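These parameters can be tuned jointly. The following sketch uses scikit-learn's GridSearchCV, with synthetic data standing in for the level feature vectors and TrueSkill scores; the actual feature pipeline and parameter grid used in this research may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Stand-in for the real data: rows are level feature vectors, targets
# are TrueSkill scores (both synthetic here).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=80) + 25.0

# Cross-validated grid search over C, gamma and the kernel type.
grid = GridSearchCV(
    SVR(),
    param_grid={
        "kernel": ["linear", "rbf"],
        "C": [0.1, 1.0, 10.0],
        "gamma": ["scale", "auto"],  # 1/(n_features*Var(X)) vs 1/n_features
    },
    cv=5,
)
grid.fit(X, y)

# Predicted TrueSkill scores for "unseen" levels.
pred = grid.predict(X[:5])
```

The fitted grid exposes the chosen parameters via `grid.best_params_`, which is how one would confirm whether the linear or RBF kernel was preferred for a given feature set.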
The last parameter that is optimised is the kernel, which in this research alternates between linear and radial basis function (RBF). The RBF kernel maps non-linearly separable data samples into a higher dimension, which makes it the preferred kernel when the data has a non-linear relationship [Hsu et al. 2003]. The linear kernel is preferred when the relationship in the data is linear; however, because the linear kernel is a special case of the RBF kernel, the RBF kernel can perform at least as well even on linear data and is therefore often chosen over the linear kernel [Hsu et al. 2003]. Raskar et al. [2015] made use of an improved SVR by Scholkopf et al. [2000] called v-SVR. The benefit of v-SVR over a normal SVR is that it provides a variable, v, which, together with ε, allows the number of support vectors to be controlled, essentially giving some slack to the regression model [Raskar et al. 2015; Scholkopf et al. 2000]. In SVR, a regression function f(x) can be calculated using equation 2.2, where x is an input feature vector and y the corresponding label. To estimate functions of the form of equation 2.2, the following is done. For each xi an error of ε is allowed, and anything beyond ε is captured by slack variables $\xi_i^{(*)}$. These slack variables are penalized in the objective function using a regularization constant, C, given a priori. Then, by minimizing equation 2.3, the accuracy of the approximation can be determined [Scholkopf et al. 2000].
$$
f(x) = (w \cdot x) + b, \qquad w, x \in \mathbb{R}^N,\ b \in \mathbb{R}
\tag{2.2}
$$

$$
\begin{aligned}
\text{minimize }\ & \tau(w, \xi^{(*)}, \varepsilon) = \frac{1}{2}\|w\|^2 + C \cdot \left( v\varepsilon + \frac{1}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*) \right) \\
\text{subject to }\ & ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i \\
& y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^* \\
& \xi_i^{(*)} \ge 0, \quad \varepsilon \ge 0
\end{aligned}
\tag{2.3}
$$

Chang and Lin [2002] proved that ϵ-SVR (epsilon support vector regression) with the parameters (C/l, ϵ) has the same solution as v-SVR with the parameters (C, v), where l is the size of the input dataset. This shows that using ϵ-SVR with the correct parameters is a viable method for this research, and equation 2.2 can also be used to represent ϵ-SVR. Various approaches from the literature on PCG in video games have been discussed, such as genetic algorithms and machine learning based methods. Although plenty of work has been done in the field, PCG is still not widely used. This chapter discussed a non-PCG method which can be translated into a PCG context for assigning a score to an unseen video game level based on its aesthetics. The theory behind TrueSkill and the way in which an SVR model can be learnt from image feature data have been explored in this chapter. The next chapter addresses the goals of the research and the implementation that will be followed in order to solve the lack of aesthetically pleasing PCG-based video game levels using a GA.

Chapter 3

Research Approach

Using the background information discussed in Chapter 2, it is evident that although many of the methods can procedurally generate video game level content that adheres to playability, none of these methods take into account the level design aesthetics. This issue arises because PCG methods focus their fitness functions on characteristics such as playability, completeness and level difficulty. Each of these characteristics is quantifiable; however, it is far harder to quantify what is meant by an aesthetically pleasing video game level.
Thus, this research aims to provide high-level human-in-the-loop advice to GA-based PCG. It is important that the advice provided by the human is high level, to allow anyone to provide insight, and infrequent, so that the human is not constantly required to provide it. This chapter sets out the goals to be achieved, along with the experimental design and the choices made for the research conducted.

3.1 Goals

The purpose of this research is to determine whether taking in human advice, converting it to a fitness function and applying it to a genetic algorithm can result in more aesthetically pleasing procedurally generated video game levels, based on the specified human input, whilst still adhering to video game level characteristics. The research proposes a way in which aesthetically pleasing levels can be quantified. To accomplish this, the research proposes using pairwise comparisons, shown to humans or virtual humans, who select whichever of the two levels they find more aesthetically pleasing. Virtual humans in this research refer to code which performs the role a human typically would in the pairwise comparisons. By using code, a human is not required, which allows large-scale experiments to be run quickly. The question then is how pairwise comparisons transfer into a quantifiable measurement. The goal of this research is to use the TrueSkill ranking system to systematically rank the levels based on a one-versus-one competition, namely the pairwise comparisons. The level the human prefers is considered the winner of the one-versus-one competition and is given a higher TrueSkill score than a level which is not found aesthetically pleasing. Following this, a predictor is trained on the TrueSkill score data along with the corresponding level feature vectors to predict TrueSkill scores for unseen levels.
By using the predictor as a fitness function in the GA, the predicted fitness can be optimised along with any additional fitness functions, such as solvability, to create a bias towards generating levels that a human would aesthetically prefer whilst still satisfying standard video game level metrics. This fitness function is referred to as the human fitness in this research. The reason for including computational metrics, such as solvability, alongside the human fitness is that this research did not want to merely create an interactive genetic algorithm (IGA) that constantly requires human input, but rather a GA built up of both a computational and an interactive GA, which requires only a small amount of human input, and to show how the two can work together. An additional question which this research aims to answer is whether using human comparisons at additional points during the PCG life cycle improves the aesthetics of the levels being created.

3.2 Hypothesis

Using human advice in pairwise comparisons provides a way to systematically rank unseen video game levels, yielding an optimisation objective which can be integrated into a fitness function that will improve the aesthetics of the video game levels whilst still adhering to set performance characteristics.

3.3 Experimental Design

In order for the presented hypothesis to be investigated, experiments needed to be conducted. This section summarizes each aspect of the experimental framework used to conduct the necessary experiments and generate the results discussed in Chapter 4. To reduce the risk of using human input and evaluation, a proof of concept is developed prior to using real people, to show that the method of developing levels based on human preference does in fact work. This is done by constructing a virtual human and specifying a preference in level features.
The pairwise comparison step still occurs, but it is completed by code using the binary level data, rather than by a human selecting a level manually. This allows 'human' advice to be added to the PCG loop without actually using a human. Figure 3.1 provides a high-level overview of the experimental framework and how the PCG loop incorporates an SVR model to predict a human fitness value for unseen video game levels. For the very first iteration of the GA, the SVR model object does not yet exist, so it is important that a random initial population of levels is generated to aid in the creation of the model. The initial dataset is randomly generated using a genetic algorithm; this way, a large amount of level data can be generated without the need for external resources. Pairwise comparisons to find preferred aesthetic features are then done on this initial set of data, producing TrueSkill scores, along with feature vectors describing each level. This data is preprocessed and used to train an SVR model which is passed back into the human fitness function of the GA. Once the model is passed back into the human fitness function, it is used to predict a TrueSkill score for each of the subsequent levels generated at each iteration of the GA. An assumption being made is that the SVR model should improve each time pairwise comparisons are done, as the regression model is refitted to better distinguish levels that humans find more reasonable.

3.3.1 Genetic Algorithm

The goal of the genetic algorithm is to optimise the human fitness function whilst still adhering to the solvability fitness function. This means that although human fitness is the focus, the solvability fitness is just as important, as newly generated levels should be both solvable and exhibit the chosen human components. The genetic algorithm implemented in this research is adopted from a general genetic algorithm developed and used by Beukman et al. [2022a].
The implemented GA takes in a game level object and a fitness function. The GA aims to maximize the fitness function and selects the best performing individual, based on the fitness function, to breed with in order to form the next population. In the GA breeding step, the best performing individual is kept whilst the rest are generated using a simple two-point crossover with the parents and a mutation probability. The pseudocode for incorporating human fitness into the GA for PCG is shown in Algorithm 2, and each of the steps taken is discussed in detail along with the design choices made. One of the conditions of the research hypothesis is for the generated level to adhere to performance characteristics whilst improving in its aesthetics. For this to occur, multiple fitness functions need to be input into the GA. To combine multiple fitness functions, a weighted sum is applied to the fitness functions to create an overall fitness which is passed back into the GA. The method of using a weighted sum has been found to be comparable to multi-objective optimisation techniques [Beukman et al. 2022a; Gomes et al. 2015]. This overall fitness is essential for the improvement of the GA, but the individual fitnesses are also returned in order to record the performance of the GA during its life cycle.

Figure 3.1: Flow diagram showing a high level overview of the experimental structure for the PCG process.

In order for the human fitness function to be created, a set of initial population levels is required, because an SVR model cannot be developed without initial data. As such, the GA creates a completely random population of a size specified by the user. Pairwise comparisons are applied to each level in the random population in order to assign TrueSkill scores to the data.
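The weighted-sum combination of fitness functions can be sketched as follows. The names here (CombinedFitness, solvability, human_fitness) are illustrative stand-ins rather than the actual implementation, and the clamping of negative predicted TrueSkill scores follows the design choice described in Section 3.3.3.

```python
class CombinedFitness:
    """Weighted sum of fitness functions; the individual scores are
    returned alongside the overall fitness so they can be recorded."""

    def __init__(self, fitness_fns, weights):
        assert len(fitness_fns) == len(weights)
        self.fitness_fns = fitness_fns
        self.weights = weights

    def __call__(self, level):
        scores = [fn(level) for fn in self.fitness_fns]
        overall = sum(w * s for w, s in zip(self.weights, scores))
        return overall, scores

def solvability(level):
    # Placeholder: 1.0 if an agent can finish the level, 0.0 otherwise.
    return 1.0

def human_fitness(level):
    # Placeholder for model.predict(features(level)); a badly performing
    # level can receive a negative predicted TrueSkill score, ...
    predicted = -3.2
    # ... which is clamped to 0 so it cannot destabilize the GA.
    return max(predicted, 0.0)

fitness = CombinedFitness([solvability, human_fitness], [0.5, 0.5])
overall, individual = fitness(level=None)
```

Scaling the human-fitness weight, rather than rescaling the TrueSkill scores themselves, keeps the number of tuning factors down, as discussed above.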
The feature vectors for each population level are extracted and passed into the SVR object, along with the TrueSkill scores, as the training data for an SVR model.

Algorithm 2 Procedural Content Generation with Human Fitness and Solvability
Require: pop_size ← Population size
Require: ga_iter ← Number of GA iterations
 1: hf_weight ← Human fitness weight
 2: solv_weight ← Solvability fitness weight
 3: Initialize CombinedFitness() ← Fitness functions and weights
 4: Initialize GeneralGAPCG() object ← Game level and CombinedFitness()
 5: model ← None
 6: for i in range(ga_iter) do
 7:   if i == 0 then
 8:     Generate a random population of size pop_size
 9:     get_trueskill_scores() ← Calculate TrueSkill scores for each population level
10:     generate_features() ← Extract features from each population level
11:     X, y ← preprocess_data()
12:     model ← create_SVR_model(X, y)
13:     Update human fitness with optimised SVR model object
14:   end if
15:   Perform an iteration of the GA
16:   Record best level fitness function values
17: end for
18: best_level ← Save best level

Once a model has successfully been created, it is passed into the human fitness calculation to be used on unseen levels generated by the GA. The GA can then be run with the inclusion of the human fitness function for N iterations, N being specified by the user, and the best performing level of the GA is output. For each N, the GA runs for 100 iterations, selecting the best performing individual at each iteration to breed the next population. This individual is recorded and used to track the performance of the GA at each iteration.

3.3.2 Pairwise Comparisons

A pairwise comparison is when two level images are compared to one another and one of the images is preferred over the other in an aesthetic respect. In order for the pairwise comparisons to be done by humans, we treat the comparisons as a one-versus-one game.
The game shows a pair of images to each human, who is simply required to select either the left image or the right image, depending on their preference. The image that is selected is deemed the winner, and the one that was not selected the loser. Based on many of these games, played with multiple different pairs of images, a ranking is assigned to each image so that there is an overall picture of which images looked more aesthetically pleasing and which did not. Multiple games are played in order to provide enough data to train an acceptable model. This approach is similar to how ranking is determined in a one-versus-one tournament structure. Using an initial random population, humans or virtual humans are required to perform the pairwise comparisons on the population before a predictor can be developed. When using real humans, ethical clearance is required; however, no personal information is recorded, only the outcome of the pairwise comparisons. The ranking system behind the pairwise comparisons is TrueSkill, which is discussed in detail in Section 3.3.3. From the results of the pairwise comparisons, a regression model is developed to predict the TrueSkill scores for unseen levels created by the GA. For future work where real humans are used, one way to perform the pairwise comparisons is through a simple website which allows the human to select the left or right image and then moves on to a new pair of images until all of the pairwise comparisons have been completed.

3.3.3 TrueSkill

TrueSkill is a probabilistic skill-based rating system which is used to assign a numerical value to each of the generated levels, be it in the initial GA population or during the GA generation process. Each level in the population is modeled by N(µ, σ²) [Herbrich et al. 2007].
The score is calculated based on the features in the level being favoured by the human in a one-versus-one competition setting. Although TrueSkill does cater for a draw in a one-versus-one setting, a design choice was made not to implement the ability for two levels to draw, as drawn levels would not provide any benefit to the overall TrueSkill outcome of each level. For example, in the virtual human case, if two levels contain the same number of black border pixels, the first level is deemed the winner. Due to the nature of the round robin tournament, levels with identical virtual human features end up with very similar TrueSkill scores, showing that a draw condition is not required. The TrueSkill score of the winning level is increased in proportion to its degree of uncertainty, σ, whilst the TrueSkill score of the losing level is decreased in proportion to its σ. The degree of uncertainty is a unique value for each level, which allows for large changes in TrueSkill scores for levels with a low number of games, and much smaller changes the closer the level gets to its perceived skill. As per the documentation provided by Microsoft, a minimum of twelve contests is required for a level's TrueSkill score to stabilize [Herbrich et al. 2007]. However, in a round robin tournament that contains more than eight contestants, the use case in this research, a minimum of only three games is required for a stable ranking [Herbrich et al. 2007]. As previously discussed, a round robin tournament style was chosen for the virtual human TrueSkill calculations. A round robin tournament is one in which every competitor plays every other competitor once. This design choice was made for the virtual human implementation for two reasons.
The first is that, as it is virtual, a round robin competition completes quickly, which leads into the second reason: it allows enough games to be played to ensure the degree of uncertainty is minimised for each competitor. Once the round robin tournament has completed, the TrueSkill scores are stored so that they can be passed into the data preprocessing step before the SVR is trained. Because the TrueSkill algorithm estimates level skill rather than level rank [Herbrich et al. 2007], a badly performing level can receive a negative TrueSkill score. This negative score can transfer to the SVR model, which in turn produces negative predictions that cannot be used in the GA. To ensure that the GA works correctly for negative predicted scores, clamping is used, which sets any negative predicted TrueSkill to 0. An alternative approach to clamping would be to scale the TrueSkill scores so that they can never go below 0. The issue with scaling the TrueSkill at this step is that it creates an additional tuning factor, which is preferably avoided to reduce the complexity of the code. Instead, the weight of the human fitness, which can easily be altered, is used to scale the TrueSkill.

3.3.4 Feature Detection and Extraction

The previous section discussed how this research assigns a TrueSkill score to a level by comparing its aesthetics to those of other levels. The question that remains is: how can a TrueSkill score be predicted for an unseen level if there is nothing to compare it to? While neural networks seem like the natural choice to answer this question, we do not have a lot of data; we therefore extract feature data from the population levels and use it to train a regression model with which to predict the TrueSkill score.
There is a wide range of feature extraction techniques that have been researched extensively and used on a multitude of different images and image types, but few have been applied to 2D video game levels. There is currently no known research investigating the best feature extraction techniques for video game levels. Thus, this research implements five well-known feature extraction techniques in order to find the best performing technique for extracting feature descriptors from 2D video game levels and passing them into an SVR model. Some of the chosen techniques have been used in similar research focused on scene recognition, where they are used to extract features from natural images which are then passed into an SVR model [Raskar et al. 2015; Xiao et al. 2010]. This step in the process is essential, as the better the features extracted from the levels, the better the SVR model will perform on unseen data. The five techniques implemented are:

• Scale Invariant Feature Transform (SIFT)
• Histogram of Oriented Gradients (HoG)
• Local Binary Patterns (LBP)
• Oriented FAST and Rotated BRIEF (ORB)
• Binary Robust Independent Elementary Features (BRIEF)

Each of these methods extracts feature keypoints and descriptors from an input image and outputs them as a feature vector. The size of these feature vectors varies based on the method used; more powerful methods, such as SIFT and HoG, can detect 10,000 features or more. With so many features and varying feature vector sizes, a design choice was made to standardise the feature vectors and scale them down to a more manageable size whilst still providing enough information not to negatively affect the SVR model. This standardisation was done by implementing a bag of visual words (bovw) approach on top of the feature extraction techniques, which creates a frequency histogram of the features. The bovw implementation is described by the pseudocode in Algorithm 3.
It uses the detected feature descriptors from each input image and appends them to a descriptor list. Using all of the descriptors from every input image, k-means clustering is used to create a dictionary of image descriptors. A KMeans object is used with the default value of 10 for the number of times the k-means algorithm is run with different seeds. K-means clustering is computed over the feature descriptor list and the visual words vocabulary is created from the cluster centres. A default value of k = 300 was chosen for each of the implemented feature techniques, as quantising the descriptors into 300 visual words provides enough information for the SVR without negatively affecting its computational performance; k = 300 was also used in prior work, which provided motivation for this quantisation size [Raskar et al. 2015; Xiao et al. 2010]. Once the visual words vocabulary is created, the same feature detection method is used on an input image to extract its feature descriptors. In order to assign the descriptors to their vocabulary entries, vector quantisation (vq) is used, which compares each descriptor with the centroids contained in the visual words dictionary and assigns it to the closest centroid. This process creates a frequency histogram specific to the input image, which drastically reduces the amount of data contained in the feature vector.

3.3.5 Support Vector Regression

Before the SVR model can be created, the level TrueSkill scores and corresponding level feature vectors for each population level need to be aligned. The feature vector dataset and TrueSkill score dataset are merged together and then split into training data, X and y, which is used to train the SVR model.
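Under the assumption that descriptors have already been extracted (random arrays stand in for SIFT output below), the vocabulary construction and histogram steps can be sketched with scikit-learn's KMeans (using the default of 10 runs on different seeds) and SciPy's `vq`; all names are assumptions for this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.vq import vq

def build_vocabulary(descriptor_list, k=300):
    """Cluster all descriptors from all images; the cluster centres
    become the visual words vocabulary (Algorithm 3, line 8)."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(np.vstack(descriptor_list))
    return kmeans.cluster_centers_

def bovw_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return
    the k-bin frequency histogram for the image."""
    words, _ = vq(descriptors, vocabulary)
    hist, _ = np.histogram(words, bins=np.arange(len(vocabulary) + 1))
    return hist

# Toy stand-in for SIFT output: 5 images, 40 random 128-dim descriptors each.
rng = np.random.default_rng(0)
images = [rng.normal(size=(40, 128)) for _ in range(5)]
vocab = build_vocabulary(images, k=10)          # k = 300 in the thesis
feature_vectors = [bovw_histogram(d, vocab) for d in images]
```

Each resulting histogram has one bin per visual word and sums to the number of descriptors in the image, so every level is reduced to a fixed-length feature vector regardless of how many keypoints were detected.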
Algorithm 3 BOVW Feature Detection
Require: imgs ← List of images
Require: method ← Feature extraction method to use (str)
Require: k ← Number of clusters
1: descriptor_list ← []
2: feature_vectors ← []
3: for i in imgs do
4:   Load in image i
5:   Extract keypoints and descriptors of i using method
6:   descriptor_list.extend(descriptors)
7: end for
8: vis_words ← kmeans(k, descriptor_list)
9: for j in imgs do
10:   Load in image j
11:   Extract keypoints and descriptors of j using method
12:   img_hist ← create_bovw(descriptors, vis_words, k) ▷ Create bovw histogram
13:   Add img_hist to feature_vectors
14: end for
15: Scale feature_vectors
16: Return feature_vectors

For initial testing of the SVR model, the preprocessing function split the data into training and testing datasets with a test ratio of 0.3. This approach is commonly used, as it provides a way to train a model on some data, the training set, and test it on unseen data, the test set. It also provides a way to calculate metrics, such as Root Mean Squared Error (RMSE), which estimate a model's accuracy. For this research, the design choice was made not to split the training data: the more training data available, the better the model should perform on unseen data, which is more beneficial to the research being conducted. Hsu et al. [2003] found that correctly scaling the training dataset for SVM-type models greatly improves performance, as it prevents features with larger numeric ranges from dominating those with smaller ranges. To avoid this potential problem, a design choice was made to scale the feature data into the range [0, 1] before storing it; this is one of the ranges recommended by Hsu et al. [2003]. To create the model, a base SVR object is created first, and then fitted to the preprocessed training data passed into the function, X and y.
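A minimal sketch of this scaling and fitting step with scikit-learn, using synthetic data in place of the real feature vectors and TrueSkill scores (all variable names are assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Toy training data: X stands in for bovw feature vectors,
# y for the corresponding TrueSkill scores.
rng = np.random.default_rng(1)
X = rng.integers(0, 50, size=(100, 300)).astype(float)
y = X[:, 0] * 0.5 + rng.normal(scale=2.0, size=100)

# Scale each feature into [0, 1], as recommended by Hsu et al. [2003],
# so large-range features cannot dominate small-range ones.
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# Fit a base SVR on all of the data (no train/test split in the PCG loop).
model = SVR()
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
```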
The initial fit of the data to the SVR is in most cases not optimal, resulting in a poor SVR model. In order to improve this model so that it can accurately predict TrueSkill scores for unseen levels from their feature vector data, the hyperparameters of the model need to be optimised. To optimise the parameters of an SVR, a grid search is performed followed by a cross validation step, as stated in Section 2.3.2. The choice of the parameter grid was an important design decision: a grid that is too large will negatively impact the time taken to optimise the model, whilst a grid that is too small may not contain the optimal parameters. The predicted output of the SVR model can be used directly in a fitness function, referred to as the human fitness function. This predicted output is a TrueSkill score for a video game level, which allows it to be used as an additional objective for the GA to optimise.

3.3.6 Fitness Functions

The fitness functions that are passed into the GA have been mentioned briefly in previous chapters. This section expands on the fitness functions used and how they are calculated, as they are an important part of the research being conducted. For each of the fitness functions, the objective of the GA is to maximise them.

Solvability

Solvability in a video game environment is a binary question with an answer of either yes, a level can be solved, or no, a level cannot be solved. The solvability fitness function in this implementation outputs a 0 for unsolvable levels and a 1 for solvable levels. As an example, assume there is a Mario level that has a pillar of unbreakable blocks, with no possible way over or through them, blocking the player from reaching the finish line. This would be considered an unsolvable level, and it would therefore be given a solvability fitness score of 0. The fitness function determines the solvability of a level using the A∗ search based method of Hart et al. [1968].
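The thesis uses the A∗ method of Hart et al. [1968]; for the binary solvable/unsolvable answer on a grid of wall and empty tiles, a plain breadth-first search gives the same result. A sketch, assuming a nested-list grid in which 1 encodes a wall:

```python
from collections import deque

def solvability_fitness(grid):
    """Return 1 if a path of empty tiles (0) connects the top-left
    tile to the bottom-right tile using only cardinal moves, else 0.
    The thesis uses A* search [Hart et al. 1968]; plain BFS gives the
    same binary answer for this representation."""
    n = len(grid)
    if grid[0][0] == 1 or grid[n - 1][n - 1] == 1:
        return 0
    seen, frontier = {(0, 0)}, deque([(0, 0)])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == (n - 1, n - 1):
            return 1
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return 0

open_level = [[0, 0], [1, 0]]  # path exists down the right column
blocked = [[0, 1], [1, 0]]     # diagonal wall: no cardinal path
```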
Human Fitness

The human fitness function is the dominant aspect of this research, as everything that has been developed is used to create a fitness function that introduces a human element into the GA. Algorithm 4 shows the pseudocode for part of the HumanFitness class, which calculates a human fitness for an input level; the important function being calc_fitness(). One other function, set_model(), is not shown, but it simply updates the model variable of the class to the model created from the population data. The final_answer variable is what is passed back into the GA, and is the human fitness objective the GA is required to optimise. As mentioned in Section 3.3.3, lines 6 and 7 show how clamping is performed if a negative TrueSkill score is predicted for the input level. Algorithm 5 is called within Algorithm 4 and is how the SVR model actually predicts a TrueSkill score for an unseen level. To do this, it first extracts the feature vector for the input level and then uses the trained SVR model to predict a TrueSkill score. For the case when a model has not yet been created, lines 1 and 2 in Algorithm 5, the function outputs a random score between 0 and 1 so that the initial population levels are randomised. Without this step, each level in the initial population would be exactly the same, which negatively affects every step in the PCG loop, Figure 3.1. It is worth noting that optimising trueskill_level() provides the largest increase in the GA's speed, as it is a step which is called many times within the GA.
Algorithm 4 calc_fitness() ← function in HumanFitness class
Require: levels ← List of Lists of Levels
1: final_answer ← []
2: for level_group in levels do
3:   total ← 0
4:   for level in level_group do
5:     trueskill_for_level = trueskill_level(level)
6:     if trueskill_for_level < 0 then
7:       trueskill_for_level = 0 ▷ Clamping for negative predictions
8:     end if
9:     total += trueskill_for_level
10:   end for
11:   Append total/len(level_group) to final_answer
12: end for
13: Return final_answer

Algorithm 5 trueskill_level()
Require: level ← An input Level
1: if model is None then
2:   Return random uniform value between 0 and 1
3: end if
4: feat_vect ← generate_features(level) ▷ Generate feature vector for level
5: ts_score ← model.predict(feat_vect) ▷ Predict TrueSkill score from feature vector
6: Return ts_score ▷ Predicted TrueSkill score

This chapter discussed the various steps and design choices implemented in order for the goals of this research to be achieved. Chapter 4 explores the experiments conducted, along with the results gathered using the approach discussed, to assist in proving the hypothesis of this research.

Chapter 4 Results and Analysis

This chapter breaks down the details of each of the experiments conducted using the proposed method outlined in Chapter 3, along with the results obtained. The results are analysed and discussed.

Figure 4.1: Example of a Super Mario Bros level of size (114 × 14)

Figure 4.2: Example of a Maze level of size (14 × 14)

The experiments were conducted on two different 2D platformer video games, Super Mario Bros, shown in Figure 4.1, and Maze, shown in Figure 4.2. For each game, baselines were generated in order to show the effectiveness of each fitness function on the GA. The parameters used for the experiments are discussed in their respective games' sections.
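Algorithms 4 and 5 can be sketched in Python as follows. The SVR model interface and the feature extractor are stand-ins; any object with a scikit-learn style `predict()` would do, and `generate_features()` here is a placeholder, not the bovw pipeline:

```python
import random

class HumanFitness:
    """Sketch of the HumanFitness class from Algorithms 4 and 5."""

    def __init__(self):
        self.model = None

    def set_model(self, model):
        self.model = model

    def trueskill_level(self, level):
        if self.model is None:
            # No model yet: a random score randomises the initial population.
            return random.uniform(0, 1)
        feat_vect = self.generate_features(level)
        return self.model.predict([feat_vect])[0]

    def generate_features(self, level):
        # Placeholder feature extractor: flatten the level grid.
        return [tile for row in level for tile in row]

    def calc_fitness(self, levels):
        final_answer = []
        for level_group in levels:
            total = 0.0
            for level in level_group:
                ts = self.trueskill_level(level)
                if ts < 0:
                    ts = 0  # clamping for negative predictions
                total += ts
            final_answer.append(total / len(level_group))
        return final_answer
```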
4.1 Hyperparameter Optimisation

As discussed previously, there are parameters which can be optimised at various steps in the PCG loop. For the fitness functions, a weighted sum of fitnesses is used in order to capture multiple objectives, as is done in prior work by Beukman et al. [2022a]. Each individual weight determines how much influence the corresponding fitness function has on the GA. This means that fitnesses with low weights will still be considered by the GA, but with reduced effect compared to fitness functions with high weights. Human designers want levels that: (a) are solvable, as unsolvable levels are not very useful, and (b) contain human-preferred features. This makes selecting the weights an important optimisation step for the fitness function. Equation 4.1 shows how the weighted sum is used in the overall fitness function, where α is the solvability weight and β is the human fitness weight.

total fitness = α × solvability fitness + β × human fitness    (4.1)

Empirically, it was found that the value of the weights caused the GA to prioritise one fitness function over the other. When β > 0.39, too much emphasis is placed on the human fitness, causing the solvability fitness to be negatively affected. In order to ensure solvability was weighted highly by the GA, α = 10 and β = 0.39 were chosen. In terms of the SVR and its performance, as discussed in Section 2.3.2, a grid search is used to optimise the parameters. The parameter grid used in the experiments is shown in Table 4.1. In order to verify that the selected parameters were optimal, a cross validation of size 5 is performed over the parameter grid. Once the optimal parameter values are determined from the grid search, the SVR model is refitted to the training data with the newly optimised parameter values.
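Equation 4.1 with the empirically chosen weights is then simply:

```python
def total_fitness(solvability, human, alpha=10.0, beta=0.39):
    """Weighted-sum fitness of Equation 4.1, with the weights chosen
    empirically in the text (alpha = 10, beta = 0.39)."""
    return alpha * solvability + beta * human

# With these weights, an unsolvable level needs a predicted TrueSkill
# above 10 / 0.39 (about 25.6) before it can overtake a solvable level
# that scores zero human fitness.
solvable_plain = total_fitness(1, 0.0)      # 10.0
unsolvable_pretty = total_fitness(0, 20.0)  # 7.8
```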
This optimised model is what is passed back into the human fitness function of the GA, and can vary between experiments due to the nature and size of the training data. An example of the performance of the SVR model before and after the optimisation step is shown in Section 4.2.3. The hyperparameter optimisation is a key step in the process, as the SVR exhibits poor predictive capabilities when default values are used.

Table 4.1: Grid search ranges for each parameter

Parameter | Grid Search Range
C         | {0.001, 0.01, 0.1, 1, 10, 100}
γ         | {auto, scale}
kernel    | {linear, rbf}
ϵ         | {0.1}

The default TrueSkill parameter values were used for µ and σ, so that the Gaussian distribution can be described as N(25, (25/3)²). Because the pairwise comparison tournament style is round robin, and the lowest population size in the experiments discussed is 20, enough games are played to ensure the correct skill is determined for each level, as discussed in Section 3.3.3.

4.2 Maze

Maze is a 2D platformer based game made up of two tile colours, black and white. The black tiles are considered 'walls' whilst the white tiles are considered 'empty'. The goal of the game is for the player to start on the top left white tile, coordinates (0, 0), and find a path along the white tiles to reach the end of the Maze, located at the bottom right white tile, coordinates (14, 14). The player may only move in the four cardinal directions. A Maze level is considered solvable if a connected path of white tiles exists between the starting tile and the end tile. A simple example of a 14 × 14 Maze is shown in Figure 4.3. A Maze can be scaled to any N × N size; however, in each of the experiments discussed, a Maze size of 14 × 14 is used, as it provides a sufficiently large Maze for which the experiments could be conducted at reasonable speeds whilst testing the proposed PCG loop thoroughly.
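A sketch of the grid search with scikit-learn's GridSearchCV, using the exact grid of Table 4.1 and synthetic data in place of the real feature vectors and TrueSkill scores:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# The parameter grid from Table 4.1.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": ["auto", "scale"],
    "kernel": ["linear", "rbf"],
    "epsilon": [0.1],
}

# Toy data standing in for (feature vector, TrueSkill score) pairs.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(60, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=60)

# Grid search with 5-fold cross validation; GridSearchCV refits the
# best estimator on all of the training data afterwards.
search = GridSearchCV(SVR(), param_grid, cv=5)
search.fit(X, y)
optimised_model = search.best_estimator_
```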
A reason for choosing Maze is that it has been successfully used with genetic algorithms in prior work [Beukman et al. 2022a;b].

Figure 4.3: A randomly generated Maze level, of size 14 × 14, followed by the same level overlaid with detected SIFT feature keypoints. Circle size indicates the scale of the keypoint and the lines indicate the keypoint orientation.

During initial experimentation for Maze, each of the feature detection techniques was used; however, the SVR did not predict unseen levels sufficiently well, as was expected. It was determined that, because Maze contains only black and white tiles, the feature detection methods, in which a feature vector of size 300 was created, were unable to extract enough feature information, which in turn caused the SVR to perform poorly. As an example, Figure 4.3 shows the features (red circles) detected from a Maze level when SIFT is used. As can be seen in Figure 4.3, very few features are detected, and those that are lie in arbitrary locations. Due to the small size of the Maze levels, LBP also did not perform well, as the amount of feature data was too small to train an accurate SVR model. In order to improve the SVR performance, each Maze level was converted into a binary grid of size 14 × 14, shown in Table 4.2, and flattened into a feature vector of size 1 × 196. The representation of the Maze level in Table 4.2 is the binary version of the level shown in Figure 4.3.

Table 4.2: The binary version of the Maze level shown in Figure 4.3

1 1 1 0 1 1 1 1 1 0 0 1 1 0
1 1 1 1 0 0 0 0 0 1 1 1 1 0
0 0 0 1 1 1 1 0 1 1 1 1 0 0
1 0 1 0 0 0 1 1 1 0 1 0 1 1
0 0 0 0 1 1 1 0 0 0 1 1 1 0
0 0 1 1 0 1 1 0 1 0 1 0 1 0
0 0 0 1 1 1 0 1 1 0 1 0 1 0
1 1 1 1 0 0 1 0 1 0 0 1 1 1
0 0 1 1 1 1 1 0 1 0 1 1 0 1
1 1 0 1 1 1 0 0 1 0 1 1 0 0
0 1 1 0 1 0 1 0 1 0 1 1 0 0
1 0 1 1 1 0 1 1 1 1 0 1 0 1
1 0 0 0 0 1 0 1 1 0 1 1 0 1
0 0 0 0 0 0 1 0 1 0 0 1 1 1
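This conversion is straightforward to sketch; the encoding of 1 for a wall tile is an assumption of this sketch:

```python
import numpy as np

def maze_feature_vector(level):
    """Flatten a 14 x 14 binary Maze grid into the 1 x 196 feature
    vector used to train the SVR (assumes 1 = wall, 0 = empty)."""
    grid = np.asarray(level, dtype=float)
    assert grid.shape == (14, 14)
    return grid.reshape(1, 196)

# Example: an all-empty maze yields a zero vector of length 196.
vector = maze_feature_vector(np.zeros((14, 14)))
```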
Using the actual level data, rather than feature extraction techniques, proved far more effective and is used throughout the Maze experiments. As stated previously, virtual humans are used to specify which aesthetic features to prefer. For Maze, the virtual human is specified to prefer levels that contain black pixels on the border. One of the motivations for this preference is that in prior work [Beukman et al. 2022a], solvability in Maze levels was often the result of a pathway existing around the border of the level, which caused the level to be solved without ever entering the centre of the Maze. This does not result in a challenging level; thus, having a preference that forces the pathway into the centre of the Maze results in a better overall level.

4.2.1 Baseline Human Fitness

Each of the experiments run in this section uses only the human fitness function. This is achieved by setting α = 0, so that solvability is not taken into account by the GA. Figure 4.4 shows the results obtained from experiments in which a fixed number of 100 GA iterations was used for a range of different initial population sizes. For each population size, 5 different seeds were run, and the averages over the seeds were used to calculate the average human fitness value, shown in orange, and the average number of black border pixels, shown in green. The standard deviation of each value is shown as black lines. It is evident that as the population size increases, so does the average human fitness, along with the number of black border pixels. For low population sizes, the number of black border pixels is at its lowest; this is due to the small amount of training data on which the SVR is trained. As the initial population size increases, so does the amount of training data for the SVR, which in turn increases the number of black border pixels and the human fitness, as the SVR performs better when trained with more data.
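The virtual human preference described above can be sketched as a border-counting function. Assuming 1 encodes a black tile; note that the border totals of 56 quoted in Section 4.2.3 correspond to a grid indexed from (0, 0) to (14, 14), i.e. 15 × 15, whose perimeter contains 4N − 4 = 56 tiles:

```python
import numpy as np

def black_border_pixels(level):
    """Virtual-human preference score: the number of black (wall)
    tiles on the outer border of the level (assumes 1 = black)."""
    grid = np.asarray(level)
    border = np.concatenate([
        grid[0, :], grid[-1, :],       # top and bottom rows
        grid[1:-1, 0], grid[1:-1, -1]  # side columns, minus corners
    ])
    return int(border.sum())

# An all-black N x N level has 4N - 4 border tiles: 56 when N = 15.
full_border = black_border_pixels(np.ones((15, 15)))
```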
The number of black border pixels does, however, level off at a stable value of around 40 for population sizes between 200 and 500, whilst the human fitness continues to grow. This is because the SVR performs well given the large amount of training data, but with the low number of generations the GA is not able to significantly increase the number of black border pixels. It was found, as discussed in Section 4.2.3, that increasing the number of GA iterations allowed the human fitness function to be optimised further, which increased the number of black border pixels.

Figure 4.4: Graph showing the comparison between low and high initial population sizes and their effect on the average human fitness and average number of black border pixels of the best generated level for 100 iterations; black lines indicate standard deviation.

4.2.2 Baseline Solvability Fitness

In this experiment, only solvability is used as the fitness for the GA to optimise. The experiment compared the average solvability to the number of GA iterations run, and the effect that optimising only solvability has on the average number of black border pixels. Figure 4.5 shows the results obtained when a varying number of GA iterations is used. For each iteration count, 5 seeds were run and averaged to determine how the number of iterations affects level solvability, shown in blue, along with the number of black border pixels, shown in green. Since human fitness is not being optimised, the average number of black border pixels is expected not to change significantly, which is the case in the results obtained. It is clear from Figure 4.5 that as the number of iterations increases, so does the average solvability, whilst its standard deviation decreases. This shows that for higher iteration counts, the GA is able to generate more solvable levels than unsolvable levels. For 20 iterations, only around 36% of the levels generated were solvable.
Looking at the number of black border pixels, the value remains relatively low, with a maximum of 26 for runs in which the majority of levels were not solvable. This indicates that in most cases a level is solvable when the number of black border pixels is lower than 46% of the total border pixels.

Figure 4.5: Graph showing the comparison between low and high GA iterations and their effect on the average solvability fitness and average number of black border pixels for the best generated level; black lines indicate standard deviation.

4.2.3 Maze Results

The majority of experiments conducted were on the combined fitness function for solvability and human fitness, as this is the main focus of the research. For each of the experiments, a solvability fitness weight of α = 10 is used whilst a human fitness weight of β = 0.39 is used. Figure 4.6 shows three different Maze levels, each generated in a slightly different manner. The first level in the figure, (a), was generated using only the solvability fitness; it can be seen that the level is solvable, but contains only 25 black border tiles out of a total of 56. The second level, (b), was generated using only the human fitness, which is why the level is not solvable. It contains far more black border pixels, 40, which is 71.43% of the total border pixels, showing that the human fitness by itself does bias the generated levels towards more black border pixels. The last level, (c), was generated using both the solvability fitness and the human fitness. The level is both solvable and contains more black border tiles, 32 out of 56, compared to the level shown in (a), which focused only on solvability. This brief example is an indicator that the combined fitness does in fact allow for a solvable level whilst having a bias towards the preferred human features in a level.
This section presents the results obtained from experiments conducted when both the solvability fitness and the human fitness are optimised by the GA.

Figure 4.6: (a) Maze level generated using only the solvability fitness function (25 black, 31 white). (b) Maze level generated using only the human fitness function (40 black, 16 white). (c) Maze level generated using both solvability and human fitness functions (32 black, 24 white).

An important point to note about Maze levels and the virtual human preference for black border tiles is that a solvable Maze level requires the first tile, coordinates (0, 0), and the last tile, coordinates (14, 14), to be white. This creates a contradiction: these tiles need to be white in order for a level to be solvable, but the virtual human preference would rather they be black. Due to this contradiction and the need for solvable levels, a design choice was made to ensure that the top left tiles, coordinates (0, 0), (0, 1), (1, 0), and the bottom right tiles, coordinates (14, 14), (13, 14), (14, 13), are always white. This requirement always provides the Maze with a start and end point, as well as a way to enter the centre of the Maze, given that movement is restricted to the cardinal directions. This change does negatively affect the total number of black border pixels; however, it was considered a tradeoff worth making in order to improve the solvability of the levels being generated.

Maze Population Sizes

The experiments discussed in this section were conducted in order to determine what effect the initial GA population size has on the predictive ability of the trained SVR model on unseen levels. Since better prediction should result in a better human fitness function, this experiment also determines whether higher initial population sizes increase the average number of black border pixels in subsequent levels generated by the GA.
Each of the subplots in Figure 4.7 details the effect the population size has on the predictive ability of the SVR model. Subplots (a, c, e, g, i) show the SVR model before the grid search has been used to optimise it, whilst subplots (b, d, f, h, j) show the SVR model once it has been optimised with grid search and a cross validation size of 5. The plots were generated using a train-test split with a test ratio of 0.3. As stated in Section 3.3.5, the train-test split is only used for better understanding of the model, and no splitting of the data is done when running the PCG loop as a whole. Across the subplots in Figure 4.7 the population size ranges from 20 to 1000, and it is clear that a low number of population levels does not provide enough training data for the SVR to accurately predict the TrueSkill of unseen data. For large population sizes, such as 500 and 1000, shown in subplots (h) and (j) respectively, the SVR model performs very well, showing that more training data significantly improves the SVR model. Each experiment is run for 100 GA iterations with a population size ranging from 20 up to 1000. The results from the experiments are averaged over 5 seeds, and the standard deviation between the seeds is shown as the shaded area in each graph. The value of 100 was chosen for the GA iterations because the generated levels require a minimum of 100 iterations in order to achieve 90% solvability, as shown in Figure 4.5. In each graph, the baseline solvability and baseline number of black border pixels are shown by dotted lines and are used to indicate an upper bound. This provides a comparison between when the GA optimises a single fitness function and when it optimises multiple fitness functions, as the single-objective baseline is the maximum possible value when multiple fitness functions are optimised.
In each of the figures, we are most concerned with the green line (number of black border pixels), the orange line (human fitness), and the dark blue line (solvability fitness). As expected, the average number of black border pixels is less than the baseline; however, solvability is reached in each case, making the slight decrease in average black border pixels worth the levels being solvable.

Figure 4.7: SVR model prediction before optimisation (a, c, e, g, i) and after optimisation (b, d, f, h, j), for initial population sizes of 20, 100, 200, 500 and 1000.

Figure 4.8 shows how the GA performs with a population size of 20. It can be seen that although the solvability of each generated level is constantly at a value of 1, the human fitness and number of black border pixels fluctuate at low values. The average number of black border pixels is around 23, which is only 41% of the potential border pixels. This low number of black border pixels is caused by the SVR not receiving enough training data due to the low population size, an example of which is shown in subplot (b) of Figure 4.7, which causes the prediction of the TrueSkill scores to be inaccurate.

Figure 4.8: Figure showing solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 20; baseline plots are indicated by dotted lines; shading indicates standard deviation.

By increasing the population size to 100, as shown in Figure 4.9, there is an immediate increase in both human fitness and the average number of black border pixels. Along with this increase, the fluctuations become less frequent, showing that the GA is slowly biasing towards levels with black border pixels. For a population size of 200, the solvability fluctuates away from 1 for the first few iterations, as shown in Figure 4.10.
The number of black border pixels and human fitness are, however, higher at a lower number of iterations than in the previous results with smaller populations, with the average number of black border pixels around 31. These values continue to increase as the number of iterations increases. Figure 4.11 shows that larger initial population sizes result in an increase in average black border pixels at lower iterations, with an average of 33. At this population size, however, the solvability is negatively affected, as a stable average of 1 is only achieved at around 38 iterations.

Figure 4.9: Figure showing solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 100; baseline plots are indicated by dotted lines; shading indicates standard deviation.

Figure 4.10: Figure showing solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 200; baseline plots are indicated by dotted lines; shading indicates standard deviation.

Figure 4.11: Figure showing solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 500; baseline plots are indicated by dotted lines; shading indicates standard deviation.

The last experiment for population sizes is shown in Figure 4.12. This is the worst performing result in terms of solvability, as the solvability fitness is unstable, indicating that there are in fact diminishing returns for very large population sizes. The number of black border pixels has the highest average of all the experiments, at around 34, which is slightly higher than the average achieved with a population size of 500.
The standard deviation in the number of black border pixels is the lowest of all the experiments, indicating that population sizes larger than 1000 for 100 GA iterations will not significantly increase the number of black border pixels.

In conclusion, from the results of the experiments shown in Figures 4.8 to 4.12 and the SVR model results shown in Figure 4.7, we can determine that initial population sizes between 200 and 500 perform best for generating levels that are biased towards the virtual human preference while maintaining the solvability of the levels. For population sizes below 200, the SVR model does not perform well enough to make accurate predictions, which results in the human fitness being optimised towards the wrong features. Although the SVR model performs very well for population sizes larger than 500, the bias towards the virtual human preference becomes so strong that it negatively affects the solvability of the levels. Very large population sizes thus yield diminishing returns.

Figure 4.12: Solvability fitness, human fitness and number of black border pixels averaged over 5 different seeds for 100 GA iterations with an initial population size of 1000; baseline plots are indicated by dotted lines; shading indicates standard deviation.

Maze Iteration Comparisons

In the previous experiment, we investigated the effect of varying the initial population size. Now we look at how the number of GA iterations affects the number of black border pixels for an initial population size of 200. This population size was chosen because it performed equivalently, in terms of the average number of black border pixels, to population sizes of 500 and 1000 when only the human fitness function was investigated, as shown in Figure 4.4. A population of size 200 also provides enough training data for the SVR model to perform well, as shown in Figure 4.10 and subplot (f) of Figure 4.7.
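The tension described above, where optimising the human fitness too aggressively harms solvability, arises from how the two objectives are balanced during selection. As one simple illustration (the report uses a multiobjective GA; the lexicographic scheme and function names below are our own simplification, not the report's actual selection operator), a selection step over both objectives might look like this:

```python
# Evaluate a GA population on two objectives: solvability (1 if the level
# has a valid path from start to goal, else 0) and the SVR-predicted
# human fitness. Both predicates are passed in as callables so the sketch
# stays independent of any particular level representation.
def evaluate(population, is_solvable, predict_human_fitness):
    scores = []
    for level in population:
        solvability = 1.0 if is_solvable(level) else 0.0
        human = predict_human_fitness(level)
        scores.append((solvability, human))
    return scores

def select_parents(population, scores, k):
    # Lexicographic ordering: solvability dominates, with ties broken by
    # predicted human fitness. This is only one of several ways a
    # multiobjective GA could rank candidates.
    order = sorted(range(len(population)),
                   key=lambda i: scores[i], reverse=True)
    return [population[i] for i in order[:k]]
```

Under a scheme like this, a strongly biased human-fitness model can still crowd the population with near-solvable but aesthetically preferred levels, which is consistent with the unstable solvability seen at very large population sizes.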
Looking at the results shown in Figure 4.13, for a fixed population size, an increase in GA iterations increases the average number of black border pixels for the generated levels. As the GA runs for more iterations, it is given more opportunities to