University of the Witwatersrand School of Computer Science and Applied Mathematics Faculty of Science Learning Factored Skill Representations for Improved Transfer Matthew Cockcroft (1415624) Supervisors: Prof Benjamin Rosman Dr Steven James A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science. August 2022, Johannesburg Abstract The ability to reuse skills gained from previously solved tasks is essential to building agents that can solve more complex, unseen tasks. Typically, skills are specific to the initial task they were learned to solve, and it remains a challenge to determine which features two tasks must share in order for a skill to apply to both. Current approaches have shown that learning generalised skill representations allows for the skills to be successfully transferred and reused in accelerating learning across multiple similar new tasks. However, these approaches require large amounts of domain knowledge and handcrafting to learn the correct representations. We propose a novel framework for autonomously learning factored skill representations, which consist only of the variables which are relevant to executing each particular skill. We show that our learned factored skills significantly outperform traditional unfactored skills, and match the performance of other methods, without requiring the prior expert knowledge that those methods do. We also display our framework's applicability to real-world settings by demonstrating its ability to scale to a realistic simulated kitchen environment. Declaration University of the Witwatersrand, Johannesburg School of Computer Science and Applied Mathematics Senate Plagiarism Policy I, Matthew Cockcroft, (Student number: 1415624) am a student registered for COMS8003A - Masters by Dissertation in the year 2022. I hereby declare the following:
• I am aware that plagiarism (the use of someone else's work without their permission and/or without acknowledging the original source) is wrong.
• I confirm that ALL the work submitted for assessment for the above course is my own unaided work except where I have explicitly indicated otherwise.
• I have followed the required conventions in referencing the thoughts and ideas of others.
• I understand that the University of the Witwatersrand may take disciplinary action against me if there is a belief that this is not my own unaided work or that I have failed to acknowledge the source of the ideas or words in my writing.
Signature: Signed on the 9th day of August, 2022 in Johannesburg. Acknowledgements First and foremost, thank you to Benji and Steve for all the guidance, support and understanding that you have shown me over the last two years. Knowing that you would be there to meet with me and discuss my progress each week without fail has been a driving factor in providing me with the motivation to persist through this dissertation. An additional thank you to Steve for assisting me with CHPC compute, so that I no longer had to worry whenever the cluster was down (which was often). To Pravesh, thank you for getting me interested in reinforcement learning in the first place and helping me to formulate the initial idea upon which this research was based. To my friends and family, thank you for putting up with all my whinging and moaning, whether it be research-related or not.
I am greatly indebted to all of you for the time that you have dedicated to me. Last, but not least, a big thank you to my parents for the patience that you have shown me and for always being there to encourage me whenever I was feeling downhearted. A particularly large thank you to my mom for taking the time to wade through all the technical jargon while proof reading this dissertation. This research has been supported in part by the National Research Foundation of South Africa (Grant Number: 17808). Opinions expressed and conclusions arrived at, are those of the author and are not necessarily to be attributed to the NRF. iii Contents Abstract i Declaration ii Acknowledgements iii 1 Introduction 1 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background and Related Work 4 2.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.1 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Factored Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Object-Oriented MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 Factored MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Decision Tree Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 A Framework For Learning Factored Skills 15 3.1 Learning Factored Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1.1 Minigrid Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 Refining Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3 Variable Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 iv 3.4 Evaluation of Transferability of Factored Options . . . . . . . . . . . . . . . . . . 23 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4 Experimentation 25 4.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 Baseline Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.3 Initial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4 Locally Observable Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4.1 Variable Groupings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.5 Redundant and Noisy Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.6 AI2-THOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5 Conclusion 44 5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
45 Appendices 49 A Decision Tree Models 50 B Grid Size Experiments 52 C Redundant and Noisy Domain Experiments 53 D Hardware Specifications 55 v Chapter 1 Introduction The ability to leverage previously gained knowledge is crucial in accelerating the learning of new tasks when designing intelligent agents [Taylor and Stone 2009]. However, one of the main difficulties in this approach lies in learning initial knowledge which will be applicable to future tasks. If the knowledge is too specific, then it will very rarely be useful for learning new tasks, but if the knowledge is too general, then the performance gain from this knowledge will be minimal in the current task. Consequently, it would be desirable for an agent to learn knowledge in a representation that can generalise across multiple future tasks, while still being specific enough so as to provide a significant acceleration in learning those tasks. We focus on the skills learned when solving previous tasks, which can be thought of as the knowledge that we wish to reuse when attempting to solve new tasks. The options framework [Sutton et al. 1999] has shown to be successful in constructing reinforcement learning agents which are capable of learning such skills. These skills are represented as action hierarchies, which consist of primitive action sequences, which are chained together to form higher-order actions or options. The effectiveness of these skills in new environments depends on the similarity between the environment in which the skill was learned and the environment in which the skill is being reused. However, if agents can learn more general representations of these skills, then the required similarity between environments is reduced and these skills become applicable to a greater number of tasks. We can take inspiration from how humans are able to focus on a specific task and what is required to solve it, while ignoring any extraneous information presented to them. For example, consider an agent working in a kitchen environment, as shown in Figure 1.1. The agent has sensors which provide it with various information, including the location of objects, such as the frying pan, stove, sink and an apple. If the agent is given the task to cook an egg, it should be interested in the location of the stove and the pan, but should be able to ignore the data regarding the sink and apple. Once the agent has learned the skill of cooking eggs, it should then be able to effectively reuse that skill in a new kitchen environment, even if some unrelated items are no longer present or if the room colours and lighting varies from the original environment. If we consider all of the sensory information available to the agent as the global set, then we wish to learn a factored representation of this set, which would be a subset containing only infor- mation which is relevant to the task which is being learned. The two most common approaches in reinforcement learning for learning such representations are Factored Markov Decision Pro- cesses [Boutilier et al. 2000] and Object-Oriented Markov Decision Processes [Diuk et al. 2008]. While both of these frameworks are effective at modeling factored domain representations, they require knowledge of the structure of the environments to be known beforehand and rely heav- ily on handcrafted features which need to be manually changed each time the environment is updated. This process does not scale well, as it becomes significantly more time consuming as the domain complexity increases. 
1 As a result, the aim of our work is to build a framework which would allow for agents to autonomously learn factored representations, without requiring prior domain knowledge and expert handcrafting. Such a framework should be able to learn these representations quickly, from few training samples, while the representations themselves should allow for the agent’s skills to generalise across various similar tasks. This will ultimately lead to agents that can effectively leverage the skills learned when solving one task to accelerate learning of new unseen tasks, and successfully scale to more complex environments without requiring prior domain knowledge. Figure 1.1: Example of a simulated kitchen environment. The different coloured blocks indicate objects within the environment for which the agent has sensory data, but not all of these objects may be important for the current task being carried out. 1.1 Contributions To tackle the issue of laborious handcrafting of features and domain structures, we propose a framework which is able to learn factored skill representations autonomously. We extend the options framework, by introducing factored options which are represented in a way that improves generalisability and transfer across domains. Our approach learns these factored option representations using a continuous refinement process, in which the factoring is updated after each new training domain presented to it. We show that the framework is able to converge to a learned factoring quickly, without having to interact with an excessive number of training domains. We present empirical results which show that our learned factored options consistently out- perform traditional options when evaluated on their ability to accelerate learning in new do- mains. We also compare our framework to baselines that require extensive human expertise and handcrafted domain knowledge. Our method is fully autonomous, and yet we demonstrate that it is able to match the performance of those baseline methods. 2 We also demonstrate that our agent is able to adapt to domains containing several noisy variables, similar to what is likely to be encountered in real-world settings. We show that in such domains our framework is able to significantly outperform other baseline methods, despite all of the prior expert knowledge they have been provided with. Finally, we also showcase our framework’s ability to scale to a larger, realistic simulated kitchen environment, in which our agent is able to learn factored representations of cooking-related skills, that allows it to successfully transfer the skills to new kitchen configurations. 1.2 Structure The structure of this dissertation is organised as follows: in Chapter 2, we detail the concepts, notations and definitions upon which our work is built, as well as providing an overview of the current work in this field. In Chapter 3, we discuss the methodology of this work and elaborate on the key contributions, including how we create a framework for factored skill representations and how we intend to utilise it to accelerate learning. Then, in Chapter 4 we detail the experiments which we use to evaluate the performance of our framework compared to other methods, elaborating on the experiment set-up and analysing the results from each experiment. Finally, Chapter 5 summarises the contributions of this work and looks at future avenues upon which it may be extended. 
Chapter 2 Background and Related Work In this chapter, we expand upon the idea of learning generalisable skills in reinforcement learning and introduce the fundamental concepts required for implementing these ideas. We provide a thorough explanation of reinforcement learning in Section 2.1, and then expand upon the field of hierarchical reinforcement learning and how skills can be incorporated into reinforcement learning domains in Section 2.2. We then detail the concept of factored representations and their benefits with regard to generalisation in Section 2.3, specifically focusing on two methods commonly used in conjunction with hierarchical reinforcement learning, namely Factored Markov Decision Processes and Object-Oriented Markov Decision Processes. Finally, in Section 2.5, we discuss relevant methods and evaluation criteria for transfer learning. 2.1 Reinforcement Learning Reinforcement learning is a trial-and-error based learning paradigm in which a decision-making entity, known as an agent, performs a set of actions in an attempt to maximise its numerical reward. The agent repeatedly performs actions, receiving a reward after each action and transitioning to a new state, until a terminal state is reached. A state s refers to all of the information used to describe the environment at a particular timestep, some of which may not be known to the agent. Typically, reinforcement learning methods focus on problems that preserve the Markov property, where all relevant information required to determine the agent's transition after performing an action is included in the state data it receives. More formally, the probability of transitioning from state s to state s′ is only dependent on state s and the action performed to cause the transition, rather than depending on any of the preceding states. Sutton and Barto [1998] describe how problems such as these can be modelled using a Markov Decision Process (MDP): An MDP is a tuple defined as M = (S, A, P, R, γ), whereby:
• S is the set of all possible states of the environment
• A is the set of actions available to the agent
• P(s′|s, a) is the transition probability function, which describes the probability that taking action a in state s will result in a transition to state s′
• R(s, a, s′) is the reward function, which is the reward obtained when transitioning to state s′ by taking action a in state s
• γ ∈ [0, 1] is the discount factor, which determines how much future rewards are valued
The goal of reinforcement learning is to learn an action policy which maximises the cumulative expected reward of the agent. An action policy π(a|s) gives the probability of an agent choosing action a when in state s. To learn the optimal policy π∗, the notion of state values is required. The state value V π(s) is the expected return from following policy π when starting in state s. To solve for the optimal policy, the value of state s can be defined in a recursive form using the Bellman equation, where V ∗(s) = V π∗(s) is the optimal value function and π∗ is the optimal policy:
V ∗(s) = max_a ∑_{s′} P(s′|s, a) [R(s, a, s′) + γV ∗(s′)]   (2.1)
This optimal value function can then be solved using dynamic programming techniques such as value iteration [Bellman 1957]. Value iteration works by first arbitrarily initialising the values of V, and then sweeping through all state values and updating them using the Bellman equation (2.1).
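To make this procedure concrete, a minimal tabular value iteration sketch in Python is given below. The array layout for P and R and the helper name are illustrative assumptions, not part of the framework developed in later chapters.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration using the Bellman optimality backup (2.1).
    P[a][s, s'] holds P(s'|s, a) and R[a][s, s'] holds R(s, a, s')."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)                                # arbitrary initialisation
    while True:
        # one-step lookahead for every state-action pair
        Q = np.array([(P[a] * (R[a] + gamma * V)).sum(axis=1) for a in range(n_actions)])
        V_new = Q.max(axis=0)                             # greedy Bellman backup over actions
        if np.abs(V_new - V).max() < tol:                 # stop once a full sweep barely changes V
            return V_new, Q.argmax(axis=0)                # optimal values and a greedy policy
        V = V_new
For a small gridworld, P and R can be supplied as lists of |S| × |S| arrays, one per action.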
2.1.1 Q-Learning The issue with value iteration is that each state-value update requires a one-step lookahead, which becomes infeasible for large state spaces. An alternative is to make use of Q-values, which map state-action pairs to expected returns and are represented by Qπ(s, a). They represent the expected cumulative reward obtained when performing action a in state s, and thereafter following action policy π(a|s) until reaching a terminal state. Q-Learning [Watkins and Dayan 1992] is an algorithm which is used to iteratively learn an MDP's optimal Q-value function. The Q-value function is defined as Q∗(s, a) = R(s, a, s′) + γV ∗(s′), where 0 < γ < 1 is the discount factor, and V ∗ represents the expected cumulative reward for the optimal policy π∗. Before learning Q, all values are first initialised to an arbitrary fixed value. Then at a given time-step t, the agent in state s executes action a and transitions to state s′, receiving reward r. The update rule for the approximation of Q, which is denoted as Q̂, can then be calculated as:
Q̂(s, a) ← Q̂(s, a) + α[R(s, a, s′) + γ max_{a′} Q̂(s′, a′) − Q̂(s, a)]   (2.2)
where α is the learning rate. The agent is now able to learn this Q-value function by interacting directly with the environment, rather than having to iterate through all possible states and actions [Watkins and Dayan 1992]. Once the Q-value function has been obtained, it is then trivial to solve for the optimal policy π∗ by acting greedily with respect to Q∗(s, a). 2.2 Hierarchical Reinforcement Learning The use of single-step actions in a standard MDP, known as primitive actions, is typically adequate for solving reinforcement learning problems in which there are abundant rewards. Often, though, the reward is sparse and is only received upon completion of the task. Such problems require long sequences of correct actions during exploration before a reward is received, which is infeasible for larger state spaces. Hierarchical reinforcement learning is a form of action abstraction which combats this problem by chaining sequences of low-level actions into high-level skills that can be invoked as single units, allowing the agent to take greater and more meaningful steps through the state space. For example, consider again the kitchen agent discussed in Chapter 1. Given the task of cooking eggs, we may decompose this task into a set of smaller sub-tasks, such as walking to the eggs, picking the eggs up and placing the eggs in the pan. To solve one of these sub-tasks, such as fetching the eggs, we may chain all of the actions required into one temporally-extended action. Once the agent has fetched the eggs it will need to decide on the next action or skill to execute, but while it is executing the fetch-eggs skill it does not need to think about each individual action taking it towards the fridge. If the agent only receives a reward once the entire task is complete and we are operating with low-level actions such as motor torques to move the agent, then it will take thousands of correct actions before a reward is received, which is not feasible. In contrast, with higher-level skills and task decomposition, the agent only needs to make three correct decisions before a reward is received. A common approach to task decomposition is through the use of the options framework [Sutton et al. 1999], which provides a formal definition of a skill in a reinforcement learning setting. Options make use of the original low-level actions to construct a single temporally-extended action.
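As a rough illustration of how such a temporally-extended action is executed, consider the sketch below; the option object with policy, beta and initiation_set attributes, and the env.step interface, are assumed purely for this example and are not the implementation used later in this work.
import random

def execute_option(env, state, option, gamma=0.99):
    """Run an option's low-level policy until its termination condition fires,
    returning the resulting state, the discounted reward accumulated while the
    option ran, and the number of primitive steps taken."""
    assert state in option.initiation_set        # the option may only start in its initiation set
    total_reward, discount, steps = 0.0, 1.0, 0
    while True:
        action = option.policy(state)            # low-level action chosen by the option's policy
        state, reward, done = env.step(action)
        total_reward += discount * reward
        discount *= gamma
        steps += 1
        # option.beta(state) gives the probability of terminating in the current state
        if done or random.random() < option.beta(state):
            return state, total_reward, steps
The number of steps is returned because, as formalised next, the transition and reward models over options depend on how long an option takes to execute.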
An option o is formally defined by the tuple (Io, πo, βo), where Io ⊆ S is an initiation set that specifies all the states from which option o can be initiated, πo : S × A → [0, 1] is a policy which provides the probabilities of option o executing each action in each state, and βo : S → [0, 1] is the termination condition, which gives the probability of the option terminating in each particular state. The learned set of options can then be augmented with the original primitive actions to form a Semi-Markov Decision Process (SMDP) [Sutton et al. 1999]. All transitions and rewards in an SMDP are dependent on the time taken for an option to execute, which can take t > 1 timesteps. An SMDP is a tuple defined as M = (S, O, P′, R′, γ), whereby:
• S is the set of all possible states of the environment
• O is the set of options and primitive actions available to the agent
• P′(s′, t|s, a) is the transition probability function, which is now a function of time t
• R′(s, a, s′, t) is the reward function, which is also a function of time t
• γ ∈ [0, 1] is the discount factor
Learning options is an active field of research: examples include using inverse reinforcement learning and reward segmentation to discover skills [Ranchod et al. 2015], constructing chains of skills through change point detection [Konidaris et al. 2010] and using policy gradient theorems to learn intra-option policies and termination functions [Bacon et al. 2017]. The majority of attention in the area of learning options in MDPs has focused on using the same state representation for every option. While this successfully achieves action abstraction, the skills are still specific to the domain in which they are learned. Learning more general option representations would make it more feasible to reuse these options in new domains that consist of different settings, such as colours and lighting, but contain similar sub-problems. 2.3 Factored Representations Task decomposition and factored state representations, which partition the state space into distinct sub-states, are complementary concepts. Without a factored representation, the agent will be able to decompose the given task into appropriate sub-tasks, but the skills used to solve these sub-tasks will remain restricted to the domain in which they are learned. This is an issue when transferring these skills to a new domain, as it may not contain the same features as the original domain, and determining which features need to be similar across the two domains for the skill to apply is not trivial. Referring back to the kitchen agent example, we may decompose the task of making eggs into various sub-tasks such as fetching the eggs, or placing them in the pan. Each one of these sub-tasks is focused on a specific region of the kitchen and particular items. Learning a factored representation for the sub-task to fetch the eggs, for example, would show that there need to be eggs in the environment and the agent would need to be within a certain range of those eggs, with an accessible path to them. Other state information, such as the lighting or colours in the room, or the location of other objects, can be considered irrelevant to that sub-task and does not need to be consistent with the original domain when reusing the skill to fetch eggs in a new domain. The two most common frameworks used for learning such factored state representations in hierarchical reinforcement learning settings are Object-Oriented MDPs and Factored MDPs.
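To make the kitchen example above concrete, the effect of a factored representation can be shown with a small toy snippet; the variable names and the chosen relevant-variable set are hypothetical and exist only for illustration.
# Global state: everything the agent's sensors report.
global_state = {
    "agent_x": 3, "agent_y": 1,
    "stove_on": False, "pan_on_stove": True,
    "sink_tap_on": False, "apple_x": 7, "apple_y": 2,
    "room_colour": "white",
}

# Variables assumed relevant to the "cook an egg" sub-task.
relevant = {"agent_x", "agent_y", "stove_on", "pan_on_stove"}

def factored_view(state, relevant_vars):
    """Project the global state onto the subset of task-relevant variables."""
    return {k: v for k, v in state.items() if k in relevant_vars}

print(factored_view(global_state, relevant))
# {'agent_x': 3, 'agent_y': 1, 'stove_on': False, 'pan_on_stove': True}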
2.3.1 Object-Oriented MDPs Object-Oriented MDPs (OO-MDPs) decompose the agent’s environment into a set of n objects, O = {o1, ..., on} [Diuk et al. 2008]. Each object belongs to an object class, which contains a set of attributes associated with it. The current state of an object, denoted o.state, is given by the set of assigned values for all of the attributes of the object. The MDP state s is then defined as the union of all object states at a given point in time, s = ⋃ i=1,..,n oi.state. For example, consider the gridworld domain shown in Figure 2.1a, with object classes agent, ball and wall. All classes contain position attributes x and y, the agent class also contains a direction attribute and the ball class contains a colour attribute and a carried attribute to denote whether it has been picked up by the agent. 7 (a) Gridworld domain consisting of two rooms connected by a hallway. Cells are coloured to match the corresponding leaves in the tree below. The red triangle indicates the agent’s location and the blue circle indicates the location of a ball object. (b) Decision tree model for predicting the change in the X state variable for the domain shown in Figure 2.1a Figure 2.1: This figure shows the decision tree model that predicts how the X state feature will change for a specific two room stochastic gridworld environment. The tree is split based on actions and states, indicated by the blue rectangles. The coloured rounded rectangles are the leaves of the tree and display the probabilities of changes in X. For example P (0) = 1 predicts that X will not change with a probability of 1, while P (1) = 0.8 indicates a prediction that X will change with probability 0.8 [Hester and Stone 2013]. 8 OO-MDPs also capture interactions between objects as relationships. Formally, a relation- ship r : Ci×Cj → Boolean is a function defined at class level, whose value is dependent on two objects o1 ∈ Ci and o2 ∈ Cj . Again looking at the gridworld domain in Figure 2.1a, the relation- ship touching(o1, o2) defines whether object o1 is one block away from object o2. If relationship touching(agenti, ballj) is true, then the effect of action pickup is defined as ballj .carried← true. If a new object of an existing object class is added to the environment, then it will inherit all of the attributes and relationships that belong to that class. However, if a new object class is added to the environment, then the attributes of that class and the relationships with other classes will need to be created manually. The deictic object-oriented framework [Marom and Rosman 2018] is an extension of OO- MDPs, which explores learning more transferable transition dynamics. Traditionally the tran- sition dynamics would be described in terms of first-order predicates, which indicate to which object the transition applies to. For example, when the agent is unlocking the yellow door, it specifically requires the yellow key object. If the number of key objects in the domain changes, then this will affect the number of preconditions that the transition dynamics depend on. Deic- tic OO-MDPs overcome this by learning deictic predicates, which are grounded with respect to a specific reference object. For example, the predicates required for unlocking the yellow door would be grounded with respect to the yellow key, rather than any key in the domain. This allows the transition dynamics to be effective when transferred to new tasks, even if the number of objects differs from the original training domain. 
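As a rough sketch of the object-oriented view described above (not the original OO-MDP implementation), the gridworld classes, their attributes, the touching relationship and the conditional effect of the pickup action could be written along the following lines.
from dataclasses import dataclass

@dataclass
class Agent:
    x: int
    y: int
    direction: int            # e.g. 0..3 for the four headings

@dataclass
class Ball:
    x: int
    y: int
    colour: str
    carried: bool = False

def touching(o1, o2):
    """Class-level relationship: true when o1 is one block away from o2."""
    return abs(o1.x - o2.x) + abs(o1.y - o2.y) == 1

def pickup(agent, ball):
    """Effect of the pickup action, conditioned on the touching relationship."""
    if touching(agent, ball):
        ball.carried = True

# The MDP state is the union of all object states at a given point in time.
agent, ball = Agent(2, 3, 0), Ball(2, 4, "blue")
pickup(agent, ball)           # ball.carried is now True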
Another framework which looks to leverage the structure provided by OO-MDPs to transfer learned knowledge to new domains is Transfer Options (TOPs) [MacGlashan 2013]. TOPs are options which are learned in an OO-MDP with the specific goal of being transferable to multiple different target domains. Similar to regular options, TOPs consist of an initiation set and a termination condition, but also require a state space mapping between the source and target task. This mapping is then used to apply the entire source policy to the target domain. The limitation with the TOPs framework is that the state space mapping needs to be handcrafted by a domain expert, which can be a time-consuming process. To tackle this issue, Topin et al. [2015] present a framework that makes use of an unsupervised learning method to automatically generate mappings between domains. While this is effective in guiding the agent as to which policies are most appropriate to transfer to the target domain, the method still suffers from the limitations of the underlying OO-MDPs in that each domain’s dynamics are required to be manually created by an expert. 2.3.2 Factored MDPs Another commonly used framework when working with factored representations is the factored MDP (FMDP) framework [Boutilier et al. 2000], which allows for reward and transition functions to be represented compactly. FMDPs are MDPs where state variables are represented by a finite set of random variables and the state space of the FMDP is defined by the Cartesian product of the domains of these random variables, S = {S1, ..., Sn}. Thus, a single state in an FMDP is represented as a vector containing values assigned to each of the state variables. Since the number of states increases exponentially as the number of variables in the FMDP increases, it is desirable to represent the transition function in a way so that the inter-variable dependencies can be best exploited, and that the adverse effects of exponential growth are reduced. 9 Assuming that the problem MDP is finite, then the transition function P (s′|s, a), as described in Section 2.1, consists of a finite set of probabilities. If the state space can then be decomposed into a finite set of random variables as described above, then these transition probabilities can also be decomposed, resulting in a product of probabilities [Degris et al. 2006]. This allows the transition function to be represented more compactly, by exploiting the independencies between the random variables. Similarly the reward function can be represented as the average over all factored rewards. Vigorito and Barto [2008] present a method to learn options within an FMDP, by using the Variable Influence Structure Analysis (VISA) algorithm [Jonsson and Barto 2006]. VISA models the FMDP as a Dynamic Bayesian Network (DBN), which allows it to identify the causal relationships between state variables. Options are then discovered within this FMDP using standard option discovery techniques and these options are used for exploration to improve on the state model, which is then factored again in a recursive process. DBN’s ability to represent dependencies between random variables makes them the data structure of choice when working with FMDPs, but they are also one of the main limiting factors. 
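To make the factorisation that such a DBN encodes concrete, the decomposition of the transition function into a product of per-variable conditionals can be sketched as follows; the conditional tables and variable names below are invented for the example.
def factored_transition_prob(next_state, state, action, conditionals):
    """P(s'|s, a) written as a product of per-variable conditionals, each
    depending only on that variable's parents in the DBN and the action."""
    prob = 1.0
    for var, p_i in conditionals.items():
        prob *= p_i(next_state[var], state, action)
    return prob

# Toy example: x depends only on x and the action, while hasKey depends on the
# agent's position relative to the key and on the pickup action.
conditionals = {
    "x": lambda nx, s, a: ({s["x"] + 1: 0.8, s["x"]: 0.2}.get(nx, 0.0)
                           if a == "EAST" else float(nx == s["x"])),
    "hasKey": lambda nk, s, a: float(nk == (s["hasKey"] or
                                            (a == "PICKUP" and s["x"] == s["key_x"]))),
}

s = {"x": 4, "hasKey": False, "key_x": 4}
print(factored_transition_prob({"x": 5, "hasKey": False}, s, "EAST", conditionals))  # 0.8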
Most research within FMDPs assumes that the structure of the DBN is known, and while there have been attempts to autonomously learn this structure, it has been shown that learning the structure of a DBN from data is an NP-hard problem [Chickering et al. 2004]. As a result, similar to OO-MDPs, designing FMDPs is also a time-consuming process in which an expert is required to manually encode the variable relationships in the domain and adjust the network structure each time the domain changes. Another limitation of FMDPs is that they make the assumption that the agent’s environment and the reward function are always factored. This assumption is quite restrictive as it is not always possible to decompose the reward function into factored local rewards. Similarly, if the environment contains any state variables which have causal relationships and dependencies that don’t allow them to be decomposed, then the problem cannot be represented as an FMDP. 2.4 Decision Tree Based Models Another data structure which is capable of modeling transition and reward dynamics is a decision tree. While decision trees are used much less frequently in conjunction with FMDPs due to their inability to capture inter-variable dependencies, they are far more robust in adapting to changes in the domain. TEXPLORE [Hester and Stone 2013] is a model learning algorithm, which learns models of the domain using random forests of decision trees to model transition and reward dynamics. The algorithm is similar to the way in which DBNs are used for learning FMDPs, in that it learns separate predictions for each individual state feature, in conjunction with the reward. Decision trees are used in TEXPLORE to approach model learning as a supervised learning problem, where the learner is predicting the next state s′ and reward r, from the input pair (s, a) comprising of the current state and action. Each node of the decision tree is split using the C4.5 algorithm [Quinlan 1986], which uses information gain to determine the optimal split. 10 Given a vector containing the n state features and action a, each tree makes a prediction for a specific feature or reward. The TEXPLORE algorithm works in an on-line manner, meaning that the tree is generated and refined while the agent is acting within the MDP. Beginning with an empty tree, the algorithm will continue to refine the tree based on new agent experiences. Eventually the tree will contain leaves for each individual state in which the transition dynamics vary from the global state dynamics. Consider the simple gridworld domain shown in Figure 2.1a, consisting of two rooms con- nected by a hallway. The domain is stochastic, with an 80% probability of the agent moving in the desired direction of a given action, and a 20% probability that the agent remains in the same state. Figure 2.1b shows the decision tree model for predicting the change in the X state variable. The tree is split on both the agent’s actions and individual state variables, as shown by the nodes in the blue rectangles. The gridworld is coloured to match the corresponding leaves in the tree when taking the action EAST. While the agent is acting within the domain, the tree concurrently gets built in an on-line manner. To observe how the tree divides the state space, consider taking action EAST in state X=14. Unlike the global state dynamics where the agent’s X variable would increase by 1 when taking action EAST, in this case the agent would not transition to a new state as it is against a wall. 
For this reason, we say that the tree splits the states up into sections where the transition dynamics vary from the global state dynamics. Once the tree is complete we can see that all of the leaves give probabilistic predictions of how the X variable will change based on the outcomes that the agent experienced when exploring the domain. The use of decision trees provides a sample efficient way of modelling the MDP, but can often be incorrect for infrequently visited state-actions. To provide a more accurate prediction TEXPLORE uses the random forest model [Breiman 2001], which consists of a collection of decision trees. Each decision tree in the random forest is only trained on a subset of the agent’s experiences within the MDP. Given a state-action pair (s, a), the random forest model makes a prediction by averaging all of the predictions from each of the trees. For example, if three trees predict outcome 0 and outcome 1 with equal probability of 0.5 and a fourth tree predicts outcome 0 with a probability of 1, then the random forest model will predict outcome 0 to occur with a probability of 0.625 and outcome 1 with a probability of 0.375. While the focus of TEXPLORE is on learning transition dynamics to aid agent planning, the algorithm’s ability to separate the state space into different regions in which the transition dynamics differ is what is most relevant to our research. These groupings allow us to find commonalities between tasks, which may allow for the transfer of skills, rather than spending time redundantly learning new skills for tasks we have seen before. 2.5 Transfer Learning The ability to reuse rather than relearn skills is highly beneficial in improving the efficiency of a reinforcement learning agent. To achieve this, knowledge needs to be transferred from tasks that have been seen before to new, unseen tasks. Taylor and Stone [2009] provide an extensive overview of the concepts of transfer with respect to reinforcement learning tasks. Knowledge gained from previously solved tasks, known as source tasks, is used to improve the learning of a new unseen task, called the target task. 11 There are various transfer learning approaches, which transfer different types of knowledge, ranging from action-value functions, to policies, to full task models [Konidaris and Barto 2007; Pickett and Barto 2002; Lazaric 2012]. We are particularly interested in the transfer of options and their dynamics models between tasks. The most straightforward form of option transfer would be to transfer the source options directly to the target task, but this is often not beneficial as the source options are often overfitted to the source task and require strong state overlap for their policy to be applicable in the target domain [Taylor and Stone 2009]. One way to counteract this is to provide a state space mapping between source and target tasks as in the TOP framework [Topin et al. 2015] discussed in Section 2.3.1. Another approach explores separating the state space into agent-space and problem-space [Konidaris and Barto 2007]. Agent-space is the set of features that remain consistent across tasks, such as observations by robot sensors, while problem-space defines the set of states, transition probabilities and reward function based on the current environment that the agent is in. Options learned in both agent-space and problem-space can be used to significantly increase the total reward and initial performance of the target task. 
It is suggested that agent-space options transfer better in cases where the source task and target task differ more, while problem-space options are useful when the source task and target task are more similar [Konidaris and Barto 2007]. Given these various transfer learning approaches, it is important to be able to distinguish how effective they are. When considering the entire process of transfer learning, one must decide whether or not to incorporate the time spent learning the source task when evaluating the efficiency of the approach. It is suggested that if the goal of transfer is to reuse knowledge in novel tasks, then it is reasonable to only account for time spent learning in the target task [Taylor and Stone 2009]. Given this, Taylor and Stone [2009] present five metrics which can be used to evaluate the performance gain of transfer learning approaches, in relation to learning without transfer:
• Jumpstart: How much does the initial performance of the agent in the target task improve, given transferred knowledge from a source task?
• Asymptotic Performance: How much has the final performance of the agent in the target task improved, given transferred knowledge from a source task?
• Total Reward: How much has the total reward accumulated by the agent increased? This can be calculated by taking the area under the learning curve, as seen in Figure 2.2.
• Transfer Ratio: What is the ratio of the total reward accumulated by the agent with transfer and without transfer? The transfer ratio can be considered as the ratio of areas defined by learning curves with and without transfer. It can be calculated by the equation:
r = (area under curve with transfer − area under curve without transfer) / (area under curve without transfer)   (2.3)
• Time to Threshold: How much faster does the agent reach a pre-specified performance level when learning the target task, given transferred knowledge from a source task?
Figure 2.2 provides a visual representation of some of these transfer metrics. Figure 2.2: Graph depicting the different metrics for evaluating transfer. Jumpstart, asymptotic performance and time to threshold are indicated on the graph, while total reward can be calculated as the area under the learning curves [Taylor and Stone 2009]. 2.6 Related Work While the Transfer Options framework [Topin et al. 2015], discussed in Section 2.3.1, is the most comparable work to our framework, the field of learning specialised representations for hierarchical learning remains an active research area. One such example is the TeXDYNA method, which looks at option discovery within FMDPs [Kozlova et al. 2010]. TeXDYNA breaks down a single environment with the FMDP structure and then solves it using hierarchical reinforcement learning, with options that are learned within that factored environment. This is inherently different from our work in that we wish to learn factored representations of the options themselves, with the aim of transferring them to new domains to reuse. TeXDYNA would not be practical to use for our approach, as it would require learning the factored state representation of each new environment, rather than learning a one-off representation which could apply across multiple environments. The object-oriented dynamics predictor framework [Zhu et al. 2018] presents a method for autonomously predicting OO-MDP structures for a given environment.
The predictor uses a neural network to decompose pixel-based states into objects and learn the object-level dynamics and relationships, although it does not focus on any hierarchical reinforcement learning applications of the learned OO-MDPs. Extending this framework to operate in skill-based settings is a potential avenue for future work, as it would provide an interesting comparison with the framework that we present in this study. 2.7 Summary In this chapter we introduced the background work upon which this research is based. This includes presenting an overview of reinforcement learning, hierarchical reinforcement learning and factored representations. We then discussed previous works that have successfully combined hierarchical reinforcement learning and factored representations, and elaborated on why these two concepts can be integrated to accelerate learning, particularly through transfer. Despite the successes of the research described here, these works still rely heavily on the structure of the environments being known beforehand, and on input from the researcher to handcraft the domains and experiments [Topin et al. 2015; Vigorito and Barto 2008]. In the following chapter we outline the methodology of our research, which provides a framework that can autonomously learn factored representations for skill-based reinforcement learning. Chapter 3 A Framework For Learning Factored Skills In the previous chapter we outlined the background knowledge required for implementing this research. This included an introduction to reinforcement learning, followed by detailing the benefits of task decomposition, factored state representations and transfer learning. We wish to unify these concepts by learning generalised representations of skills that can be reused across an array of different tasks. In this chapter, we detail our framework for autonomously learning such representations and elaborate on how these factored skills can be used to accelerate learning of new tasks in Section 3.1, including a worked example of the framework methodology in Section 3.1.1. We then explain how the initial factored representations can be improved upon given additional training tasks in Section 3.2, and then present our method for grouping similar variables in Section 3.3. 3.1 Learning Factored Options Given an initial training domain consisting of a set of tasks, and an agent with a set of options for solving these tasks, our framework learns representations for each of these options that contain only the variables relevant to solving each particular task. These factored option representations can then immediately be used in a new domain containing similar tasks to accelerate learning, without needing to adjust them to any variations in the new domain. To explain our methodology, we decompose our framework into four steps, shown in Figure 3.1. The steps can be described at a high level as follows:
1. Exploring the training domain with the agent to generate a set of trajectories, consisting of the states, actions and next states that the agent observes
2. Updating and formatting our set of trajectories, so that they are able to be used as input when training our decision tree models
3. Using the trajectories to learn individual decision tree models for each state variable
4. Learning factored representations from these models and applying these representations to our options to create factored options
Figure 3.1: Diagram illustrating the framework's factor learning process.
First, the agent explores the domain, attempting to solve the initial task. This generates a set of trajectories which are used as input for the decision tree models of the state variables. From these models, factored representations are learned for each option available to the agent. These factored options are then made available to the agent, and it is given a new task to solve. This process continues and the factored representations are refined after each cycle. Since our research focuses on learning factored option representations, we begin by making the assumption that a set of options for the initial tasks has already been learned. As our domains are deterministic, we learn the optimal initial options by generating a graph of the environment using NetworkX [Hagberg et al. 2008] and using the built-in single-target shortest-path algorithm to obtain a policy for all reachable states that lead to the relevant terminal locations (key or door locations). While we use this method because the shortest-path algorithm is efficient, it is perfectly reasonable to make use of one of the many other option discovery techniques [Bacon et al. 2017; Ranchod et al. 2015; Konidaris et al. 2010] to learn the initial options for our framework. The agent uses SMDP Q-Learning [Sutton et al. 1999] to explore and attempt to solve the initial domain with these options. While the agent explores the domain, each state-action pair, along with the next state that the agent reaches, is logged to generate a set of trajectories. We want to use these trajectories to determine which state variables each option affects, and create a factored representation for the options which only uses those relevant variables. To do so we make use of decision tree models, which are learned for each state variable. Our input x for the decision trees is the set of states s with action a appended. Since we are interested in the transition dynamics for each state variable, we calculate the change in each variable by looking at the difference between state s and the next state s′ after action a has been taken, and generating a binary indicator for each variable. Each variable is either assigned a 1 if that variable has changed from state s to s′, or a 0 if that variable has remained constant. The target y for a particular decision tree model will then be the set of all binary indicators for a single variable. Our framework takes these inputs and target values and learns decision tree models for each state variable, using the Scikit-Learn DecisionTreeClassifier [Pedregosa et al. 2011], which makes use of the CART algorithm [Breiman et al. 1984]. While we draw inspiration from TEXPLORE [Hester and Stone 2013] in learning decision tree models of the transition dynamics for each state variable, our decision trees differ in that the branches are split on states and options, rather than states and actions. Any primitive actions that are able to change a particular variable are already encapsulated by the option that comprises those actions, and are thus redundant when learning the factored options' models. We only consider trajectories in which an option is executed and discard any trajectory which only contains primitive actions. Once we have obtained the decision trees for each state variable, we wish to generate a factored option representation for each of the available options. Algorithm 1 presents the pseudocode for the process of learning these representations.
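Before walking through Algorithm 1 line by line, a compact Python rendering of this model-learning and factoring step is sketched below. The (state, option, next_state) trajectory format, the one-hot encoding of the executed option, and the helper names are illustrative assumptions rather than the exact implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_variable_trees(trajectories, var_names, options):
    """Fit one decision tree per state variable, predicting whether that variable
    changes when a given option is executed in a given state.  The executed
    option is appended to the state as one-hot indicator features."""
    def encode(state, option):
        return [state[v] for v in var_names] + [int(option == o) for o in options]
    X = np.array([encode(s, o) for s, o, _ in trajectories])
    trees = {}
    for v in var_names:
        y = np.array([int(s[v] != s2[v]) for s, _, s2 in trajectories])
        trees[v] = DecisionTreeClassifier().fit(X, y)
    return trees

def factored_state_set(trees, option, var_names, options):
    """An option depends on a state variable if the option appears on a traversal
    to a leaf of that variable's tree in which the variable is predicted to change."""
    opt_feature = len(var_names) + options.index(option)
    relevant = []
    for var, clf in trees.items():
        t = clf.tree_

        def on_positive_path(node, saw_option):
            if t.children_left[node] == -1:              # leaf node
                predicted = clf.classes_[t.value[node][0].argmax()]
                return saw_option and predicted == 1
            left = on_positive_path(t.children_left[node], saw_option)
            # the right child is the branch where the split feature is "active" (> 0.5)
            right = on_positive_path(t.children_right[node],
                                     saw_option or t.feature[node] == opt_feature)
            return left or right

        if on_positive_path(0, False):
            relevant.append(var)
    return relevant
The list returned for each option corresponds to the factored state set defined formally below.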
To briefly describe Algorithm 1: we loop through all of the decision trees and identify all of the trees in which the option appears (lines 1, 2). If an option appears in a tree traversal to a leaf node in which the state variable changed, then we know that the option is dependent on that state variable (lines 3, 4). The algorithm then returns a list of all the state variables that each option is dependent on (line 7), which we formally refer to as a factored state set.
Algorithm 1 Learn Option Factoring
1: procedure factor_options(state_var)
2:   for each decision tree d ∈ D do
3:     for each positive node n ∈ N do
4:       t ← traversal(n)   ▷ append traversed variables and options to set t
5:       for each option o ∈ O do
6:         if o ∈ t then
7:           factored_set[o] ← state_var   ▷ append state variable to factored set
We provide a set of formal definitions to help ground the notion of factored options learned by our framework, as well as the factored state space over which they are learned and across which they are transferred. Definition 1: A factored state set is a subset of the global state variable set, S = {S1, ..., Sn}, which consists of n random variables. Formally, we denote the factored state set as φ ⊂ S. Definition 2: A factored state space Sφ contains the set of all possible combinations of the variables for a specific domain, given a factored state set φ. To illustrate the difference between these two concepts, an example of a factored state set may be φ = {xPosition, yPosition}, for which the factored state space would be Sφ = {(0, 0), (0, 1), ..., (n−1, n−1)} for an agent acting in an n × n grid. Definition 3: A factored option is defined as an option acting over a given factored state set φ. Therefore the initiation set and termination condition of a factored option are subsets of the factored state space Sφ. Formally, this can be written as FO = (I∗o, π∗o, β∗o, φ), where I∗o ⊂ Sφ is the initiation set over the factored state space and β∗o is the termination condition, defined as a probability distribution over Sφ. Figure 3.2 shows how the factored state set can be used to apply a factored option to a new domain. The factored variables are required to be named the same across domains and to exhibit the same transition dynamics for the option to be executable. For example, using the pickup action when next to a yellow key should always result in the hasYellowKey variable becoming true. This mapping between domains is invariant to both the location of the variables within each domain's global state variable set, and the total number of variables in the set. Figure 3.2: Example of how the factored state set is used to generate factored representations, which can then be mapped to new domains. In this example, φ refers to the set of variables xPosition and yPosition, on which the option depended. Assuming that the top state is in the initiation set for the option, the factored option would be able to be initiated in the bottom state, as the dependent variables match. 3.1.1 Minigrid Example To further outline this process, we provide a detailed breakdown of the methodology through the use of an example experiment. We make use of the Minigrid environment [Chevalier-Boisvert et al. 2018] for this worked example, as that is the environment which we use for our initial experiments when evaluating our framework. In the Minigrid environment there are five actions available to the agent, namely turning left, turning right, moving forward, picking up objects and using objects.
In each domain the agent is posed with a set of tasks that are required to be solved sequentially. 18 To demonstrate this, consider the example domain shown in Figure 3.3, where the agent is represented by the red triangle and the goal by the green square. This domain consists of three rooms, connected by locked doors. To proceed to subsequent rooms the agent is required to unlock the doors by obtaining the key corresponding to the door colour. The sequence of tasks in this domain can be broken up as follows: obtain the yellow key, unlock the yellow door, obtain the blue key, unlock the blue door and proceed to the goal state. The sparse rewards in these domains make them a challenge for non-hierarchical methods. Formally, we define the reward function as 100 at the green goal state and −1 elsewhere. We define the state vector for this domain as follows: (x, y, hasY ellowKey, yellowDoorOpen, hasBlueKey, blueDoorOpen) where x and y describe the position of the agent, while the key and door variables are boolean indicators. Figure 3.3: Example Minigrid domain, consisting of 3 rooms. Each room is connected by a locked door, which can only be unlocked by using the key corresponding to the door colour. The agent location is indicated by the red triangle and the goal state by the green square. Continuing with our example domain, we begin with options that allow the agent to obtain each of the keys and options to unlock each of the doors. For example, the options to obtain the blue key and unlock the blue door are visualised in Figure 3.4. The NetworkX graph used for learning our initial options is built using data generated from an agent exploring the domain. Hence the initial options are dependent on the sequence of tasks that the agent is required to complete. This ensures that these options have the same conditions for their initiation sets and termination conditions as that of discovered options. For example, consider the option that takes the agent to collect the blue key; when learning this option the yellow key will have already been collected and the yellow door will have been opened. As such, variables hasY ellowKey and yellowDoorOpen are set to 1 in the initiation set of the option. Using the state vector described previously, we can formally define the initiation set as (x∗, y∗, 1, 1, 0, 0) for the option to obtain the blue key (where x∗ and y∗ are any valid values of x and y, as shown in Figure 3.4). The termination condition is then given by (x∗, y∗, 1, 1, 1, 0). The option policy can be seen in Figure 3.4a. 19 (a) Option policy to collect blue key (b) Option policy to unlock blue door Figure 3.4: Visual representation of the option policies that take the agent to collect the blue key and to unlock the blue door respectively. Movement actions are indicated by arrows, the pickup action by stars and the use action by a dot. We then learn the decision tree models of each of our state variables. For example, the decision tree for the hasBlueKey variable is shown in Figure 3.5. We then learn the factored state set for each option by following the procedure described in Algorithm 1. In this example the decision trees for the hasYellowKey and yellowDoorOpen do not contain the option to open the blue door, and as such these variables are excluded from the factored state set, even though they were required to be true in the original handcrafted option. The factored set for the option to open the blue door is (hasBlueKey, blueDoorOpen). 
The new initiation set can then be thought of as (0, 0) for these two variables, and any arbitrary variable values for the rest of the state variables. This is useful as we can now transfer this option to a new domain containing a blue key and reuse it without having to collect the yellow key first. In fact, the new domain is not required to contain a yellow key at all. So long as the domain contains hasBlueKey and blueDoorOpen variables, this option will be able to be executed. Figure 3.5: Decision tree for the hasBlueKey state variable. The tree is split on options and state variables, shown by the blue rectangles. The leaves provide probabilistic predictions on how the hasBlueKey variable will change. 20 3.2 Refining Representations If we consider the framework as described in Section 3.1, the factoring that is learned is heavily dependent on the initial task that the agent is presented with. This is not ideal as it is highly likely that the decision tree models that have been learned have been overfitted to this training domain. This may result in the framework excluding certain state variables for which the domain provides limited or no data on or considering certain variables as dependent due to the structure of the domain. TEXPLORE [Hester and Stone 2013] attempts to address overfitting by learning multiple decision tree models for a single domain, and averaging them using a random forest model. This approach results in better models for the given domain, but these models are still reliant on the information provided by that single domain. We show that given more training domains which consist of similar tasks, our framework is able to learn more general models that do not overfit to a single domain. We implement this by providing the agent with a set of domains, consisting of similar tasks, for the agent to solve sequentially. Referring back to our Minigrid example, the overall structure of the domain would remain the same (3 rooms with keys and doors), but the tasks may differ by altering key and door colours and locations for example. After the framework has learned a set of factored option representations from the first domain, these options are then provided to the agent to use in exploring the second domain to generate a new set of trajectories. After gener- ating trajectories for a new domain, they are then concatenated with the trajectories generated from all previous domains, and used to learn new decision tree models and factored represen- tations. This cyclic process is repeated, with the agent continually refining its representations after exploring each new training domain. Since the set of trajectories now span across multiple domains, the models learned from these trajectories are better able to generalise than with trajectories from a single domain alone. Referring back to Figure 3.1, we see that the four steps that our framework follows can be repeated in a closed-loop cycle, which continually refines the learned representations. 3.3 Variable Grouping Consider an agent with a camera that provides RGB data. Based on the current framework, each sensor reading is considered independent, so it would be possible for example that the green or blue variables are considered irrelevant to a particular option and are excluded from the option factoring. It does not make sense in this case to deal with these camera inputs independently as they are not independent and identically distributed random variables. 
Rather, it would be more sensible if these variables could be grouped and treated as a single sensor input. Pierce and Kuipers [1997] suggest that robotic sensors that are physically close together tend to behave similarly. Similarly behaving sensors are defined as sensors which have a similar value at each timestep and a similar frequency distribution. For example, a robot in a real-world setting may have sensors related to the robot itself, such as its position and direction, vision-based sensors which provide information about the orientation of objects around the robot, and environment-based sensors such as temperature or wind speed. Figure 3.6 shows a toy example illustrating how these groupings could be applied.

Due to the fact that these grouped sensors exhibit similar behaviour and record similar variables, we make the assumption that if an option is dependent on the majority of a sensor group's variables, then it is probably dependent on all of that group's variables. Algorithm 2 shows the pseudocode for applying this grouping to our factored options. For each set of sensor groupings, we count the number of variables from that group that are in the option's factored set (lines 3, 4). If the count reaches the group threshold (a fraction t of the group's size), then we add all of that group's variables to the option's factored set (lines 5, 6).

Algorithm 2 Group Factoring
1: procedure group_factoring(O, t)                ▷ t: threshold as a fraction of the group size
2:   for each factored option o ∈ O do
3:     for each sensor grouping s do
4:       c ← group_count(s, o)                    ▷ count of the group's variables in o's factored set
5:       if c ≥ t · |s| then
6:         factored_set[o] ← factored_set[o] ∪ s  ▷ append the entire group to the factored set

Figure 3.6: Diagram illustrating the application of variable groupings. The agent has information from three different systems: a GPS system providing the agent's location, a camera showing the agent's visual observation and a pose estimation system describing the joint angles of the agent. Observations from these three systems are concatenated together to form the entire state space. We aggregate all the individual readings or variables into three separate groups, one for each system.

3.4 Evaluation of Transferability of Factored Options

Once we have learned the factored option models, we wish to evaluate their ability to improve learning in new tasks. To achieve this we need to transfer the options to our new domain. We transfer the entire factored option, including the option policy, initiation set, termination condition and factored state space. In this case transfer involves making these factored options available as part of the action space of the SMDP for the target domain. The agent will then have both Minigrid's primitive actions and the transferred set of options available to it when learning in the target domain.

To demonstrate the benefit of our factored options, consider the domain shown in Figure 3.7b, which is almost identical to the one in which we learnt the factored options, shown in Figure 3.7a. The only difference from the original domain is the starting position of the agent. Consider our initial handcrafted option to collect the blue key, which required the yellow key to be collected and the yellow door to be open. In this new domain the handcrafted option cannot be executed, as it is impossible for the agent to obtain the yellow key. Conversely, our factored option is not dependent on the yellow door and yellow key and so would be executable in this new domain.
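As a concrete illustration of why the factored option remains executable, only its factored variables need to exist in the target domain and match its (factored) initiation condition. The sketch below is illustrative only: the FactoredOption container, the option name and the variable values (which follow the (0, 0) worked example above) are not the dissertation's code.

from dataclasses import dataclass, field

@dataclass
class FactoredOption:
    name: str
    # initiation condition over the factored variables only,
    # e.g. {"hasBlueKey": 0, "blueDoorOpen": 0} for the blue-key option
    initiation: dict
    factored_vars: set = field(init=False)

    def __post_init__(self):
        self.factored_vars = set(self.initiation)

    def can_initiate(self, state):
        """state: dict of variable name -> value in the target domain.
        The option is executable iff every factored variable exists in the
        target state and matches the initiation condition; all other
        variables (e.g. hasYellowKey) are simply ignored."""
        return all(var in state and state[var] == val
                   for var, val in self.initiation.items())

get_blue_key = FactoredOption("getBlueKey",
                              {"hasBlueKey": 0, "blueDoorOpen": 0})
# Executable even though the target state has no yellow-key variables at all.
print(get_blue_key.can_initiate({"x": 2, "y": 5,
                                 "hasBlueKey": 0, "blueDoorOpen": 0}))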
(a) Source domain in which the options were learned
(b) Target domain to which the options were transferred
Figure 3.7: Example of two domains across which transferring traditional options will fail, but factored options will work. The two domains are almost identical, except for the starting position of the agent.

While these two domains have been specifically tailored to display the benefits of factored options, we wish to discern the ability of these options to generalise across a multitude of different tasks. We generate new tasks by sampling domain features from a predefined distribution. The features that we sample for a new domain include agent start location, key location, key colour, door location, door colour, wall locations and goal location. Each of these is sampled from its own predefined distribution. For example, we may sample two key colours, without replacement, from a set of four available colours to be used in the domain. This variable sampling results in different domains for similar tasks; for example, two domains may both have a task to collect the yellow key, but the agent start location, the key location and the colours of the other keys in the domain all differ between the two domains. These are the cases in which we are interested in evaluating our factored options, although we are also interested in investigating the effect of additional variables, whether they be relevant or extraneous. Since the goal location, on which the reward function depends, does not change across the two domains, the reward function also remains constant.

3.5 Summary

In this chapter we presented our framework for autonomously learning factored skill representations. In addition to the methodology of the framework, we also addressed how we combat various issues which appear when attempting to learn these representations. This includes the refinement process, which updates the representations given multiple tasks to prevent overfitting to a single task, and the use of variable groupings to prevent the exclusion of sensor readings for which there is limited or no data available during training. In the following chapter we focus on demonstrating the performance of our framework. We present a set of baseline algorithms with which we compare our framework and detail how we evaluate the framework's performance. We then look at various settings where our framework excels against other methods and also highlight some of the limitations of the work. This includes investigating the effect of various parameters on the framework's functionality.

Chapter 4
Experimentation

In the previous chapter we outlined the methodology that we proposed for implementing our framework to obtain factored options. In this chapter, we present a set of experiments to evaluate the performance of our framework compared to various baseline algorithms. We begin by introducing the domains which we used for our experiments in Section 4.1, and elaborate on how we evaluate the performance of our factored options, including an explanation of the algorithms with which we compare our framework, in Section 4.2. We then present a set of initial experiments in Section 4.3 to show the general performance of the framework, before moving on to more detailed scenarios where the framework excels. This includes extending our framework to learn locally transferable options in Section 4.4 and evaluating the performance in domains with noise in Section 4.5.
We conclude by demonstrating the framework's performance in a simulated real-world kitchen domain in Section 4.6.

4.1 Environments

For our initial experiments we make use of the Minigrid environment [Chevalier-Boisvert et al. 2018] that we referred to in Section 3.1.1 as a worked example for our methodology. Each domain consists of a 10×10 grid, although the outer cells are all walls, which means that the agent can only interact with an 8×8 space. The action space for all domains provides the agent with 5 actions – turn left, turn right, move forward, the ability to pick up objects and the ability to use objects. All domains consist of three rooms and comprise the same five tasks (obtain key A, open door A, obtain key B, open door B and go to the goal location) in different configurations. The reward function is consistent across all of the domains, and is defined as 100 at the goal location and −1 elsewhere. To generate domains consisting of these different configurations, we define a distribution of item locations, wall locations and item colours and generate each domain by randomly sampling from each of these three distributions. We repeat this process to generate a set of training domains and a set of test domains. Figure 4.1 shows an example of two different generated domains.

4.2 Baseline Comparisons

As discussed in Section 2.3.1, object-oriented algorithms are currently the favoured method for learning factored skill representations. Despite this, there is no common baseline algorithm available for us to compare our framework with. Instead, we make use of the OO-MDP learning algorithm DOORMAX [Diuk et al. 2008].

Figure 4.1: Example of two generated domains. The agent is represented by the red triangle and the goal location by the green square.

First we handcraft our OO-MDP object classes and attributes as shown in Table 4.1. We then define a set of five conditions between objects: touchingN(o1, o2), touchingE(o1, o2), touchingS(o1, o2), touchingW(o1, o2) and direction. The touchingN(o1, o2) condition describes the situation where o1 and o2 occupy adjacent grid cells, with o2 to the north of o1. For example, the proposition touchingN(agent, wall) ∧ direction = N tells us that the agent is facing north and touching a wall to the north. The effect of taking the action move forward under this proposition is no-change.

Class Name:   Agent       Key         Door        Wall
Attributes:   xPosition   xPosition   xPosition   xPosition
              yPosition   yPosition   yPosition   yPosition
                          colour      colour
                          collected   opened

Table 4.1: Table showing the object classes and attributes for the Minigrid OO-MDP.

We then adapt the DOORMAX algorithm included in the Efficient Reinforcement Learning framework [Borea 2020] to solve these OO-MDP domains. As we are applying DOORMAX to the target domains, this can be seen as an upper bound on the transfer-based OO-MDP algorithms mentioned in Section 2.3.1, such as TOPs, since Diuk et al. [2008] show that the DOORMAX algorithm is provably efficient. Additionally, we evaluate our factored options against the following methods:

• Perfect options – we handcraft the optimal options for the target domain and solve the domain using SMDP Q-Learning [Sutton et al. 1999]. This can be viewed as an upper bound on the performance of the factored options, equivalent to the case where all of the learned options can be deployed everywhere in the target domain.

• No options – we solve the target domain using standard Q-Learning with primitive actions only.
This can be viewed as a lower bound on the performance of the factored options, equivalent to the case where none of the learned options are applicable in the target domain.

• Unfactored options – we transfer the same options learned during training, but without applying any factoring to them.

When evaluating these algorithms we specifically focus on the jumpstart, asymptotic performance and area under the curve metrics of the options, as discussed in Section 2.5. Jumpstart and asymptotic performance are useful for comparing algorithms that are able to solve the domain and converge to the same reward. In the more complex domains, though, it is infeasible to run Q-Learning without options for enough episodes for it to solve the domain, and in these situations calculating the total reward from the area under the curve provides us with a way to compare performance.

4.3 Initial Experiments

Our initial experiments test the validity of our framework. For these experiments we generate 50 training domains and 20 test domains. Each experiment involves randomly sampling six domains from the training set, which are given to the agent to solve sequentially. The agent begins with a set of four optimal options which can be used to solve the training domains. These options allow the agent to collect the first key, open the first door, collect the second key and unlock the second door.

For these experiments we focus on the performance of the factored state set φ that is learned by our method for each of the four tasks, rather than the performance of the options themselves. This is because, at this stage, each option's policy is overfit to the specific task for which it was learned. For example, consider a domain where the yellow key in the first task is in the top right corner and we have an agent with the option to collect the yellow key. If we transfer this option as is to a new domain where the yellow key is in the top left corner, executing the option will still take the agent to the top right, and this will not be beneficial for learning to solve the new task. In Section 4.4.1, we look at a solution to this by making use of a more agent-centric observation.

To evaluate the performance of φ, we apply the factored state set to the optimal options for the test domains, to obtain factored options. We then use these factored options to learn in the test domains. In the best case we expect the performance to be close to that of the optimal options for each task, once we have converged on the correct factoring. The agent generates trajectories to learn φ by solving each of the training domains using SMDP Q-Learning [Sutton et al. 1999] with discount factor γ = 0.9, learning rate α = 0.15 and ε-greedy action selection with ε = 0.05. We train the agent for 500 episodes with a maximum episode length of 500 steps. For each full run of our framework we sample a set of 5 test domains which we use to evaluate the performance. We use the same parameters for our SMDP Q-Learning agent when testing and training. After each new training domain that the agent sees, we evaluate the learned factored state set on all five test domains and obtain an average of the performance.
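For concreteness, the SMDP Q-Learning agent described above treats each option as a temporally extended action. The following is a minimal sketch of the update and action selection under the hyperparameters quoted above (γ = 0.9, α = 0.15, ε = 0.05); the can_initiate helper, the behaviour objects and the state representation are assumptions made for illustration, not the dissertation's implementation.

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.15, 0.05

def epsilon_greedy(Q, s, behaviours):
    """behaviours: primitive actions plus any options; can_initiate is
    assumed to return True for primitives and to check the (factored)
    initiation set for options."""
    applicable = [b for b in behaviours if b.can_initiate(s)]
    if random.random() < EPSILON:
        return random.choice(applicable)
    return max(applicable, key=lambda b: Q[(s, b)])

def smdp_q_update(Q, s, o, discounted_return, k, s_next, behaviours):
    """SMDP Q-Learning update [Sutton et al. 1999]: behaviour o ran for k
    primitive steps from s, accumulating discounted_return, and ended in s_next."""
    target = discounted_return + (GAMMA ** k) * max(Q[(s_next, b)] for b in behaviours)
    Q[(s, o)] += ALPHA * (target - Q[(s, o)])

Q = defaultdict(float)   # indexed by (state, behaviour) pairs, states assumed hashable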
Using multiple test domains helps us to assess the generalisability of our framework, and ensures that the performance is not greatly influenced by a single training domain that happens to be very similar to one of the test domains, which can occur since both are randomly generated from the same distributions.

Figure 4.2 shows the average episode reward for our factored options for various numbers of training tasks. We can see that even after only having seen a single task our framework performs better than training without the use of options, and by six tasks the performance converges close to the optimal option performance, shown by the golden curve.

(a) Learning curves after 1, 3 and 6 training tasks, compared with the optimal option performance
(b) Learning curves for all numbers of tasks without confidence intervals for the sake of clarity
Figure 4.2: Average reward per episode for our learned factorings, for various numbers of training domains, compared to the optimal option performance. Graphs are plotted using a rolling average with a window size of 10 episodes.

Figure 4.4 shows a comparison between the decision tree for the hasBlueKey variable after one training domain (the domain displayed in Figure 4.3) and after six training domains. Figure 4.4a demonstrates that after one domain the tree has overfit, considering variables such as the x-position of the agent (xPosition) and whether the yellow door has been opened (yellowDoorOpen) to be relevant to collecting the blue key. From Figure 4.3, we can see that this occurs due to the dynamics of the training domain. In this domain the agent will always be in an x-position ≤ 3.0 when collecting the blue key, and, due to the sequential nature of the tasks, the blue key can only be collected once the yellow door has been opened.

The tree shown in Figure 4.4a can be interpreted as follows: the root node contains 2319 samples or trajectories, where 272 samples belong to class 0 and 2047 samples to class 1, as shown by the value field. This node splits the samples based on whether xPos ≤ 3.0. The left child node is a leaf node, which contains the 131 samples where this is true, all of which result in a classification of class 0. The right child node then contains the remaining 2188 (2319 − 131) samples, of which 141 (272 − 131) are of class 0 and 2047 are of class 1. This node now splits the tree on whether hasBlueKey ≤ 0.5, and we repeat this process until we are only left with leaf nodes.

Figure 4.4b shows that after the agent has seen more examples it is able to learn a more general model of the state variable. Traversing up this tree from the positive orange leaf node, we can read this as: option getBlueKey ≤ 0.5 must be False (we must execute option getBlueKey) and variable hasBlueKey ≤ 0.5 must be True (we must not have already collected the blue key). From this it can be seen that once the model has converged, the only relevant variables to the task of collecting the blue key are the option getBlueKey and whether the key has already been collected or not, indicated by the variable hasBlueKey. Appendix A shows the converged decision trees for other state variables.

Figure 4.3: Initial training domain which resulted in the decision tree model shown in Figure 4.4a.

(a) hasBlueKey decision tree model after one training domain (Figure 4.3)
(b) hasBlueKey decision tree model after six training domains
Figure 4.4: Example of two decision tree models learned during training.
The first decision tree is learned after having only seen one training domain, while the second tree is the model learned at the end of training after having seen six training domains. The colour of each node indicates the majority class for that node (blue = class 0, orange = class 1 and white for a tie).

4.4 Locally Observable Domains

As mentioned in Section 4.3, directly transferring the options to new domains is generally ineffective as they are too restricted to the original domain in which they were learned. One method to tackle this is to decompose the problem into agent-space and problem-space [Konidaris and Barto 2007]. Agent-space contains the features that are consistent to the agent across all tasks, while the problem-space is defined by the environment in which the agent is currently acting. This allows the agent to learn more portable options which can be transferred between tasks that share the same agent-space. However, an issue with such agent-centric representations is that they are often non-Markov [Konidaris et al. 2012].

We have shown in Section 4.3 that the factored state sets that are learned by our framework are effective. We wish to draw inspiration from the agent-space approach, to apply these representations to a set of options that can easily be transferred across domains. To achieve this, we consider the case where the agent has a partially observable local view available. Figure 4.5a shows an example of this in a Minigrid domain, where the agent has a 3×3 vision grid. Each cell in this observation grid corresponds to a number depending on its contents. For example, an empty cell is a 0 and a cell containing a wall is a 10. We then flatten this 3×3 grid into a vector of size 9 and append it to the current state observation that the agent receives. We label the individual cell variables v1, ..., v9.

The agent's local observation now provides us with a more portable representation between domains. For our option to be executable in a new domain we now only require a set of 3×3 cells to match that of the original training domain. Using this new observation, we are no longer required to learn a new set of options for each new training domain. Instead we begin with a single set of options learned for the initial training domain. We then learn a factored representation for these options, apply this factoring, and then transfer these factored options to the next training domain where we repeat the process to refine our factored representation.

4.4.1 Variable Groupings

Given the partially observable domain seen in Figure 4.5a, the resulting decision tree model that is learned for the red key variable is shown in Figure 4.5b, which highlights a common issue that arises when dealing with such domains – not all vision cells are considered important. In this case the model only considers v2, v3, v5, v6 and v9 to be important to the red key, but excludes v1, v4 and v7 (the 8th cell is excluded as the agent always occupies this cell). This is due to the dynamics of a particular domain, as it is possible that the agent never sees certain items in specific vision cells. For example, due to the fact that the first room in the domain in Figure 4.5a is a 2-cell wide hallway, it is unlikely for the agent to see the red key in any of the left-hand side vision grid cells, unless it approaches the key from the opposite side of the room from which it starts.

(a) Training domain with a 3×3 local agent view, labelled 1 through 9.
(b) hasRedKey decision tree model learned from the training domain shown in Figure 4.5a
Figure 4.5: Example of a decision tree model which fails to capture certain vision variables due to the dynamics of the training domain.

If we now transferred this option to a new domain in which it was possible to see the red key in one of those vision cells, the option would not be able to execute in those cases. While the framework is still learning the correct factored models given the input data, we can see how the dynamics of the training domain can hurt the generalisability of these models. To counteract this we introduce the idea of grouping similar sensors [Pierce and Kuipers 1997], or in our case similar state variables, as discussed in Section 3.3.

We begin by handcrafting our state variable groupings, although there is room for future work to extend our framework to use autonomous grouping [Pierce and Kuipers 1997]. It is possible to derive groups from the structure of the environment, by estimating objects based on their state distributions and grouping similar objects. The agent also has information as to which sensors it contains and could simply group individual sensors. For these experiments, though, we use the following handcrafted groupings:

• agent variables (direction, x-position, y-position)
• object variables (key collected, door opened)
• vision variables (vision cell 1, ..., vision cell n)

Since the domains that we consider have two sets of key-door pairs, we have two object groups, one for each pair. We note that we group all vision cells together as in this case our vision variables can be considered equivalent to a single front-facing camera on the agent. There may be other cases, such as if the agent had one front-facing and one rear-facing camera, where it would be reasonable to split the vision variables into two or more groups.

For these experiments, we consider two potential cases for the variable threshold: applying the grouping if the option depends on at least 50% of the group (t = 0.5), or applying the grouping if the option depends on at least one single variable from the group (t = 1/group size). For example, if a vision sensor consists of 20 individual, concurrent readings or variables, then in the first case, if 10 or more of these variables are considered important, all readings from this sensor will be added to the factored set. In the second case, only one variable from this group is required to be considered important for the threshold to be reached and the entire sensor grouping to be added to the factored set. Again, here our threshold selections remain quite simplistic and there is room for developing a more elaborate formulation for selecting the threshold in future work.

Figures 4.6 and 4.7 show the performance of each of the two grouping methods separately, while Figure 4.8 provides a comparison between the two methods. The comparison figure shows the performance of both methods for 1, 3 and 6 tasks, with the single variable threshold indicated by dotted lines and the half group threshold by solid lines. From this comparison we observe that grouping with only one item in the group learns faster initially (shown by the 1 task and 3 tasks performance), but converges to a lower performance than grouping when the factoring contains at least half of the group (6 tasks performance).
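As a concrete illustration of the two thresholds being compared here, the sketch below applies the Algorithm 2 grouping rule with a fractional threshold; the function, the variable names and the example groupings are illustrative assumptions rather than the dissertation's code.

def group_factoring(factored_set, groups, t):
    """If at least a fraction t of a sensor group's variables already
    appear in an option's factored set, add the whole group.
    groups: dict mapping a group name to a set of variable names."""
    for group in groups.values():
        count = len(group & factored_set)
        if count > 0 and count >= t * len(group):
            factored_set |= group
    return factored_set

# For a 5x5 vision grid (25 cells): the half-group threshold (t = 0.5)
# needs 13 cells in the factored set before the group is added, while the
# single-variable threshold (t = 1/25) fires as soon as any one cell is
# included -- and, applied to the agent group, also fires as soon as e.g.
# xPosition alone is considered important.
vision = {f"v{i}" for i in range(1, 26)}
groups = {"vision": vision,
          "agent": {"direction", "xPosition", "yPosition"}}
print(group_factoring({"v3", "xPosition"}, groups, t=1 / 25))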
From the factorings obtained we see that this occurs because, early in training, more than half of the vision grid cells are often not considered important. In these cases the half group threshold struggles, but usually at least one cell is considered important and so the single variable threshold is able to cope early in the learning process. As the agent sees more training domains it becomes more likely that more than half of the vision state variables will be considered important and thus grouped into the factoring. On the other hand, the single item grouping is able to deal with these early training cases, but ends up marking many of the irrelevant variables as important. For example, if the xPosition is considered important, then it will group all of the other agent-related variables as well, which means these factored options will struggle to generalise. We can see benefits from both methods, which once again reinforces the idea of developing a more complex method for selecting the threshold.

While running these experiments we found that the grid size of the agent's locally observable view played a large role in performance as well. Intuitively, the smaller the grid size, the more likely it is to generalise across new domains, as it requires a smaller area to be similar between the two domains. On the other hand, the performance benefit received from the options decreases when dealing with a smaller grid size, as the agent moves a shorter distance when executing each option. For example, with a 3×3 grid the agent can only execute the option to get the key if the key variable is visible in that grid, and so at most 3 grid cells away. We test this intuition by comparing the performance of agents with grids of size 3×3, 5×5 and 7×7, and find that the 5×5 grid displays the greatest benefit in terms of transfer performance. Thus for the experiments in this chapter that make use of the visible grid, we utilised a grid size of 5×5. For the individual learning curves from these experiments, see Appendix B.

(a) Average reward per episode after 1, 3 and 6 training tasks given to our framework, compared with the optimal option performance. Shaded areas indicate a single standard deviation.
(b) Learning curves for all numbers of tasks without confidence intervals for the sake of clarity
Figure 4.6: Average reward per episode when applying a grouping threshold of 1 item in the group. Graphs are plotted using a rolling average with a window size of 10 episodes and are averaged over 20 random seeds.

(a) Average reward per episode after 1, 3 and 6 training tasks given to our framework, compared with the optimal option performance. Shaded areas indicate a single standard deviation.
(b) Learning curves for all numbers of tasks without confidence intervals for the sake of clarity
Figure 4.7: Average reward per episode when applying a grouping threshold of at least half of the items in the group. Graphs are plotted using a rolling average with a window size of 10 episodes and are averaged over 20 random seeds.

Figure 4.8: Comparison between 1, 3 and 6 training tasks for both grouping methods. The dotted lines show the performance of the grouping threshold of 1 item, while the solid lines indicate the grouping threshold of at least half of the items.

4.5 Redundant and Noisy Variables

Next we explore two specific situations which are more akin to what an agent would encounter in real-world scenarios – domains containing multiple noisy or redundant variables.
First, we test the case where the domain contains multiple redundant variables, but remains deterministic. If we consider agents in real-world settings, there are typically a great number of objects present that are completely irrelevant to solving specific tasks. Thus if we wish to build fully autonomous agents, it is important that they are still able to transfer their learned skills to new environments where the configurations of these redundant variables or objects may differ. We showed in Section 4.4.1 that our framework can select the relevant variables from a small state space, but now we demonstrate that it is still able to do so, even when the complexity of the domain increases.

We set up our experiment in a similar fashion to the previous experiments with the local agent view, using a 5×5 vision grid. To generate redundant variables we append the distance between the agent and randomly selected wall cells to the state observation. We can consider this as equivalent to an agent that has information on the distance to various objects in the domain available, although most objects are not relevant to the current task. Given this new observation, we run our framework as described previously to learn a set of factored options. We then evaluate these options in a set of test domains, along with the other baseline algorithms. We adjust the episode length of these experiments from 500 to 1000, so that the change in performance as additional variables are added can be clearly seen.

Figure 4.9 shows the total reward accumulated by each algorithm, for various numbers of redundant variables, calculated as the area under the curve for each of the reward curves. We observe that the DOORMAX algorithm (labelled OOMDP) is able to scale well and outperforms any of the option-based methods. Despite this, the DOORMAX approach becomes infeasible as the domain size grows, as it becomes increasingly time-consuming to manually add new objects and their attributes to the OO-MDP. In contrast, our framework also scales with the addition of redundant variables, maintaining performance as more redundant variables are added, while requiring no additional handcrafting when new variables are introduced. It is also observed that the unfactored options and regular Q-Learning without options struggle, and their performance quickly diminishes as the complexity of the domain increases.

Figure 4.9: Comparison of total reward accumulated by different algorithms for domains containing various numbers of redundant variables. The bars on each point indicate one standard deviation, as each experiment is averaged over 10 random seeds.

Next we consider the case where the domain contains noisy, stochastic variables. We can consider this as equivalent to the case where an agent has a faulty sensor which provides it with noisy readings, or where the sensor is capturing a particular variable which inherently contains noise. In real-world settings this is a common occurrence, and it is both unreasonable and often infeasible to expect a human to prune out these noisy readings and manually select the relevant variables to focus on for each experiment. Again we expect the ability to effectively deal with such scenarios to be a key characteristic of fully autonomous systems. To generate these noisy variables we draw a sample from a normal distribution with mean µ = 10 and standard deviation σ = 10, where the value is rounded to 2 decimal places. This reading is then appended to the agent's state.
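A minimal sketch of how such noisy readings can be generated and appended to the observation is shown below; the sampling follows the µ = 10, σ = 10 distribution described above, while the wrapper function and its names are illustrative assumptions rather than the dissertation's implementation.

import numpy as np

def add_noisy_variables(state, n_noisy, rng):
    """Append n_noisy independent readings drawn from N(mu=10, sigma=10),
    rounded to 2 decimal places, to the state vector. A fresh sample is
    drawn every time this is called, i.e. at every environment step."""
    noise = np.round(rng.normal(loc=10.0, scale=10.0, size=n_noisy), 2)
    return np.concatenate([np.asarray(state, dtype=float), noise])

rng = np.random.default_rng(0)
# e.g. a 4-variable state augmented with 3 noisy, state-independent readings
obs = add_noisy_variables([1.0, 4.0, 0.0, 1.0], n_noisy=3, rng=rng)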
For each step that the agent takes, a new value is sampled and used to update the noisy variable in the state space. These variables are independent of the state, unlike the previous experiment, in which state features were used to generate the additional variables.

Figure 4.10 shows the total reward accumulated by each algorithm, for various numbers of noisy variables. We can see that given a stochastic domain with noise, the DOORMAX algorithm struggles to solve the domain. This is known to be one of the main limiting factors of DOORMAX, as it requires the transition dynamics of each action and attribute to be representable as a full binary tree with effects at the leaf nodes and propositions at the non-leaf nodes. In non-deterministic domains it is not possible to represent the transition dynamics in this way and so the correct dynamics cannot be learned [Diuk et al. 2008]. Similarly, the unfactored options and standard Q-Learning perform poorly in this setup, particularly as more variables are added and the complexity of the domain increases. In contrast, our algorithm demonstrates that it is able to scale well in the case of stochastic domains with noisy variables. We see that even as the complexity of the domains increases, we are able to maintain performance by learning to ignore the noisy variables.

Figure 4.10: Comparison of total reward accumulated for domains containing various numbers of noisy variables. The bars on each point indicate one standard deviation, as each experiment is averaged over 10 random seeds.

We provide the individual learning curves used to generate the total reward graphs seen in Figures 4.9 and 4.10 in Appendix C.

4.6 AI2-THOR

While experiments in the Minigrid domain have been useful as abstractions for testing separate components of our framework, we now show its effectiveness in a more realistic domain. AI2-Thor [Kolve et al. 2017] is a framework consisting of 120 interactive room-based environments in which an agent can act. These include kitchen, bedroom, bathroom and living room environments, which contain objects that can be manipulated and interacted with by an agent.

After each action taken in the environment, an Event is returned containing metadata pertaining to the agent, the objects and the current scene that the agent is in. We extract specific metadata to create the state space that is used during these experiments. The state space is composed of the agent's position (x and y co-ordinates), the rotation of the agent and metadata for each object. The object metadata includes the object location, whether the object is visible to the agent or not, and whether the object is currently being held by the agent (if the object is marked as pickupable). The action space available to the agent is as follows: RotateRight, which rotates the agent 90° to the right; RotateLeft, which rotates the agent 90° to the left; MoveAhead, which moves the agent forward by a fixed amount defined by the gridSize parameter, which we set to 0.25; PickupObject, which picks up the closest object within the agent's visibilityDistance that has the pickupable attribute; and PutObject, which places the held object into a valid receptacle within the agent's visibilityDistance. In the case where there are no pickupable objects within the agent's visibility, or there is no valid receptacle in which to place a held object, executing either of these actions results in no change in state.
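To illustrate how such a state vector can be assembled from the returned metadata, the sketch below uses the standard ai2thor Python API (Controller, step and event.metadata). The exact metadata keys can vary between AI2-THOR versions, and the helper function, its padding scheme and the chosen object types are assumptions for illustration, not the implementation used in this work.

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1", gridSize=0.25)

def build_state(event, object_types):
    """Concatenate the agent pose and per-object metadata (visibility,
    held status, distance) into a flat state vector; note that THOR's
    ground plane uses the x and z co-ordinates."""
    agent = event.metadata["agent"]
    state = [agent["position"]["x"], agent["position"]["z"],
             agent["rotation"]["y"]]
    objects = {o["objectType"]: o for o in event.metadata["objects"]}
    for obj_type in object_types:
        o = objects.get(obj_type)
        if o is None:                       # object absent from this scene
            state += [0.0, 0.0, -1.0]
        else:
            state += [float(o["visible"]), float(o["isPickedUp"]),
                      o["distance"]]
    return state

event = controller.step(action="RotateRight")
state = build_state(event, ["Pan", "StoveBurner", "Egg", "Fridge"])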
Although we have simplified our action space to be relevant to the specific tasks we give to our agent, the AI2-Thor framework provides several other actions which could be added in the future when dealing with more complex tasks, such as CleanObject or SliceObject, used when cooking. There are also additional actions to move the agent's camera angles, but for these experiments we fix the camera with cameraHorizon = 0.

For our experiments we work with the 30 kitchen environments, which we split into 20 training environments and 10 test environments. Figure 4.11 shows an example of one of the environments from the agent's view, as well as a top-down view of the environment. These environments consist of a 16×16 grid that the agent can move across, although not all grid locations are reachable as some contain counters or other objects that the agent cannot move through. Each kitchen environment contains several objects with which the agent can interact. We remove certain objects such as cabinets, drawers and windows from the state space to simplify the domain slightly. Table 4.2 shows a list of all possible objects for a given environment, along with their attributes (whether they are visible, whether they are currently picked up, their distance from the agent and whether they are a receptacle into which objects can be placed). Note that this is not the same as the attributes of an OO-MDP class, as these variables are provided directly by the environment and we simply concatenate them together to form the state space. There is no prior knowledge provided by the environment and so the variables are not grouped into classes according to similar effects as with OO-MDP objects.

Object: Apple Bowl Bread CounterTop Cup Egg Fork Fridge
visible X X X X X X X X
isPickedUp X X X X X X
distance X X X X X X X X
receptacle X X
Object: Microwave Mug Pan Plate Pot StoveBurner Spoon Toaster
visible X X X X X X X X
isPickedUp X X X X X X
distance X X X X X X X X
receptacle X X

Table 4.2: Table showing the objects and their attributes for the AI2-Thor environments. The state space is defined as a combination of these attributes concatenated into a single vector. Note that not all environments will always contain all objects and that the number of each object present may differ across environments.

Figure 4.11: Top-down view of an example AI2-Thor environment, alongside the view provided to the agent by its camera. The agent's location can be seen in the top left corner of the top-down view, circled in red.

These environments provide a grid four times larger than that of Minigrid for the agent to move across. The state space is also 12 times larger than that of Minigrid, due to all of the additional attributes that each object in the environment consists of. Despite these domains being technically large-scale gridworld domains, we believe that they are able to provide a more accurate representation of a real-world environment. This is because at a high level an agent in a real-world environment can treat it as a gridworld, just at a much finer granularity. If the agent moves and rotates at fixed increments, then this movement can be visualised as the agent moving over a very fine grid mapped over the environment. AI2-Thor provides similar functionality, whereby the agent can rotate by any specified number of degrees and move forward by a set distance. Although we do not make use of such fine actions for these experiments, they provide an interesting avenue to explore in future work.

We provide the agent with two sequential sub-tasks that may be required when attempting to cook an egg:
1. Pick up the pan
2. Place the pan on the stove

The reward function is defined as 100 once both sub-tasks have been completed, and −1 elsewhere. Again it is assumed that the agent has initial options available (pickupPan, placePan) for solving the first training environment.

We then set up our experiment in a similar fashion to the Minigrid experiments. First we randomly select a set of six training environments, which are given to the agent to solve. The agent generates trajectories when exploring these environments, and these trajectories are used to learn our factored options. We then evaluate the performance of these options after each number of seen training environments, by transferring the learned factored options to three randomly selected test environments and averaging the performance. Since the agent's start location plays a large role in how quickly it can reach the pan in these environments, we randomise the start location for each run and average over 10 runs for each test environment.

To evaluate the performance of the learned factored options we make use of SMDP Q-Learning [Sutton et al. 1999] with discount factor γ = 0.9, learning rate α = 0.1 and ε-greedy action selection with ε = 0.1. We train the agent for 2000 episodes with a maximum episode length of 500 steps. Figure 4.12 shows the average episode reward for our factored options for various numbers of training tasks, compared to the optimal option performance and learning without options. We can see that the performance increases as the agent is given new training domains and converges close to the performance of the optimal options on the test environments. Our factored options are still limited by the initial environment in which they were learned, which makes it difficult for them to reach the optimal performance. For example, if in the first training environment the agent had a large open space in front of the stove area, then the placePan option learned from this environment will struggle to generalise to a new environment where there are obstacles in front of the stove, and the agent may get stuck on these obstacles. Nevertheless, our factored options still display a clear benefit in accelerating learning in new environments. Due t