Leveraging prior knowledge for sample efficient reinforcement learning
Date
2021
Authors
Marom, Ofir
Abstract
A major goal in the field of Artificial Intelligence (AI) is the construction of autonomous agents that make effective decisions to complete real-world tasks. A
promising mathematical framework to achieve this longstanding goal in AI is through
the Reinforcement Learning (RL) paradigm. In RL an agent learns to maximise the
long-term expected rewards it can receive from interacting with its environment.
The agent achieves this by learning a policy that determines the optimal action
to take in any given state so as to attain this maximal payoff.
For reasons of computational tractability, an RL task is typically described in terms
of a Markov Decision Process (MDP). While there exist many algorithms with provable guarantees of convergence to an optimal policy given an MDP representation
of an RL task, in practice there are numerous challenges that limit the applicability of RL to real-world problems.
One of the keys to overcoming these challenges is to improve the sample efficiency
of RL algorithms. In RL, a sample refers to some action taken by the agent in its
environment that leads the agent to gain some experience. The agent learns from
this experience and, with enough samples, is able to learn an optimal policy for
behaving in the environment. A sample efficient algorithm would extract maximum
value from the smallest number of samples, thus improving the convergence rate to
an optimal policy.
This thesis takes a step towards improving the sample efficiency of RL algorithms by
leveraging prior knowledge. The term prior knowledge refers to any knowledge that
the agent has about a task before starting to solve that task. Various techniques
have demonstrated that injecting useful prior knowledge into RL algorithms can
dramatically improve sample efficiency.
A natural question that arises with regard to prior knowledge is how such knowledge can be attained in the first place. While many techniques in RL treat prior
knowledge as given, it is ideal if this knowledge can be learned. The research area
of transfer learning provides a mechanism to learn prior knowledge. In the transfer learning research problem, an agent attains prior knowledge from solving some
source tasks that can then be applied to a related target task to improve learning
performance.
The main contributions of this thesis are therefore as follows: firstly, we present
a framework called Belief Reward Shaping (BRS) that can be incorporated with
any model-free RL algorithm. BRS leverages prior knowledge about a task by augmenting the environment reward distribution with a prior distribution that encodes
prior beliefs about useful sub-tasks to complete within a task. Provided this prior
distribution encodes knowledge that is useful for solving the task, the agent can
solve the task in a more sample efficient manner, as it is guided by this prior
knowledge to complete specific sub-tasks within the overall task.
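To make the idea concrete, a minimal Python sketch of belief-based shaping follows. The interface and the conjugate-Gaussian update are illustrative assumptions made for this example, not the exact BRS formulation from the thesis: a prior mean over (state, action) rewards encodes beliefs about useful sub-tasks, and the shaped reward handed to a model-free learner is the posterior mean, so the prior guides the agent early on while observed rewards dominate later.

```python
import numpy as np


class BeliefRewardShaper:
    """Sketch: keep a Gaussian belief over the reward of each (state, action)
    pair, seeded with prior means that encode useful sub-tasks. The shaped
    reward is the posterior mean. (Hypothetical interface, not the thesis's
    exact BRS update.)"""

    def __init__(self, prior_mean, prior_strength=10.0):
        self.prior_mean = prior_mean          # dict: (state, action) -> believed reward
        self.prior_strength = prior_strength  # pseudo-count weight given to the prior
        self.counts = {}
        self.sums = {}

    def shaped_reward(self, state, action, env_reward):
        key = (state, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        self.sums[key] = self.sums.get(key, 0.0) + env_reward
        mu0 = self.prior_mean.get(key, 0.0)
        n0, n = self.prior_strength, self.counts[key]
        # Posterior mean of a conjugate Gaussian belief (known variance):
        # a weighted average of the prior belief and the observed rewards.
        return (n0 * mu0 + self.sums[key]) / (n0 + n)
```

A Q-learning agent, for instance, would simply substitute shaped_reward(s, a, r) for the raw environment reward in its update rule, leaving the rest of the algorithm unchanged.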
Secondly, we present an object-oriented formalism for RL called the Deictic Object-Oriented MDP (Deictic OO-MDP). Deictic OO-MDPs are based on the notion of
a deictic predicate, which is a predicate that is grounded with respect to a single
reference object and relates that object to lifted object classes. For certain domains,
it is possible to use the Deictic OO-MDP formalism to efficiently learn a model
of the transition dynamics of an environment that transfers across all tasks of a
given domain. In addition, we extend the previously introduced Propositional OO-MDP formalism to learn a transferable model of the reward dynamics as well. This
improves sample efficiency because the agent can reuse these models for any given
task of the domain, rather than having to waste samples relearning them for each task.
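As a rough illustration of a deictic predicate, the Python sketch below uses a Taxi-like gridworld; the object classes, attribute names and effect rule are assumptions made for the example rather than the thesis's exact formalism. The predicate is grounded with respect to a single reference object (the taxi) and lifted over an object class (walls), so a transition rule expressed in terms of it transfers across every grid layout of the domain.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    cls: str                              # lifted object class, e.g. "taxi" or "wall"
    attrs: dict = field(default_factory=dict)


def touch_north(ref, objects):
    """Deictic predicate: grounded w.r.t. the single reference object `ref`
    and lifted over the related object class ("wall")."""
    return any(o.cls == "wall"
               and o.attrs["x"] == ref.attrs["x"]
               and o.attrs["y"] == ref.attrs["y"] + 1
               for o in objects)


def predict_y_after_north(taxi, objects):
    """A task-independent effect rule for the taxi's y attribute under the
    action "north": it mentions only deictic predicates and attribute
    changes, so it can be reused in any task (grid layout) of the domain."""
    return taxi.attrs["y"] if touch_north(taxi, objects) else taxi.attrs["y"] + 1
```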
Thirdly, we present an algorithm to efficiently learn likely-admissible heuristics from
some source tasks of a domain. Once learned, the heuristic can be transferred to
be used with a heuristic-based planning algorithm to produce likely-optimal plans
for a new target task of the domain. Our approach utilises the notion of epistemic
and aleatoric uncertainty to achieve this. We use epistemic uncertainty to efficiently
explore task-space and generate source tasks that are at the right level of difficulty to learn from.
This approach to generating source tasks is more sample efficient than previous
approaches that generated source tasks randomly. We further combine epistemic
and aleatoric uncertainty to ensure that when the heuristic is used to solve source
tasks during training, it plans with a value that is likely-admissible. This ensures
that the final heuristic produces likely-optimal plans with low suboptimality.
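The Python sketch below illustrates, under assumed interfaces (an ensemble whose members expose hypothetical predict_mean and predict_variance methods), how epistemic and aleatoric uncertainty might be combined for this purpose: disagreement across ensemble members supplies the epistemic signal used to pick source tasks, and a lower quantile of the predicted cost-to-go serves as a likely-admissible heuristic that is unlikely to overestimate the true cost. It is a sketch of the idea, not the thesis's exact models.

```python
import numpy as np


def predictive_moments(ensemble, features):
    """Combine an ensemble of cost-to-go regressors into a predictive mean,
    an epistemic variance (spread of the member means) and an aleatoric
    variance (average predicted noise). Member methods are hypothetical."""
    means = np.array([m.predict_mean(features) for m in ensemble])
    alea_vars = np.array([m.predict_variance(features) for m in ensemble])
    mean = means.mean(axis=0)
    epistemic_var = means.var(axis=0)
    aleatoric_var = alea_vars.mean(axis=0)
    return mean, epistemic_var, aleatoric_var


def likely_admissible_heuristic(ensemble, features, z=1.64):
    """Lower quantile of the predicted cost-to-go: with the chosen one-sided
    confidence (z = 1.64, roughly 95%), the heuristic is unlikely to exceed
    the true cost, i.e. it is likely-admissible."""
    mean, epi, alea = predictive_moments(ensemble, features)
    return np.maximum(mean - z * np.sqrt(epi + alea), 0.0)
```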
Description
A thesis submitted in fulfilment of the requirements for the degree Doctor of Philosophy to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2021