Leveraging prior knowledge for sample efficient reinforcement learning


Date

2021

Authors

Marom, Ofir

Abstract

A major goal in the field of Artificial Intelligence (AI) is the construction of autonomous agents that make effective decisions to complete real-world tasks. A promising mathematical framework for achieving this longstanding goal is the Reinforcement Learning (RL) paradigm. In RL an agent learns to maximise the long-term expected reward it can receive from interacting with its environment. The agent achieves this by learning a policy that determines the optimal action to take in any given state so as to obtain this maximal payoff. For reasons of computational tractability, an RL task is typically described in terms of a Markov Decision Process (MDP). While there exist many algorithms with provable guarantees of convergence to an optimal policy given an MDP representation of an RL task, in practice there are numerous challenges that limit the applicability of RL to real-world problems. One of the keys to overcoming these challenges is to improve the sample efficiency of RL algorithms. In RL, a sample refers to an action taken by the agent in its environment that provides the agent with some experience. The agent learns from this experience and, with enough samples, is able to learn an optimal policy for behaving in the environment. A sample efficient algorithm extracts maximum value from the smallest number of samples, thus improving the rate of convergence to an optimal policy.

This thesis takes a step towards improving the sample efficiency of RL algorithms by leveraging prior knowledge. The term prior knowledge refers to any knowledge that the agent has about a task before starting to solve it. Various techniques have demonstrated that injecting useful prior knowledge into RL algorithms can dramatically improve sample efficiency. A natural question that arises with regard to prior knowledge is how such knowledge can be attained in the first place. While many techniques in RL treat prior knowledge as given, it is ideal if this knowledge can be learned. The research area of transfer learning provides a mechanism to learn prior knowledge: in the transfer learning problem, an agent attains prior knowledge from solving some source tasks and then applies it to a related target task to improve learning performance.

The main contributions of this thesis are as follows. Firstly, we present a framework called Belief Reward Shaping (BRS) that can be incorporated into any model-free RL algorithm. BRS leverages prior knowledge about a task by augmenting the environment reward distribution with a prior distribution that encodes prior beliefs about useful sub-tasks to complete within the task. Provided this prior distribution encodes knowledge that is useful for solving the task, the agent can solve the task in a more sample efficient manner, as it is guided by this prior knowledge to achieve specific sub-tasks within the task.

Secondly, we present an object-oriented formalism for RL called the Deictic Object-Oriented MDP (Deictic OO-MDP). Deictic OO-MDPs are based on the notion of a deictic predicate: a predicate that is grounded with respect to a single reference object and relates that object to lifted object classes. For certain domains, the Deictic OO-MDP formalism makes it possible to efficiently learn a model of the transition dynamics of an environment that transfers across all tasks of a given domain.
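To make the idea of a deictic predicate concrete, the following is a minimal sketch in Python. The Taxi-like grid domain, the class names and the attributes are illustrative assumptions and are not taken from the thesis itself; the point is only that the predicate is grounded on a single reference object (the taxi) while ranging over a lifted object class (walls).

    # Minimal sketch of a deictic predicate (illustrative assumptions only).
    from dataclasses import dataclass
    from typing import Iterable

    @dataclass
    class Taxi:   # the single reference object the predicate is grounded on
        x: int
        y: int

    @dataclass
    class Wall:   # a lifted object class the predicate ranges over
        x: int
        y: int

    def wall_north_of_taxi(taxi: Taxi, walls: Iterable[Wall]) -> bool:
        """True iff any wall sits immediately north of the reference taxi."""
        return any(w.x == taxi.x and w.y == taxi.y + 1 for w in walls)

Because a transition rule such as "if wall_north_of_taxi is false, the action north increments taxi.y" never mentions a concrete grid layout, a model written over such predicates can transfer across every task of the domain.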
In addition, we extend the previously introduced Propositional OO-MDP formalism to learn a transferable model of the reward dynamics as well. This improves sample efficiency because the agent can reuse these models for any task of the domain, rather than wasting samples relearning them for each task.

Thirdly, we present an algorithm to efficiently learn likely-admissible heuristics from source tasks of a domain. Once learned, the heuristic can be transferred and used with a heuristic-based planning algorithm to produce likely-optimal plans for a new target task of the domain. Our approach utilises the notions of epistemic and aleatoric uncertainty to achieve this. We use epistemic uncertainty to efficiently explore task-space and generate source tasks that are at the right level of difficulty to learn from; this approach to generating source tasks is more sample efficient than previous approaches that generated source tasks randomly. We further combine epistemic and aleatoric uncertainty to ensure that when the heuristic is used to solve source tasks during training, it plans with values that are likely-admissible. This ensures that the final heuristic produces likely-optimal plans with low suboptimality.
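As a rough sketch of how epistemic and aleatoric uncertainty might be combined to obtain a likely-admissible heuristic value, consider the following Python fragment. The ensemble-of-predictors setup, the Gaussian assumption and the quantile rule are assumptions made for the sake of illustration, not the thesis's actual procedure.

    # Sketch: a conservative heuristic estimate from an ensemble of models,
    # each assumed to return a (mean, variance) prediction of cost-to-go.
    import math

    def likely_admissible_h(features, ensemble, z=1.64):
        """Disagreement between ensemble members gives epistemic variance;
        the averaged predicted variance gives aleatoric variance. Taking a
        low quantile of the combined distribution yields a value that, with
        high probability, does not overestimate the true cost-to-go."""
        means, alea_vars = zip(*(model(features) for model in ensemble))
        mu = sum(means) / len(means)
        epi_var = sum((m - mu) ** 2 for m in means) / len(means)  # epistemic
        alea_var = sum(alea_vars) / len(alea_vars)                 # aleatoric
        return max(0.0, mu - z * math.sqrt(epi_var + alea_var))

Keeping the estimate below the true cost-to-go with high probability is what makes the resulting plans likely-optimal, since an admissible heuristic never leads a planner such as A* to prune the optimal path.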

Description

A thesis submitted in fulfilment of the requirements for the degree Doctor of Philosophy to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2021

