3. Electronic Theses and Dissertations (ETDs) - All submissions
Permanent URI for this community: https://wiredspace.wits.ac.za/handle/10539/45
Search Results (5 results)
Item: Dynamics generalisation in reinforcement learning through the use of adaptive policies (2024) - Beukman, Michael
Reinforcement learning (RL) is a widely used method for training agents to interact with an external environment, and is commonly used in fields such as robotics. While RL has achieved success in several domains, many methods fail to generalise well to scenarios different from those encountered during training. This is a significant limitation that hinders RL's real-world applicability. In this work, we consider the problem of generalising to new transition dynamics, corresponding to cases in which the effects of the agent's actions differ; for instance, walking on a slippery vs. rough floor. To address this problem, we introduce a neural network architecture, the Decision Adapter, which leverages contextual information to modulate the behaviour of an agent, depending on the setting it is in. In particular, our method uses the context – information about the current environment, such as the floor's friction – to generate the weights of an adapter module which influences the agent's actions. This, for instance, allows an agent to act differently when walking on ice compared to gravel. We theoretically show that our approach generalises a prior network architecture and empirically demonstrate that it results in superior generalisation performance compared to previous approaches in several environments. Furthermore, we show that our method can be applied to multiple RL algorithms, making it a widely applicable approach to improve generalisation.
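The context-conditioned weight generation described in this abstract can be pictured with a short, self-contained sketch. The NumPy snippet below only illustrates the general idea (a hypernetwork that maps a context vector to the weights of a small adapter layer applied to the agent's hidden features); the dimensions, the residual connection, and all variable names are assumptions made for the example, not details taken from the thesis.

```python
import numpy as np

# Minimal sketch of a context-conditioned adapter (hypernetwork-style idea).
# A context vector (e.g. floor friction) is mapped to the weights of a small
# adapter layer, which then modulates the agent's hidden features.
# All dimensions and initialisation choices are illustrative assumptions.

rng = np.random.default_rng(0)

CTX_DIM = 2        # e.g. [friction, incline]
HIDDEN_DIM = 8     # size of the agent's hidden representation

# Hypernetwork parameters: map context -> flattened adapter weights and bias.
W_hyper = rng.normal(scale=0.1, size=(CTX_DIM, HIDDEN_DIM * HIDDEN_DIM))
b_hyper = rng.normal(scale=0.1, size=(CTX_DIM, HIDDEN_DIM))

def adapter_forward(hidden: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Apply a context-generated adapter layer to the agent's hidden features."""
    W_adapter = (context @ W_hyper).reshape(HIDDEN_DIM, HIDDEN_DIM)
    b_adapter = context @ b_hyper
    # Residual connection keeps the base policy's features when the adapter is small.
    return hidden + np.tanh(hidden @ W_adapter + b_adapter)

hidden = rng.normal(size=HIDDEN_DIM)
ice = np.array([0.05, 0.0])     # low-friction context
gravel = np.array([0.9, 0.0])   # high-friction context

print(adapter_forward(hidden, ice))
print(adapter_forward(hidden, gravel))
```

The same hidden features produce different adapted features under the two contexts, which is the kind of context-dependent behaviour (ice versus gravel) the abstract describes.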
Item: Improving reinforcement learning with ensembles of different learners (2021) - Crafford, Gerrie
Different reinforcement learning methods exist to address the problem of combining multiple different learners to generate a superior learner, from ensemble methods to policy reuse methods. These methods usually assume that each learner uses the same algorithm and/or state representation and often require learners to be pre-trained. This assumption prevents very different types of learners, that can potentially complement each other well, from being used together. We propose a novel algorithm, Adaptive Probabilistic Ensemble Learning (APEL), which is an ensemble learner that combines a set of base reinforcement learners and leverages the strengths of the different base learners online, while remaining agnostic to the inner workings of the base learners, thereby allowing it to combine very different types of learners. The ensemble learner selects the base learners that perform best on average by keeping track of the performance of the base learners and then probabilistically selecting a base learner for each episode according to the historical performance of the base learners. Along with a description of the proposed algorithm, we present a theoretical analysis of its behaviour and performance. We demonstrate the proposed ensemble learner's ability to select the best base learner on average, combine the strengths of multiple base learners, including Q-learning, deep Q-network (DQN), Actor-Critic with Experience Replay (ACER), and learners with different state representations, as well as its ability to adapt to changes in base learner performance on grid world navigation tasks, the Cartpole domain, and the Atari Breakout domain. The effect that the ensemble learner's hyperparameter has on its behaviour and performance is also quantified through different experiments.
Item: Skill discovery from multiple related demonstrators (2018) - Ranchod, Pravesh
An important ability humans have is that we can recognise that some collections of actions are useful in multiple tasks, allowing us to exploit these skills. A human who can run while playing basketball does not need to relearn this ability when he is playing soccer, as he can employ his previously learned running skill. We extend this idea to the task of Learning from Demonstration (LfD), wherein an agent must learn a task by observing the actions of a demonstrator. Traditional LfD algorithms learn a single task from a set of demonstrations, which limits the ability to reuse the learned behaviours. We instead recover all the latent skills employed in a set of demonstrations. The difficulty involved lies in determining which collections of actions in the demonstrations can be grouped together and termed “skills”. We use a number of characteristics observed in studies of skill discovery in children to guide this segmentation process – usefulness (they lead to some reward), chaining (we tend to employ certain skills in common combinations), and reusability (the same skill will be employed in many different contexts). We use reinforcement learning to model goal-directed behaviour, hidden Markov models to model the links between skills, and nonparametric Bayesian clustering to model reusability in a potentially infinite set of skills. We introduce nonparametric Bayesian reward segmentation (NPBRS), an algorithm that is able to segment demonstration trajectories into component skills, using inverse reinforcement learning to recover reward functions representing the skill objectives. We then extend the algorithm to operate in domains with continuous state spaces for which the transition model is not specified, with the algorithm successfully recovering component skills in a number of simulated domains. Finally, we perform an experiment on CHAMP, a physical robot tasked with making various drinks, and demonstrate that the algorithm is able to recover useful skills in a robot domain.
Item: Representation discovery using a fixed basis in reinforcement learning (2016) - Wookey, Dean Stephen
In the reinforcement learning paradigm, an agent learns by interacting with its environment. At each state, the agent receives a numerical reward. Its goal is to maximise the discounted sum of future rewards. One way it can do this is through learning a value function: a function which maps states to the discounted sum of future rewards. With an accurate value function and a model of the environment, the agent can take the optimal action in each state. In practice, however, the value function is approximated, and performance depends on the quality of the approximation. Linear function approximation is a commonly used approximation scheme, where the value function is represented as a weighted sum of basis functions or features. In continuous state environments, there are infinitely many such features to choose from, introducing the new problem of feature selection. Existing algorithms such as OMP-TD are slow to converge, scale poorly to high dimensional spaces, and have not been generalised to the online learning case. We introduce heuristic methods for reducing the search space in high dimensions that significantly reduce computational costs and also act as regularisers. We extend these methods and introduce feature regularisation for incremental feature selection in the batch learning case, and show that introducing a smoothness prior is effective with our SSOMP-TD and STOMP-TD algorithms. Finally, we generalise OMP-TD and our algorithms to the online case and evaluate them empirically.
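The abstract above rests on linear value-function approximation, where V(s) is approximated as a weighted sum of basis functions, V(s) ≈ w1*phi1(s) + ... + wk*phik(s), and on OMP-TD-style feature selection, in which candidate features are scored by how strongly they correlate with the current temporal-difference (Bellman) residual. The NumPy sketch below shows only that scoring step on synthetic data; the random basis, sample sizes, and constants are illustrative assumptions, and it is not an implementation of the thesis's SSOMP-TD or STOMP-TD algorithms.

```python
import numpy as np

# Sketch of OMP-TD-style feature scoring: rank candidate basis functions by the
# magnitude of their correlation with the current TD residual. Data and basis
# are synthetic placeholders, not the thesis's SSOMP-TD / STOMP-TD methods.

rng = np.random.default_rng(1)
gamma = 0.99

n_samples, n_active, n_candidates = 500, 4, 16

# Transition samples: feature matrices for s and s', plus rewards.
Phi = rng.normal(size=(n_samples, n_active))        # active features at s
Phi_next = rng.normal(size=(n_samples, n_active))   # active features at s'
rewards = rng.normal(size=n_samples)

# Current weights for the active features (e.g. fitted by TD learning or LSTD).
w = rng.normal(size=n_active)

# TD residual under the current approximation: r + gamma * V(s') - V(s).
td_residual = rewards + gamma * (Phi_next @ w) - (Phi @ w)

# Candidate features evaluated at the sampled states s.
Cand = rng.normal(size=(n_samples, n_candidates))

# Score each candidate by its (unnormalised) correlation with the TD residual.
scores = np.abs(Cand.T @ td_residual) / n_samples

best = int(np.argmax(scores))
print(f"candidate {best} has the largest residual correlation: {scores[best]:.3f}")
```

The candidate with the highest score is the one that would be added to the active basis next; repeating this step is what makes the feature selection incremental.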
Item: Reinforcement learning with parameterized actions (2016) - Masson, Warwick Anthony
In order to complete real-world tasks, autonomous robots require a mix of fine-grained control and high-level skills. A robot requires a wide range of skills to handle a variety of different situations, but must also be able to adapt its skills to handle a specific situation. Reinforcement learning is a machine learning paradigm for learning to solve tasks by interacting with an environment. Current methods in reinforcement learning focus on agents with either a fixed number of discrete actions or a continuous set of actions. We consider the problem of reinforcement learning with parameterized actions: discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. By representing actions in this way, we have the high-level skills given by discrete actions and the adaptability given by the parameters for each action. We introduce the Q-PAMDP algorithm for model-free learning in parameterized action Markov decision processes. Q-PAMDP alternates between learning which discrete actions to use in each state and which parameters to use in those states. We show that under weak assumptions, Q-PAMDP converges to a local maximum. We compare Q-PAMDP with a direct policy search approach in the goal and Platform domains. Q-PAMDP outperforms direct policy search in both domains.
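The defining structure of Q-PAMDP, as described above, is the alternation between learning values for the discrete actions (with their parameters held fixed) and improving each action's continuous parameters (with the discrete policy held fixed). The toy NumPy sketch below illustrates that alternation on a single-state, bandit-style problem; the reward function, the finite-difference hill-climbing update, and all constants are assumptions made for the illustration and are not the thesis's algorithm or evaluation domains.

```python
import numpy as np

# Toy illustration of a Q-PAMDP-style alternation: learn Q-values over the
# discrete actions with their parameters held fixed, then improve each action's
# continuous parameter with the discrete policy held fixed. The single-state
# reward function and the hill-climbing update are illustrative assumptions.

rng = np.random.default_rng(2)

targets = np.array([0.3, 0.8])   # hidden optimal parameter for each discrete action
bases = np.array([0.0, 0.5])     # action 1 is better once its parameter is tuned

def reward(action: int, param: float) -> float:
    return bases[action] - (param - targets[action]) ** 2 + 0.01 * rng.normal()

Q = np.zeros(2)          # value estimate per discrete action
params = np.zeros(2)     # current continuous parameter per discrete action
alpha, step, eps = 0.1, 0.05, 0.1

for outer in range(50):
    # Phase 1: update Q for the discrete actions, parameters fixed (epsilon-greedy).
    for _ in range(100):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q))
        Q[a] += alpha * (reward(a, params[a]) - Q[a])

    # Phase 2: improve each action's parameter by finite-difference hill climbing.
    for a in range(2):
        up = np.mean([reward(a, params[a] + step) for _ in range(20)])
        down = np.mean([reward(a, params[a] - step) for _ in range(20)])
        params[a] += step * np.sign(up - down)

print("learned Q-values:", np.round(Q, 3))
print("learned parameters:", np.round(params, 3), "targets:", targets)
```

After a few outer iterations each action's parameter approaches its hidden target and the Q-values identify the better-parameterised action, mirroring the alternation the abstract describes.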