The Effects of Node Removal on Bayesian Network Resilience for ATM Network Transaction Vulnerabilities

Gcobisile Matafeni
709637

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, in fulfillment of the requirements for the degree of Master of Science.

Supervised by Prof. Ritesh Ajoodha and Dr. Seun Olukanmi

May 30, 2025

Abstract

We investigate the evaluation of influence relationships in probabilistic graphical models, focusing on the impact of node removal (mutilation) within Bayesian networks. The central problem addressed is understanding how the joint probability distribution and influence structure among interconnected variables evolve when a subset of nodes is removed, an issue relevant to many real-world systems experiencing disruptions. We model these dynamics using Bayesian learning to provide insights into network resilience and dependencies. To explore these effects, we generate synthetic Bayesian network structures that are tree-like, sparse, and dense, each representing a different real-world configuration found in machine learning, sensor networks, and financial modeling. Conditional Probability Distributions (CPDs) are assigned to nodes based on the Bernoulli distribution. The Kullback-Leibler (KL) divergence quantifies the deviations in influence structures post-removal, and structure recovery is evaluated using an exact inference technique. Our findings indicate that each network type exhibits a distinct response to node removal: tree-like structures stabilize quickly with increased data, sparse structures show higher sensitivity but recover efficiently, and dense structures offer robustness through redundancy, though they demand larger datasets. These findings have significant implications for optimizing complex systems, particularly those requiring resilient network architectures.
As a real-world application, we model ATM transaction networks to analyze how the removal of ATMs (due to vandalism, load shedding, or maintenance) impacts transaction flows. Our results show that high-traffic ATMs serve as critical nodes, significantly influencing neighboring ATMs when removed. By applying Bayesian structure learning, we demonstrate that optimal ATM network configurations can be identified to minimize disruption and improve financial service resilience. This study contributes to the growing field of probabilistic graphical models by introducing a novel approach to understanding influence dynamics in mutilated networks. It provides practical insights and lays a foundation for further research into complex systems where node integrity and network stability are critical for decision-making and operational efficiency.

Keywords: Bayesian network, probabilistic graphical models, mutilation, influence relationships, node removal, ATM network resilience, Kullback-Leibler divergence, structure learning

Declaration

I, Gcobisile Matafeni, hereby declare the contents of this master's dissertation to be my own work. This dissertation is submitted for the degree of Master of Science in Computer Science at the University of the Witwatersrand. This work has not been submitted to any other university, or for any other degree.

Signature: Gcobisile Matafeni

Acknowledgements

I would like to express my sincere gratitude to my supervisor, Prof. Ritesh Ajoodha, and my co-supervisor, Dr. Seun Olukanmi, for their unwavering support, invaluable guidance, and continuous motivation throughout my research journey. Their expert insights and constructive feedback have been instrumental in shaping the direction and quality of this dissertation. I am also deeply grateful to Rudzani Mulaudzi from the Explainable AI Lab, whose assistance with my coding challenges provided crucial breakthroughs when I felt stuck.
His willingness to share his expertise significantly enhanced my understanding and implementation of the experimental framework. My heartfelt thanks go to Ankur Ankan and Abinash Panda for their efforts in developing the Python pgmpy framework, which served as the foundation for the experimental implementation in this study. Their work not only facilitated my research but also enriched my practical experience with advanced probabilistic graphical models. Finally, I wish to acknowledge my family and friends, whose continuous encouragement, understanding, and support have been a constant source of strength. I am especially thankful to my family, to whom I dedicate this dissertation, for believing in my potential and for always being there for me.

Contents

Abstract
Declaration
Acknowledgements
List of Figures
List of Tables

1 Introduction
2 Background and Related Work
  2.1 Introduction
  2.2 Reasoning Patterns in Bayesian Modeling
  2.3 Forecasting Techniques for ATM Networks
  2.4 Optimizing Cash Flow and Resource Allocation
  2.5 Related Work
    2.5.1 Mapping Influence in Complex Systems
    2.5.2 Conclusion
3 Methodology
  3.1 Problem Formulation
  3.2 Research Questions and Experimental Setup
    3.2.1 Experimental Setup
  3.3 Application Domain
    3.3.1 Modeling Influence Relationships
  3.4 Mutilating the Network
    3.4.1 Learning
  3.5 Evaluation
  3.6 Modeling the Influence of Transaction Volumes between ATMs
    3.6.1 Data Preprocessing
    3.6.2 Data and Ethical Implications
    3.6.3 Practical Implementation and Challenges
  3.7 Conclusion
4 Experimental Results and Discussion
  4.1 Research Questions 1 and 2: What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network? & What are the Implications of Varying Structural Complexity on the Recovery of Bayesian Networks and their KL Divergence?
    4.1.1 Tree-like Structures
    4.1.2 Sparse Structures
    4.1.3 Dense Structures
    4.1.4 Comparative Analysis and Synthesis Across Network Models
  4.2 Research Question 3: How can Insights from Bayesian Networks Improve the Operational Efficiency and Planning of ATM Networks?
    4.2.1 Comparing Original and Reduced Networks
  4.3 Conclusion
5 Conclusion and Future Work
References

List of Figures

1.1 A bombed/vandalised Automated Teller Machine (The Citizen, 2017).
2.1 Different influence structures that depict the influence relationships of node Z with other nodes.
2.2 A Bayesian network influence structure.
2.3 A Bayesian network influence structure with a removed node (Z).
3.1 An influence structure between stochastic processes. The red arrows indicate the influence between these processes while the solid arrows indicate the parameters of each node.
3.2 A visual representation of a dataset with θ representing the parameters.
3.3 An influence structure between stochastic processes with a reduced graph structure.
3.4 A distribution of ATM networks in the Johannesburg region for a South African bank. The yellow circles with white stars in the center indicate an individual ATM; the red crosses on these ATMs indicate a hypothetical event of removing an ATM from a network.
4.1 10 randomly generated tree-like DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.2 A snapshot of the top 10 samples that are generated from one of the GTD DAGs.
4.3 A plot of the training data versus the KL divergence for GTDs with 150 nodes comparing the Tree Search and MWST algorithms.
4.4 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning.
4.5 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning with mutilation.
4.6 A plot of the training data versus the KL divergence for GTDs with 50 nodes using MWST for structure learning.
4.7 A plot of the training data versus the KL divergence for GTDs with 50 nodes using MWST for structure learning with mutilation.
4.8 A plot of the training data versus the KL divergence for GTDs with 100 nodes using MWST for structure learning.
4.9 A plot of the training data versus the KL divergence for GTDs with 100 nodes using MWST for structure learning with mutilation.
4.10 A plot of the training data versus the KL divergence for GTDs with 150 nodes using MWST for structure learning.
4.11 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning with mutilation.
4.12 10 randomly generated sparse DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.13 A snapshot of the top 10 samples that are generated from one of the GTD DAGs.
4.14 A plot of the training data versus the KL divergence for GTDs with 10 nodes comparing the BIC, KIC and BDScore for scoring while using the Hill Climb Search algorithm for structure learning.
4.15 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for scoring.
4.16 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.17 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for scoring.
4.18 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.19 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for scoring.
4.20 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.21 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for scoring.
4.22 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.23 10 randomly generated dense DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.24 A plot of the training data versus the KL divergence for GTDs with 10 nodes comparing the BIC, KIC and BDScore for scoring while using the Hill Climb Search algorithm for structure learning.
4.25 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for scoring.
4.26 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.27 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for scoring.
4.28 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.29 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for scoring.
4.30 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.31 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for scoring.
4.32 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.33 A plot of the total transaction volumes for 156 ATMs starting from January 2021 until November 2024.
4.34 A heatmap of transaction volumes for the first 30 days for the first 10 ATMs.
4.35 A box plot of transaction volumes for the first 5 ATMs.
4.36 A plot of the distribution of transaction volumes for ATM 1.
4.37 A snapshot of the quantized ATM transaction data.
4.38 A recovered Bayesian Network for the first 20 ATMs.
4.39 A recovered Bayesian Network for the first 20 ATMs overlaid on a GIS map with their longitude and latitude.
4.40 Evaluation metrics applied on 3 ATMs.
4.41 Reduced graph structure with the removal of ATM 15.
4.42 Reduced graph structure with the removal of ATM 15 overlaid on a GIS map.
4.43 Evaluation metrics on two ATMs after the graph is reduced.
4.44 A recovered Bayesian Network for the first 156 ATMs.
4.45 Snapshot of a recovered Bayesian Network for 156 ATMs.
4.46 Evaluation metrics applied on ATM 62, ATM 105, ATM 3 and ATM 20.
4.47 Reduced graph structure with the removal of ATM 15.
4.48 Reduced graph structure with the removal of ATM 15.
4.49 Evaluation metrics applied on ATM 62, ATM 105, ATM 3 and ATM 20.
5.1 An influence structure between stochastic processes with a reduced graph structure for a temporal model set-up.

List of Tables

2.1 A table comparing different forecasting techniques that were used by other authors.
2.2 A table comparing different optimization techniques that were used by other authors.
2.3 A table comparing different influence modeling techniques that were used by other authors.

Chapter 1

Introduction

Numerous scientific phenomena can be conceptualized as stochastic processes, which characterize the evolution of random variables over time or space. Employed for modeling and analyzing random occurrences, stochastic processes offer insights into the interactions of complex systems amidst uncertainty. By treating each phenomenon as a distinct stochastic process with its own variables, we can discern the relationships and influences between them. The relationships between these variables can be expressed with a joint probability distribution that models the influence among them. Given a set of these variables and a joint probability distribution that describes the influence between them, how would removing a variable or a subset of variables affect the influence structure and the new joint probability distribution of the remaining variables?

A grouping of Automated Teller Machines (ATMs) can be viewed through the lens of stochastic processes. ATMs, which are prevalent in society, serve as key tools for financial transactions. They offer customers the ability to withdraw and deposit money, providing convenience and various other benefits to both banks and customers. Due to their widespread deployment across geographically dispersed locations, ATMs present banks with logistical and financial optimization challenges. Analyzing the connections between ATMs can yield insights applicable to a range of issues encountered by financial institutions.
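The idea that a joint probability distribution "models the influence" between variables can be stated precisely. A Bayesian network over variables $X_1, \dots, X_n$, with a directed acyclic graph assigning each node a parent set $\mathrm{Pa}(X_i)$, factorizes the joint distribution as

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr),
```

so removing a variable deletes its factor, and the remaining factors (and, after re-learning, the structure itself) must absorb whatever dependencies that factor mediated. This is exactly the question posed above about the distribution of the remaining variables.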
If an ATM or a subset of ATMs in a network were to go offline due to ATM bombing, load shedding, planned maintenance, or similar events, how would this affect the rest of the ATMs in the same region? Would customers move to transact at the nearest ATM? What associations manifest in the data, and how would we go about detecting vulnerabilities in the network?

Figure 1.1: A bombed/vandalised Automated Teller Machine (The Citizen, 2017).

Previous studies examining influences between ATMs have primarily focused on forecasting transactions and optimizing various operational aspects, such as predicting daily cash demand, scheduling efficient cash replenishment, and strategically placing ATMs, in order to enhance overall network efficiency and customer service. Different techniques have been explored for predicting the daily replenishment of ATMs, namely regression approaches and Long Short-Term Memory networks. Simutus et al. [2007a] tried to solve the same problem by utilizing neural networks and support vector regression techniques, which performed considerably well at forecasting cash demand. Genevois et al. [2015] looked at the ATM location problem concurrently with cash management by utilizing a cash demand model for every ATM in a network.

Ajoodha [2018] used dynamic Bayesian networks to model the influence between stochastic processes by making use of score-based structure learning, devising an algorithm that recovers the influence relations between a set of stochastic processes. Ajoodha and Rosman [2020] provided a method for learning the influence between processes that are represented by hidden Markov models (HMMs) as opposed to dynamic Bayesian networks. In our study, we aim to explore how the influence among variables evolves when a variable or subset of variables is eliminated. This approach has not been previously explored in the existing literature that we have examined.
This study aims to explore the questions below by investigating the effect of node removal in Bayesian networks. The key research questions guiding this investigation are as follows:

1. What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network? This question examines how the probabilistic structure and information content of a Bayesian network change when certain nodes and their connections are removed. By quantifying the deviation in learned distributions using KL divergence, we evaluate the network's ability to recover its original properties post-mutilation.

2. What are the implications of varying structural complexity on the recovery of Bayesian networks and their KL divergence? To answer this, we analyze different types of Bayesian networks, namely tree-like, sparse, and dense structures, to determine how structural differences influence resilience to node removal. This comparison helps identify which network types are more robust to disruptions.

3. How can insights from Bayesian networks improve the operational efficiency and planning of ATM networks? Applying Bayesian modeling to ATM transaction networks allows us to study how transaction flows change when ATMs go offline. By identifying critical nodes and their influence, we provide recommendations for optimizing ATM placement and mitigating service disruptions.

Our principal contribution is to show how the joint probability distribution and influence structure of a Bayesian network change when a subset of nodes is removed. To accomplish this, we begin by creating a random Bayesian network, where the nodes are connected to form a directed acyclic graph and each node is assigned a Conditional Probability Distribution using a Bernoulli distribution. We then generate random samples to form a dataset, learn the probability patterns, and systematically remove nodes to re-learn the altered patterns.
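The steps of this pipeline can be sketched end-to-end on a toy example. Everything below is illustrative: the three-node chain X → Y → Z, its Bernoulli parameters, and the use of a product-of-marginals model to stand in for the re-learned network after the mediating node is removed are all assumptions made for the sketch, not the dissertation's actual experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain X -> Y -> Z with Bernoulli CPDs (illustrative parameters only).
P_X1 = 0.6                     # P(X = 1)
P_Y1 = {0: 0.2, 1: 0.8}        # P(Y = 1 | X = x)
P_Z1 = {0: 0.3, 1: 0.7}        # P(Z = 1 | Y = y)

def forward_sample(n):
    """Ancestral sampling in topological order X, Y, Z."""
    x = rng.random(n) < P_X1
    y = rng.random(n) < np.where(x, P_Y1[1], P_Y1[0])
    z = rng.random(n) < np.where(y, P_Z1[1], P_Z1[0])
    return x.astype(int), y.astype(int), z.astype(int)

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as arrays of cell probabilities."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

x, _, z = forward_sample(50_000)

# Empirical joint over the surviving variables (X, Z).
joint = np.array([[np.mean((x == i) & (z == j)) for j in (0, 1)] for i in (0, 1)])

# After mutilating (removing) Y, a model with no remaining edge between X and Z
# treats them as independent: the product of their marginals.
px = joint.sum(axis=1)         # P(X)
pz = joint.sum(axis=0)         # P(Z)
independent = np.outer(px, pz)

# A positive KL divergence measures the influence between X and Z that Y carried.
print(round(kl(joint.ravel(), independent.ravel()), 3))
```

The divergence comes out positive here because X and Z were dependent only through Y; on a network where the removed node mediates no influence, it would be near zero.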
This process allows us to quantify how network influence and structure are affected by node removal.

The remainder of the dissertation is structured as follows. Chapter 2 presents the background and related work on Bayesian learning, detailing the methods that have been used to solve similar problems. The research problem is stated in Chapter 3, along with the methodology used to address it. Chapter 4 focuses on the experimental results and discusses the limitations of our work. Chapter 5 concludes the dissertation.

Chapter 2

Background and Related Work

2.1 Introduction

The previous chapter introduced the general problem area and the specific problem area for this study. In this chapter we review work related to modeling the influence between stochastic processes using graphical models. Section 2.3 reviews how other authors have approached similar problems, focusing first on forecasting transaction volumes, secondly on optimizing the spread of ATMs, and lastly on how influence modeling has been used for stochastic processes.

2.2 Reasoning Patterns in Bayesian Modeling

In the context of graphical models, influence flows in a specific way, as demonstrated below on three nodes with indirect connections: X, Y, and Z. If we remove a node, this changes the influence relationships between this set of nodes. Take node Z as an example: the influence relationships to Z can flow from different directions, as demonstrated in Figure 2.1 below.

Figure 2.1: Different influence structures that depict the influence relationships of node Z with other nodes.

Given a Bayesian network structure as demonstrated in Figure 2.2, we know that if we remove node Z, it is well understood in graphical models that influence moves through the network via the conditional independence associations.
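The trail-activation rules described in this section can be checked numerically. As a minimal sketch with assumed Bernoulli parameters, the simulation below builds the classic v-structure A → C ← B and verifies that the two parents are independent while the common child is unobserved, and become dependent once it is observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Collider (v-structure) A -> C <- B with illustrative Bernoulli parents.
a = rng.random(n) < 0.5
b = rng.random(n) < 0.5
c = a ^ b  # child is a deterministic XOR of the parents, purely for a vivid effect

def dependence(x, y):
    """|P(y=1 | x=1) - P(y=1 | x=0)|: zero (up to noise) iff binary x and y are independent."""
    return abs(y[x].mean() - y[~x].mean())

# Trail blocked: with C unobserved, the parents are (near) independent.
print(dependence(a, b) < 0.02)
# Trail opened: conditioning on C = 1 couples A and B strongly.
print(dependence(a[c], b[c]) > 0.9)
```

Both checks print True: observing the child "explains away" one parent given the other, which is the one pattern where conditioning creates rather than removes dependence.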
Figure 2.2: A Bayesian network influence structure.

Figure 2.3 below depicts the new influence structure when node Z is removed from the Bayesian network above. We can guarantee independence between variables only when there is no active trail between them. Through direct and indirect connections, influence can still flow through a graphical model structure. The only exception is a v-structure, which blocks influence: the two parent nodes are conditionally independent of each other as long as the child node is not observed.

Figure 2.3: A Bayesian network influence structure with a removed node (Z).

2.3 Forecasting Techniques for ATM Networks

Accurate forecasting of ATM transactions is crucial for efficient financial planning and operational management. Various studies have explored time-series models, machine learning algorithms, and hybrid approaches to predict cash demand and transaction volumes. Traditional statistical models such as ARIMA and VARMAX have been widely used, while recent advancements leverage deep learning techniques such as Long Short-Term Memory (LSTM) networks and convolutional neural networks for improved accuracy. The integration of feature engineering and clustering methods has further enhanced forecasting precision, allowing financial institutions to optimize cash replenishment schedules and reduce operational costs.

The Auto-regressive Integrated Moving Average (ARIMA) is a forecasting technique suggested by Box and Jenkins [Jenkins 1970] that has enjoyed application in a variety of forecasting fields. An ARIMA model predicts each value of a time series as a linear combination of its past values and past errors [Kamini et al. 2014]. Given a time series model of the form ARIMA(p, d, q), Kamini et al.
[2014] proposed an ARIMA model, a Multi-Layer Feed-Forward neural network (MLFF), a Generalized Regression Neural Network (GRNN), and a Wavelet Neural Network (WNN) to forecast cash demand in ATMs. In their study, Kamini et al. [2014] used NN5 competition data containing daily time series of cash withdrawals from 111 ATMs in different parts of England over a period of 2 years. In the first method, they used traditional non-chaotic forecasting approaches, imputing missing values and deseasonalizing the data with a time lag of 7 days. In the second method, with non-traditional chaotic approaches, they reconstructed the phase space using chaotic parameters, namely embedding dimension and delay time, with the TISEAN tool. The study used the correlation dimension and false nearest neighbor methods to identify chaos in the data. A sequential split was performed on the data for each of the ATM centers, dividing it into 80% for training and 20% for testing. The best performing model under the first method was the GRNN, with an SMAPE of 23.16%. Under the second method, with the phase space reconstructed using a delay time of 2 and an embedding dimension of 2, the GRNN again yielded the best Symmetric Mean Absolute Percentage Error (SMAPE), at 14.713%.

Rafi et al. [2020] proposed a time series model for predicting cash demand in a network of ATMs for a specific financial institution in Pakistan. Instead of using the standard ARIMA approach as done by Kamini et al. [2014], Rafi et al. [2020] built a Vector Auto-Regressive Moving Average with Exogenous Variables (VARMAX) model using the transaction data of each ATM. The study used transaction data from 7 ATMs for the period June 2013 to December 2015 in a region with significant financial activity. The data was pre-processed for parsing into the model, and the Dickey-Fuller test was used to check for stationarity.
The split ratio was 70% for training and 30% for testing. The proposed model by Rafi et al. [2020] takes into account exogenous variables such as holidays, salary weeks, and weekdays. The data was made stationary using differencing techniques. The VARMAX model produced an average SMAPE of 31.4%. This is the same evaluation metric used by Kamini et al. [2014], but the result is significantly larger compared to theirs; it must be noted that a different forecasting model and dataset were used.

Another prominent forecasting technique is the Long Short-Term Memory network (LSTM), which has recently been shown to outperform state-of-the-art univariate time series forecasting methods [Bandara 2017]. Given a group of similar time series, Bandara [2017] forecast across time series databases using LSTMs. The LSTMs were applied to two benchmark data sets, namely the CIF2016 and NN5 data sets. The CIF2016 dataset consisted of monthly time series data composed of two subgroups: one related to the banking industry and the other an artificially generated series. The forecasts were performed on a total of 72 time series for different months-ahead horizons. The second dataset, NN5, contained 2 years of daily cash withdrawals at various automated teller machines (ATMs) located in the UK. A total of 111 time series for the ATMs was provided, with the requirement that forecasts be performed for a prediction horizon of 56 days ahead. The NN5 data comprised different real-world forecasting challenges, such as outliers, step-ahead forecasting, missing values, and multiple seasonalities. It must be noted that this is the same data set used by Kamini et al. [2014], but a different model was used to solve the same problem. The evaluation function used for both CIF2016 and NN5 was the Symmetric Mean Absolute Percentage Error (SMAPE).
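Since SMAPE is the common yardstick across these NN5 studies, a small reference implementation helps make the reported percentages concrete. This sketch uses the widespread (|A| + |F|)/2 denominator; published variants differ slightly, so cross-paper comparisons implicitly assume a shared definition. The withdrawal figures below are made up purely for illustration.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error, in percent.

    Uses the common (|A| + |F|) / 2 denominator; other variants exist,
    so reported scores are only comparable under the same definition.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return 100.0 * np.mean(np.abs(forecast - actual) / denom)

# Hypothetical daily withdrawal totals versus a naive flat forecast.
actual = np.array([120.0, 95.0, 110.0, 130.0])
forecast = np.array([100.0, 100.0, 100.0, 100.0])
print(round(smape(actual, forecast), 2))
```

For these made-up numbers the score is roughly 14.7%, in the same ballpark as the best NN5 results quoted above; a perfect forecast scores 0%.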
Bandara [2017] proposed a new LSTM method, called LSTM.Cluster, which yielded the following results: for CIF2016 the SMAPE was 10.5%, and for the NN5 dataset the SMAPE was 21.6%. In the study by Kamini et al. [2014], the GRNN used in the second method of their approach outperformed the LSTM.Cluster of Bandara [2017] by a margin of 6.887%.

Andrawis et al. [2011] presented models to forecast the same 111 time series explored by Kamini et al. [2014] and Bandara [2017]. Andrawis et al. [2011] used time series data spanning a period of 2 years with the aim of forecasting for a horizon of 56 days. The authors experimented with a combination of multiple forecasts and models for their study. Model pre-testing began with the important step of deseasonalizing the time series data. The models were then pre-screened by comparing computational intelligence and linear models. A total of 140 models were experimented with, each with different parameters, pre-screening, and post-processing methods. The models were trained on 97 weeks of data, and 8 weeks of data were used for testing. A total of 9 models achieved the best accuracy; contrasted with Kamini et al. [2014], this study used 4 more models for forecasting, which made it a robust study. The combination of models was selected based on their accuracy. To evaluate model accuracy, the Symmetric Mean Absolute Percentage Error (SMAPE) was used, which was the main evaluation metric for the competition.

Similarly, Jörg [2011] used the NN5 time series data with hybrid models to forecast for a 56-day horizon. Their contribution to the NN5 competition placed 6th out of a total of 19 teams. When the data was preprocessed, it exhibited a weekly seasonality, as evidenced by the autocorrelation function.
A total of 3 different models were trained: a neural network ensemble, an ensemble of nearest trajectory models, and lastly a model for the 7-day cycle. A weighted mean was calculated by combining all 3 models, with the weights inversely proportional to the forecast errors on the validation set. The evaluation metric used was the Symmetric Mean Absolute Percentage Error (SMAPE). The combined SMAPE for the validation set was 19.38%. Coyle et al. [2010] utilized self-organizing neural networks on the same NN5 forecasting time series data surveyed in the above literature. The self-organizing fuzzy neural network (SOFNN) is a specific type of fuzzy neural network [Leng and Prasad 2004]; it has five layers and can self-organize. Coyle et al. [2010] utilized structure learning for the learning process of the SOFNN as well as for parameter learning. The Symmetric Mean Absolute Percentage Error (SMAPE) was also used as a measure to evaluate the SOFNN predictive outcomes, and a mean SMAPE of 20.4% was achieved. Rajwani et al. [2018] used exploratory data analysis and LSTMs for ATM cash flow prediction. ATM transaction data was collected for ten ATMs distributed in the Karachi region. The data was collected over a period of 2 and a half years, amounting to 120 246 rows. A total of 60 features came with the data before pre-processing. After performing exploratory data analysis and a few transformations of the data, Rajwani et al. [2018] made a few interesting observations. Most transactions take place at the beginning and the end of the month, which can be attributed to the fact that people are most likely to be paid at the end or the start of the month. The split ratio for the experimentation was 70/30, with 70% for training and 30% for testing. This split ratio is similar to the ratio used by Rafi et al. [2020] in their study.
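The 70/30 and similar splits used in these studies are chronological rather than shuffled, since the test period must follow the training period in time. A minimal sketch (the function name and data are ours):

```python
def chronological_split(series, train_frac=0.7):
    """Split a time series without shuffling: the test set is the final
    portion, so it always lies after the training data in time."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

withdrawals = list(range(100))        # stand-in for 100 days of withdrawal totals
train, test = chronological_split(withdrawals, 0.7)
print(len(train), len(test))          # → 70 30
```

Shuffled splits would leak future information into training, which is why time series studies such as these always hold out the most recent observations.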
The LSTM model had a split ratio of 67/33, and 20 epochs were run. This yielded an MSE of 0.027 on the training dataset and an MSE of 0.028 on the testing dataset. The RMSE indicated that the predictions were 0.027 away from the actual target. Simutus et al. [2007a] explored the use of neural networks and support vector machines to forecast cash demand for ATMs. Simutus et al. [2007a] used simulated and real data for training the models and evaluating their performance. The data was gathered or simulated over a period of 2-3 years for training a reliable ANN and SVR on a total of 15 ATMs. Each ATM had a separate three-layer feed-forward neural network, with the input variables being coded values of weekday, day of the month, month of the year, holiday effect value, and average daily cash demand. The ANN was trained using the Levenberg-Marquardt optimization method, with the output variable being the ATM's cash demand for the next day. The data split ratio was 70% for training and 30% for testing. The MAPE of the ANN on the simulated data was 0.76%, while for the SVR it was 4.1%. Since there was a total of 15 ATM sites, the authors took the average MAPE for both models: the average MAPE for all 15 ATMs using the ANN model was 20.6%, while for the SVR it was 25.1%. Bhandari and Gill [2016] proposed an integrated forecasting technique which combines the back-propagation algorithm with a genetic algorithm. Whereas Simutus et al. [2007a] used ANNs and SVMs, Bhandari and Gill [2016] used a hybrid neural network for robustness. The proposed hybrid BP/GA model started with the collection of data, followed by normalization, feature extraction, training, and testing. A combination of 5 input variables was used: day number, week day, weekend day, salary effect, and holiday effect. Different population sizes were used with different numbers of hidden neurons and iterations.
The maximum population size achieved a Mean Absolute Percentage Error (MAPE) of 1.85%. In addressing the improvement of forecasting cash demand for ATMs, Arabani and Komleh [2018] used a Convolutional Neural Network on data provided by an Iranian banking institution. Daily ATM cash withdrawal data was used as the input dataset for forecasting. In addition to the ANN and SVR methods of Simutus et al. [2007a], Arabani and Komleh [2018] added a statistical method and CNNs. The statistical method starts with an F-test and proceeds until it produces a regression model. The remaining methods split the data for training and testing, and all four methods were evaluated using the Root Mean Squared Error (RMSE). The CNN produced the best results of all 4 methods, with an RMSE of 4.74% under cross-validation. Fallahfati et al. [2022] also studied the forecasting of ATM cash demand, but in their case they considered 2 scenarios: before and after the COVID-19 pandemic. Their aim was to accurately predict ATM cash demand by making use of statistical and machine learning models. The authors identified 3 categories of ATMs based on accessibility and environmental factors, namely residential districts, business districts, and shopping and recreation districts. A 3-year period of daily withdrawal data from 2017 to 2020 was used, split by the period in which the COVID-19 pandemic was observed. Parametric and non-parametric approaches were used for forecasting ATM cash demand: 5 parametric models and 4 non-parametric machine learning models. A total of 12 features were selected based on market dynamics in Tehran, Iran. To evaluate the accuracy of their predictions, the authors used the Mean Squared Error (MSE). The best performing model was the regression Random Forest (RF) approach, with an MSE of 0.0101. Simutus et al.
[2007a] explored the use of neural networks and support vector machines, and similarly, Ramírez and Acuña [2011] also made use of ANNs and SVMs. The authors utilized the NN5 competition data, with a focus on daily cash withdrawals for ATMs in England. A total of 691 data points from 30 ATMs were used for training, and 100 data points were used for testing purposes. The standard evaluation metric for the NN5 competition is the SMAPE, and the authors utilized the same metric to evaluate their models. The best performing model was the MLP, with a mean SMAPE of 21.37%. Gurgul and Suder [2016] studied the calendar and seasonal effects on daily withdrawals for ATMs managed by the Euronet network. The study was based on ATM data for two regions, namely Lesser Poland and Subcarpathia. A total of 222 ATMs were used for a period from January 2010 to December 2012. The authors were able to prove the existence of calendar effects on ATM withdrawals: withdrawals depended on the day of the week, the week of the month, and the month of the year. While the authors surveyed above mostly use machine learning approaches for forecasting ATM cash demand, Venkatesh et al. [2013] proposed the use of neural networks with the clustering of ATMs by grouping them according to features. The ATMs were clustered by making use of Levenshtein SAM distances. The authors made use of the NN5 cash withdrawal data for a period of 2 years. A total of four neural network architectures were used: Multi-Layer Feed Forward (MLFF), Wavelet Neural Networks (WNN), Group Method of Data Handling (GMDH), and the Generalised Regression Neural Network (GRNN). The GRNN achieved the best SMAPE of 23.16%, and the overall approach utilized by the authors outperformed the results of Andrawis et al. [2011]. Sarveswararao et al.
[2022] studied ATM cash demand for an Indian bank by modeling chaos and making use of deep learning. A total of 50 ATMs were used for the study; 28 of the ATMs were observed to exhibit chaos, and the remaining 22 did not. A statistical technique was combined with machine learning techniques, ranging up to deep learning methods such as Long Short-Term Memory (LSTM) neural networks. A total of 12 different techniques were utilized for forecasting ATM cash demand. The deep learning LSTM model outperformed all other models, achieving a mean SMAPE of 24.17%. As a predecessor to the work done by Simutus et al. [2007a], Simutus et al. [2007b] explored the use of a flexible neural network for forecasting ATM cash demand. The ANN was also trained by making use of the Levenberg-Marquardt optimization method, and the data was split at a ratio of 70% for training and 30% for testing. Simulation data was generated to depict the behavior of a typical ATM in the city of Kaunas, Lithuania. The Mean Absolute Percentage Error (MAPE) values varied between 15-20% for daily predictions and between 8-15% for weekly predictions. Suder et al. [2007] studied the effectiveness of forecasting methods for ATM withdrawals under different market conditions. Suder et al. [2007] also evaluated whether the pandemic-driven market conditions had an impact on the models for predicting withdrawals, similarly to what Fallahfati et al. [2022] explored in their study. The data was gathered from one of the largest ATM operators in Poland, covering a total of 61 ATMs with daily withdrawal data over a period of 4 years. A combination of statistical and machine learning methods was used for forecasting ATM cash withdrawals. The XGBoost technique outperformed all the methods tested, producing a combined mean SMAPE of 37.92% across the pre- and post-COVID-19 periods.
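Alongside SMAPE, the studies above report MAPE and RMSE. Both are straightforward to compute; a minimal sketch (function names and data are ours, and MAPE as written assumes no zero actual values):

```python
import math

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (undefined if any actual is 0)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Squared Error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

demand   = [120.0, 150.0, 90.0]
forecast = [126.0, 144.0, 99.0]
print(round(mape(demand, forecast), 2))   # errors of 5%, 4%, 10% → 6.33
print(round(rmse(demand, forecast), 2))   # → 7.14
```

Unlike SMAPE, MAPE is asymmetric and undefined when the actual value is zero, which matters for low-traffic ATMs with zero-withdrawal days; RMSE penalizes large errors more heavily and is reported in the original units.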
ATM demand forecasting focuses on the daily demand of each ATM, and not on the denominations dispensed [du Toit 2011]. A total of 18 ATMs were used to conduct the study on forecasting the demand for denominations. A client normally withdraws an amount at an ATM and receives different denominations that make up the amount. du Toit [2011] focused on forecasting the aggregated denominations instead of forecasting the demand for each denomination. The ATM data was in the form of a time series and exhibited multiple seasonal patterns of different lengths. The empirical daily dispensing data was collected over a period of 3 years, from January 2007 to November 2009. du Toit [2011] employed a variety of forecasting techniques, namely regression, decomposition, exponential smoothing, and Box-Jenkins methods. du Toit [2011] used the Mean Absolute Percentage Error (MAPE), and the results for each technique were tabulated for comparison. The author developed models for daily, weekly, and monthly forecasts. The daily forecasts saw an improvement of 21.67%, with the MAPE being 58.01%. The new weekly forecast had a MAPE of 10.55%, an improvement of 18.91% over the previous forecasting technique. Lastly, the MAPE for the monthly forecasts was 3.17%, an improvement of 42.36%. Keyter [2004] used regression analysis to forecast transaction volumes on ATM devices for a South African bank. A variety of variables were taken into account before regression was performed, because these variables influence the transaction volumes differently. A final sample of 86 ATM sites was taken from a population of 1800 ATM sites. Different combinations drawn from a total of 18 features were tested until the combination producing the highest R² was found. ATMs can be located at different types of sites, namely shopping centers, industrial areas, garages, convenience stores, and business nodes.
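The R² statistic that Keyter [2004] maximized over feature combinations measures the fraction of variance in transaction volumes explained by the regression. A minimal sketch (the function name and data are ours):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus the residual sum of squares
    over the total sum of squares about the mean of the actual values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

print(round(r_squared([10, 20, 30], [12, 20, 28]), 2))  # → 0.96
```

An R² of 0.85, as reported for the shopping-center sites, means the chosen feature combination explained 85% of the variance in transaction volumes.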
The results for the shopping centers used 3 features, and the average residual was 29% with R² = 0.85.

Table 2.1: A table comparing different forecasting techniques that were used by other authors.

Author | Data | Models | Accuracy
[Kamini et al. 2014] | 111 time series data | GRNN | SMAPE = 14.713
[Bandara 2017] | 111 time series data | LSTM.Cluster | SMAPE = 21.6
[Andrawis et al. 2011] | 111 time series data | Combined (all 9 models) | SMAPE = 18.95
[Rafi et al. 2020] | Transaction data | VARMAX | SMAPE = 31.4
[Keyter 2004] | Data for 86 ATMs | Regression analysis | R² = 0.85
[Rajwani et al. 2018] | Transaction data | LSTM | MSE = 0.028
[Simutus et al. 2007a] | Transaction data | ANN | MAPE = 20.6
[Simutus et al. 2007a] | Transaction data | SVR | MAPE = 25.1
[du Toit 2011] | Denomination data | Fuzzy ARIMA | MAPE = 3.17
[Jörg 2011] | 111 time series data | Hybrid models | SMAPE = 19.38
[Coyle et al. 2010] | 111 time series data | SOFNN | SMAPE = 20.4
[Bhandari and Gill 2016] | 111 time series data | Hybrid ANN | MAPE = 1.85
[Arabani and Komleh 2018] | Transaction data | CNN | RMSE = 4.74
[Fallahfati et al. 2022] | Transaction data | Regression RF | MSE = 0.0101
[Ramírez and Acuña 2011] | Transaction data | MLP | SMAPE = 21.37
[Venkatesh et al. 2013] | Transaction data | GRNN | SMAPE = 23.16
[Sarveswararao et al. 2022] | Transaction data | DL LSTM | SMAPE = 24.17
[Simutus et al. 2007b] | Transaction data | ANN | MAPE = 14-20
[Suder et al. 2007] | Transaction data | XGBoost | SMAPE = 37.92

2.4 Optimizing Cash Flow and Resource Allocation

Optimizing cash flow and resource allocation in ATM networks is essential because banks must balance the need to minimize costly operations such as cash transportation, insurance, and cash freezing costs with the imperative of ensuring that ATMs remain adequately stocked to prevent service disruptions and customer dissatisfaction.
This presents a multi-objective optimization problem (MOP), which can be defined as an optimization problem with multiple objectives measured with different performance functions, usually in conflict with each other, together with a set of restrictions [Rodriguez and Lozano 2008]. Li et al. [2020] proposed an end-to-end framework for solving multi-objective optimization problems using deep reinforcement learning (DRL), termed DRL-MOA. The MOP is decomposed into scalar optimization sub-problems, and each sub-problem is then modeled as a neural network. The Pareto-optimal solutions were obtained through the trained neural network models. The well-known multi-objective traveling salesman problem was solved using DRL-MOA by modeling the sub-problem as a Pointer Network. Armenise et al. [2012] investigated the optimization of ATM cash using a genetic algorithm in order to produce optimal load strategies. A pool of 30 ATMs was used as a test bench, with the features being location, position, cash capacity, and usage. Similarly to how we want to reduce financial costs, the paper focused on reducing the costs of unused stock in ATMs. The implemented algorithm made use of JENES, an open-source Java library for genetic algorithms. The standard deviation, average, and p-value were used as evaluation functions for the various combinations, and the results were tabulated for each configuration in the study. Cisternas-Vera et al. [2016] looked at the impact of mobile banking on reshaping the bank branch network for a US-based bank. Taking geo-coded transaction data, a dynamic structural model was developed that represented consumers' preferences for online and physical channels. The model was used to optimize the branch network in terms of capacity, amenities, location, and number of branches.
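The Pareto-optimal solutions that DRL-MOA and the other multi-objective approaches search for are simply the non-dominated trade-offs between conflicting objectives. A minimal sketch of the dominance test (the function names and the cost pairs are hypothetical):

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b (minimization): a is no worse
    on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated objective vectors."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical (transport cost, idle-cash cost) pairs for candidate strategies.
candidates = [(4, 9), (5, 5), (7, 3), (6, 6), (8, 8)]
print(pareto_front(candidates))  # → [(4, 9), (5, 5), (7, 3)]
```

Here (6, 6) and (8, 8) are dominated by (5, 5), while the remaining three points are incomparable trade-offs; a MOP solver returns such a front rather than a single optimum.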
The data comprised a sample of more than 500 000 accounts with more than 1.7 billion transactions across all channels from June 2007 to June 2015. A combination of discrete choice dynamic programming and linear programming was used to develop the model. The authors in the above studies solved the optimization problem using various techniques. Bilir and Doseyen [2018] focused on the optimization of ATM and branch cash operations using an integrated cash requirement forecasting and cash optimization model. The aim of the study was to optimize the cash supply chain for ATMs and branches of a mid-sized bank in Turkey. In optimizing the cash supply chain, the main objective of the model was to minimize idle cash levels without decreasing customer satisfaction. The optimization phase of the study formulated an optimal transfer schedule for branches and ATMs by making use of an integer formulation. The aim of the optimization model was to minimize total cost, which had two parts: transfer costs and the penalty fee for maintaining idle cash. Broda et al. [2014] presented an approach for optimizing fill-in strategies for ATMs for a bank in Serbia. Credit Agricole Bank aimed to minimize costs, listed as cash freezing costs, transportation costs, and insurance costs, while ensuring customer service was not impacted. The data used were daily transactions for a total of 40 ATMs in the region of Novi Sad. The authors' approach was to first model daily cash demand using a combination of time series analysis, linear regression, and the compound Poisson process. The authors then optimized the ATM filling strategies using a simple algorithm that performs an exhaustive search over the feasible solution space. Ekinci et al. [2014] attempted to optimize ATM cash replenishment strategies by utilizing group-demand forecasts. The data for a total of 152 ATMs in Istanbul was used, covering a period of 2 years and 6 months.
Based on the longitude and latitude of each ATM, the ATMs were grouped via an equation which generated a total of 27 groups. Cash withdrawals were averaged daily for each group, and forecasting techniques were then applied to each group. An optimal time between two replenishments was found by using an objective function with constraints. Nigmatuli and Chaganva [2022] created a cash management system for ATMs and information payment terminals. The data was sourced from Kaggle, which is open source, and represented cash withdrawals over a period of 6 years. The study was conducted on a total of 3 ATMs in Bangalore, India. A combination of forecasting models was used to obtain a sufficiently good level of correct forecasts: the triple exponential model, the singular spectrum analysis model, and a recurrent neural network with Long Short-Term Memory. A 2-week forecast horizon was used for each ATM, and discrete optimal control was used for optimization. Similarly, Lazaro et al. [2017] attempted to improve cash logistics for bank branches in their case study, using an approach similar to that of Nigmatuli and Chaganva [2022]. Lazaro et al. [2017] attempted to optimize cash logistics by first forecasting transactions using machine learning and then applying a robust optimization model. The data was sourced from a national bank for a period of 4 years. The generated cash demands were then fed into a defined integer programming problem for optimization. Dilijonas et al. studied the problems of management and optimization of ATM network systems. A flexible artificial neural network approach was used for cash demand forecasting. Each machine was assigned a separate 3-layer feed-forward neural network, and these were trained using the Levenberg-Marquardt optimization method. A combination of multi-agent technologies and ANNs was used for optimization.
The authors found that a combination of the technologies gave an advantage for managing cash optimization dynamically in complex systems. The model resulted in a 20-30% average reduction in cash. Serengil and Ozpinar [2019], much like Nigmatuli and Chaganva [2022] and Lazaro et al. [2017], utilized cash flow prediction in combination with an optimization technique. The data was sourced from a Turkish bank for a total of 6500 ATMs and spanned a period of years starting from 2013. An artificial neural network with a total of 15 features was used to forecast the next day's demands. The cash demand forecasts from the neural network model were then used to compute the optimal replenishment strategy via a cost function based on linear optimization. Thanh et al. [2023] viewed the ATM cash replenishment problem as a multi-objective logistics optimization problem. The authors extended the Vehicle Routing Problem to formulate a Multi-objective Optimization Problem (MOP). The authors stated the problem mathematically as a complete directed graph with vertices. Constraints were then applied to the model, and these needed to be satisfied in order to arrive at an optimal load strategy. The OR-Tools package was used for optimization by initializing variables based on the constraints of the MOP. The testing and evaluation of the model was done using data from a bank in Hanoi, Vietnam. Van Anholt and Vis [2010] proposed a novel dynamic approach that factors in both ATM cash demand forecasting and ATM replenishment in an integrated way. The authors proposed decomposition for the forecasting component of the integrative approach; in this case, decomposition was a modified linear regression technique, and based on the demonstrated results, the decomposition outperformed the artificial neural network approach. Ghodrati et al. [2013] studied the ATM cash management problem by utilizing a genetic algorithm.
They sourced their data from an Iranian bank, covering a period of 2 years for 164 ATMs. The ATMs were clustered into 3 groups: high, medium, and low. A genetic simulation optimization algorithm was implemented as a solution. Chotayakul et al. [2013] formulated the cash inventory problem for ATMs as a Mixed Integer Program (MIP). They developed it into the path formulation in deriving a near-optimal solution to the problem. The ATM network studied came from a bank based in Thailand. Bati and Gozupek [2017] devised an integer linear program (ILP) that was able to jointly optimize routing and cash management for a new generation of ATMs. Their aim was to minimize costs such as idle cash costs and costs associated with logistics. A polynomial-time heuristic algorithm was implemented using CPLEX. Kurdel and Sebestyenova [2013] viewed the routing optimization problem as a dynamic vehicle routing problem (VRP), which is similar to what was explored by Thanh et al. [2023]. Kurdel and Sebestyenova [2013] solved the formulated problem using a parallel genetic algorithm. Kiyaei and Kiaee [2021] studied ATM cash replenishment planning by making use of a deep Q-network for optimization. This model used a combination of deep learning and Q-learning. In their study, Kiyaei and Kiaee [2021] devised two components for solving the problem: the first component learns the dynamics of the cash demand, while the second uses Q-learning to learn the action-value function. Mubiru and Nalubowa Ssempijja [2024] attempted to model and optimize cash load strategies for ATMs under stochastic demand. A total of ten ATMs at commercial hubs in the city of Kampala were used for the study. Each ATM's data was based on hourly transactions over a period of 4 months. The authors developed an optimization model which used the Markov decision approach for cash loading strategies. Li et al.
[2009] tackled a unique problem of selecting the best site for a new ATM by using the Particle Swarm Optimization algorithm. Their study focused on a bank based in Tianjin, China, and they surveyed a total of 50 ATMs. In their preliminary results, Li et al. [2009] demonstrated that they can address the site selection problem while at the same time forecasting transactions for the same site.

Table 2.2: A table comparing different optimization techniques that were used by other authors.

Author | Data | Models
[Armenise et al. 2012] | Transaction data | Genetic algorithm
[Cisternas-Vera et al. 2016] | Transaction data | Linear and dynamic programming
[Bilir and Doseyen 2018] | Transaction data | Integrated model
[Broda et al. 2014] | Transaction data | Exhaustive search
[Ekinci et al. 2014] | Transaction data | Linear programming model
[Nigmatuli and Chaganva 2022] | Transaction data | Optimal control
[Lazaro et al. 2017] | Transaction data | Integer programming
[Dilijonas et al.] | Transaction data | Multi-agent technologies
[Serengil and Ozpinar 2019] | Transaction data | Cost function and linear optimization
[Thanh et al. 2023] | Transaction data | Pareto-based multi-objective learning algorithms
[Van Anholt and Vis 2010] | Transaction data | Decomposition model
[Ghodrati et al. 2013] | Transaction data | Genetic algorithm
[Chotayakul et al. 2013] | Transaction data | Mixed Integer Program (MIP)
[Bati and Gozupek 2017] | Transaction data | Heuristic CPLEX
[Kurdel and Sebestyenova 2013] | Transaction data | Parallel genetic algorithm
[Kiyaei and Kiaee 2021] | Transaction data | Q-learning
[Mubiru and Nalubowa Ssempijja 2024] | Transaction data | Markov decision approach
[Li et al. 2009] | Transaction data | Particle Swarm Optimization

2.5 Related Work

2.5.1 Mapping Influence in Complex Systems

Understanding influence relationships in probabilistic graphical models is essential for analyzing the behavior of interconnected systems.
Influence modeling within Bayesian networks has been applied across domains to capture dependencies and quantify structural changes. Research in this area primarily employs structure learning techniques, including score-based and constraint-based methods, to recover underlying influence patterns from data. These approaches have been used in diverse applications, such as modeling stochastic processes, evaluating resilience in complex networks, and optimizing decision-making frameworks. By leveraging probabilistic reasoning, influence modeling provides valuable insights into how system dynamics evolve under various conditions, including node removal and disruptions. Influence diagrams and belief networks are defined as directed graphical models for representing models of probabilistic reasoning and decision making [Shachter 2007]. Important relationships can be captured easily with the graphical representation of influence diagrams and belief networks. Influence diagrams are better suited to capture the conceptual relationship between inference and decision making [Biedermann and Taroni 2023]. Influence networks are defined as Bayesian networks whose probabilities are approximated by expert-provided influence constants [Zaidi et al. 2010]. Influence relations in graphical models refer to the dependencies or causal connections between variables represented within a graph [Koller and Friedman 2009]. Ajoodha [2018] attempted to recover the influence relations by using a complete algorithm for a collection of partially observable multi-dimensional stochastic processes. Score-based structure learning was used to identify the optimal structure by applying a scoring metric to evaluate structural configurations in relation to the data. Ajoodha [2018] expands on score-based structure learning concepts for tracking influence among stochastic processes.
Firstly, Ajoodha [2018] decomposed the distributions represented by stochastic processes into a series of temporal models. Then, an assembly and a scoring function were defined to assess the quality of potential influence networks. This became a combinatorial optimization problem, which required navigating the search space to find the optimal influence network; the best-fitting structure for the training data was then returned. Ajoodha and Rosman [2020] studied the influence between stochastic processes that are represented by hidden Markov models (HMMs). A structure learning approach was also used in this study, and the candidate influence structures were dynamic influence networks (DINs), which model the relationships between the stochastic processes. In a different study, Ajoodha and Rosman [2021] used constraint-based structure learning to recover the direct influence between hierarchical dynamic Bayesian networks. This study still used structure learning, but with the focus on hierarchical dynamic Bayesian networks. The learning tasks were compared using KL-divergence. In another study, Ajoodha and Rosman [2017] again made use of score-based structure learning in an attempt to recover the influence structure, but in this case naïve Bayes models were used instead of hidden Markov models or hierarchical dynamic Bayesian networks. A total of four parameter or structure learning tasks were explored to recover the ground truth distribution. Mohammadi and Wit [2014] concentrate on Bayesian structure learning in Gaussian graphical models, addressing both decomposable and non-decomposable cases. They introduced a new Bayesian framework for determining Gaussian graphical models and developed a birth-death Markov chain Monte Carlo (BDMCMC) algorithm to carry out both structure learning and parameter learning. To evaluate the performance of the BDMCMC algorithm, the KL-divergence was used on the precision matrix. Chen et al.
[2020] introduced a Bayesian network-based approach to model the nitrogen utilization efficiency of dairy cows. Historically, previous authors utilized statistical models to predict manure N excretion. Chen et al. [2020] used a dataset from the digestibility studies of the Agri-Food and Biosciences Institute in Northern Ireland. The Growth-Search constraint-based structure learning approach was used to recover the influence between the variables. The Bayesian network model was able to capture the relationships between the variables, and it also established a causal influence model. In another study, Zheng et al. [2016] used Bayesian networks to model methane emissions from milking dairy cows. Previously, different regression-based models were applied to examine the findings, which is similar to how Chen et al. [2020] extended from statistical models to Bayesian networks. The dataset was also collected from the Agri-Food and Biosciences Institute in Northern Ireland. The evaluation took place on a dataset of 934 milking dairy cows. Structure learning was done using Hill-Climbing, a score-based approach. Matejová et al. [2023] used influence modeling to study the factors associated with atherosclerosis. Cardiovascular diseases have many factors that are associated with atherosclerosis [Matejová et al. 2023]. A combination of statistical methods and different machine learning techniques was utilized for the study. The Bayesian networks in this instance were used to study the relationships between the different factors. Tang et al. [2020] evaluated the resilience of urban transportation systems by utilizing a Bayesian network model. The study focused on investigating the resilience of road transportation for 4 cities in China: Beijing, Tianjin, Shanghai, and Chongqing, for the period 1998 to 2017. A hierarchical Bayesian network model was proposed to evaluate the resilience of these systems.
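Several of the studies above (e.g. Ajoodha and Rosman [2021]; Mohammadi and Wit [2014]) compare learned and reference distributions using KL-divergence, the same measure used later in this dissertation. For discrete distributions it can be sketched as follows (the example distributions and the function name are hypothetical):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) for discrete distributions given as
    probability lists; measures, in nats, the loss from using Q in
    place of the reference distribution P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]          # distribution before node removal (hypothetical)
q = [0.8, 0.2]          # distribution after removal and re-learning
print(round(kl_divergence(p, q), 4))  # → 0.2231
```

A divergence of zero indicates identical distributions; larger values indicate that the learned influence structure has drifted further from the reference. Note that KL-divergence is asymmetric, so D(P || Q) generally differs from D(Q || P).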
To test the proposed model, Tang et al. [2020] used backward inference, forward inference, and sensitivity analysis, and the results demonstrated reasonable values. Hosseini et al. [2016] assessed system resilience by also utilizing Bayesian networks, similar to the study done by Tang et al. [2020]. Hosseini et al. [2016] focused on a case study of a sulfuric acid manufacturing plant that had 2 suppliers of sulfur. The developed Bayesian network model was able to capture various disruptions ranging from power outages and technical errors to hurricanes. Hosseini et al. [2016] validated their model by making use of sensitivity analysis, the same approach that was used by Tang et al. [2020] for evaluating their model. Belief propagation was performed for inference purposes. Jaradat et al. [2020] modeled the inter-dependencies between critical features by making use of a Bayesian network. Their case study focused on an inland waterway port and the surrounding supply chain network. A Bayesian network captured the interdependency between the factors. The results were probed by making use of belief propagation and sensitivity analysis. Hossain et al. [2019] conducted another study, for an interdependent electrical infrastructure system. This study also used a Bayesian framework to assess system resilience, as in Jaradat et al. [2020]. The framework addressed a range of possible disruptions and proposed solutions to mitigate their consequences. A case study of the Washington, DC electrical infrastructure system was used. To evaluate resilience, different measures were used: sensitivity analysis, forward propagation, information theory, and backward propagation. Liu et al. [2020] presented a robust dynamic Bayesian network method for supply chain disruption risk estimation. The authors proposed a new and general nonconvex programming model, and this was compared with the classic DBN approach on a case study. Kammouh et al.
[2020] evaluated the resilience of engineering systems by utilizing a probabilistic framework. The authors developed two approaches that could handle static and dynamic systems. A Bayesian Network (BN) was used for static engineering systems and Dynamic Bayesian Networks (DBN) were used for dynamic systems. The paper applied the methodology to two illustrative systems, one evaluating resilience for a transportation system and the other a case study for Brazil.

Table 2.3: A table comparing different influence modeling techniques that were used by other authors.

[Ajoodha 2018]: Score-based structure learning
[Ajoodha and Rosman 2020]: Hidden Markov Models
[Ajoodha and Rosman 2021]: Constraint-based structure learning (hierarchical dynamic Bayesian networks)
[Ajoodha and Rosman 2017]: Score-based structure learning
[Mohammadi and Wit 2014]: Birth-death Markov chain Monte Carlo (BDMCMC)
[Chen et al. 2020]: Growth-Search
[Zheng et al. 2016]: Bayesian modeling
[Matejová et al. 2023]: Bayesian modeling
[Tang et al. 2020]: Bayesian modeling
[Hosseini et al. 2016]: Bayesian modeling
[Jaradat et al. 2020]: Bayesian modeling
[Hossain et al. 2019]: Bayesian modeling
[Liu et al. 2020]: Dynamic Bayesian network
[Kammouh et al. 2020]: Bayesian network and dynamic Bayesian network

2.5.2 Conclusion

The related work explored in this chapter spans multiple domains, including forecasting ATM transaction volumes, optimizing cash flow and resource allocation, and mapping influence in complex systems. Each of these areas provides crucial insights that contribute to understanding ATM networks as probabilistic graphical models. Forecasting techniques have demonstrated various approaches to predicting cash demand and transaction volumes, offering methods that aid in resource planning and operational efficiency.
Time-series models, machine learning algorithms, and hybrid forecasting techniques have been extensively studied to improve accuracy and adaptability under varying market conditions. These methodologies set a foundation for understanding transaction patterns, which is essential when analyzing network resilience following disruptions. Optimization studies have focused on the logistical and financial challenges of ATM networks, emphasizing how resource allocation strategies can minimize operational costs while maintaining service availability. Various optimization techniques, including mixed-integer programming, heuristic algorithms, and simulation-based approaches, have been employed to enhance cash replenishment scheduling and ATM placement decisions. These strategies highlight the significance of balancing cash flow efficiency with customer demand to ensure uninterrupted service. The discussion on influence modeling extends these considerations by examining how interdependencies between network nodes impact overall system behavior. Studies in this area provide a framework for understanding the propagation of disruptions across networks and how probabilistic graphical models, such as Bayesian networks, can be employed to analyze these interactions. Research on network resilience, influence maximization, and cascading failures in complex systems further contextualizes the importance of modeling ATM networks as interconnected structures subject to perturbations. By integrating these themes, this chapter establishes a comprehensive foundation for the research problem addressed in this dissertation. The intersection of forecasting, optimization, and influence modeling provides a structured approach to investigating how ATM networks respond to node removal. The insights gained from prior work inform the methodology adopted in subsequent chapters, where the impact of node removal on Bayesian networks is analyzed.
This study contributes to the broader research landscape by applying Bayesian learning techniques to quantify structural and probabilistic changes in ATM networks, bridging the gap between theoretical modeling and real-world financial systems.

Chapter 3

Methodology

This chapter outlines the methodological approach for analyzing the impact of node removal on Bayesian networks representing ATM networks. Our study aims to develop a comprehensive framework for understanding how the removal of nodes, which represent ATMs, affects the overall network structure and its probabilistic dependencies. We begin by formulating the problem mathematically, modeling the ATM network as a directed acyclic graph (DAG) with conditional probability distributions (CPDs) assigned to each node. This formulation enables us to capture the complex interactions and dependencies between transaction volumes at different ATMs. We then present the research objectives, accompanied by a detailed experimental design to investigate the effects of node removal on the joint probability distribution and the recovered network structure. Our methodology incorporates structure learning and parameter estimation techniques to recover the network post-removal, and Kullback-Leibler (KL) divergence to quantify deviations from the original distribution. By modeling these dynamics, we aim to provide insights into the resilience and stability of ATM networks under varying disruption scenarios. This approach contributes to the broader understanding of influence relationships in probabilistic graphical models, particularly in applications requiring robust and adaptive network architectures.

3.1 Problem Formulation

Many problems in the field of science can be modeled as stochastic processes. Stochastic processes are used to describe the evolution of random variables over time or space.
Stochastic processes can be applied to model and analyze random phenomena, giving a clear understanding of how complex systems interact under uncertainty. We can learn the influence between these variables if we know that they interact with each other. Suppose we have a set of variables X = {X1, ..., Xn} and we have a joint distribution PG(X ∣ Y) that represents the relationship between them. A Bayesian network that represents our problem as a tuple B = (G, Θ) can be generated, where G is a directed acyclic graph (DAG) whose nodes represent the set of variables X = {X1, ..., Xn} and Θ represents the parameters of the probability distributions associated with each node in the graph. These parameters define how likely certain values are for a given node, conditioned on its parents in the graph. Suppose that a subset of the variables in this set X = {X1, ..., Xn} had to be mutilated; how would this affect the joint probability distribution PG(X ∣ Y) that represented the relationship between these variables? Firstly, we will attempt to learn the parameters of a joint probability distribution which describes the relationship between the variables X = {X1, ..., Xn}. We will then attempt to learn the parameters of a probability distribution which describes the relationship between the variables X = {X1, ..., Xn} when a subset of these variables is removed. A probabilistic graphical model can be used to represent the dependencies between different variables; with this graphical model we can then perform inference and learning for knowledge discovery.

3.2 Research Questions and Experimental Setup

Based on our problem formulation, we looked at three research questions:

Research Question 1: What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network?

To address this question, we systematically mutilated Bayesian networks by removing nodes and the edges connected to them.
For each modified network, we performed structure learning using a score-based approach, estimated the CPDs with Bayesian estimation, and compared the recovered network to the original ground truth distribution. The KL divergence served as the primary metric for quantifying the deviation between the original and recovered distributions. By experimenting with varying percentages of node removal and analyzing the results, we identified patterns and thresholds where network recovery begins to degrade significantly.

Research Question 2: What are the implications of varying structural complexity on the recovery of Bayesian networks and their KL divergence?

We explored this question by generating synthetic Bayesian networks with distinct structural configurations, including tree-like, sparse, and dense structures. For each structure type, we evaluated the performance of structure learning algorithms under different sample sizes and levels of node removal. By comparing the KL divergence across these configurations, we assessed how structural complexity influences the network’s recoverability. This analysis highlighted the trade-offs between complexity and robustness, providing insights into the resilience of different network designs under disruptions.

Research Question 3: How can insights from Bayesian networks improve the operational efficiency and planning of ATM networks?

To tackle this question, we modeled ATM networks as Bayesian networks, where nodes represented individual ATMs, and edges captured dependencies between their transaction volumes. Using historical transaction data, we trained and validated these models, then simulated disruptions (e.g., ATM outages) by removing nodes. By analyzing the resultant changes in network influence and identifying critical nodes or clusters, we derived actionable insights for improving operational efficiency.
This includes optimizing cash replenishment schedules, predicting the impact of outages, and planning the deployment of new ATMs to enhance network resilience and customer satisfaction.

3.2.1 Experimental Setup

In this study, we aimed to investigate the effect of removing a subset of variables from the set X = {X1, ..., Xn} on the joint probability distribution of the graph, which can be represented as PG(X ∣ Y). We modeled the relationship between the variables using a Bayesian network as a tuple B = (G, Θ), where G is a directed acyclic graph (DAG) whose nodes represent the set of variables X = {X1, ..., Xn}. We generated a random ground truth distribution for this Bayesian network by creating a Bayesian structure and assigning conditional probability distributions (CPDs) to each node in the network. The structure of the Bayesian network was generated by randomly specifying the number of nodes and randomly connecting them with edges to form a directed acyclic graph (DAG). Each node in the network had a CPD assigned based on its parents in the graph. We generated the CPDs randomly using a distribution that best described the nature of our variables and met the assumptions of our model. We considered a variety of distributions, including the Bernoulli, Gaussian and Dirichlet distributions, and settled on the Bernoulli distribution because it best met the assumptions of our model and described the nature of our chosen variables. Once the structure of the Bayesian network and the CPDs were defined, we generated random samples from the Bayesian network to create a dataset. Suppose that the learned probability distribution from structure learning of our generated Bayesian network is represented by PG(X ∣ Y), and suppose that we remove a node or a subset of nodes from the graph and learn the new probability distribution as PĜ(X ∣ Y).
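The generation pipeline described above (random DAG construction, random Bernoulli CPDs, then sampling) can be sketched in plain Python. This is an illustrative sketch under simplified assumptions, not the exact implementation used in this study; the node count and parameters are invented for demonstration:

```python
import random

def random_dag(n_nodes, edge_prob, rng):
    """Random DAG: only allow edges i -> j with i < j, which guarantees acyclicity.
    The DAG is stored as a parent list: node -> list of parent nodes."""
    return {j: [i for i in range(j) if rng.random() < edge_prob]
            for j in range(n_nodes)}

def all_parent_configs(k):
    """All 2^k binary configurations of k parents."""
    return [tuple((i >> b) & 1 for b in range(k)) for i in range(2 ** k)]

def random_bernoulli_cpds(dag, rng):
    """One Bernoulli parameter P(node = 1 | parent config) per parent configuration."""
    return {node: {cfg: rng.random() for cfg in all_parent_configs(len(parents))}
            for node, parents in dag.items()}

def forward_sample(dag, cpds, rng):
    """Sample nodes in topological order (0..n-1 by construction)."""
    sample = {}
    for node in sorted(dag):
        cfg = tuple(sample[p] for p in dag[node])
        sample[node] = 1 if rng.random() < cpds[node][cfg] else 0
    return sample

rng = random.Random(0)
dag = random_dag(6, 0.3, rng)       # small toy network; the study used 50-100 nodes
cpds = random_bernoulli_cpds(dag, rng)
dataset = [forward_sample(dag, cpds, rng) for _ in range(1000)]
```

Restricting edges to run from lower-numbered to higher-numbered nodes guarantees acyclicity, and sampling nodes in that order is exactly forward (ancestral) sampling.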
We can make use of our generated dataset from the random samples to perform probabilistic queries that will help us gauge how removing a node or a subset of nodes influences other nodes in the network. We chose Bernoulli conditional probability distributions (CPDs) for each node because, after quantizing ATM transaction volumes into binary states (e.g., “high-volume” vs. “low-volume” activity), the presence or absence of significant transaction patterns is most naturally modeled as a two-state random variable. Bernoulli CPDs thereby allow us to capture essential fluctuations in ATM usage while maintaining tractable inference: exact likelihood calculations and structure-learning algorithms (hill-climbing, tree search) remain computationally efficient even for networks of hundreds of nodes. Furthermore, focusing on binary CPDs simplifies the interpretation of KL-divergence results (any change in node behavior manifests directly as shifts in binary probabilities) without confounding effects from multi-state parameterizations. We acknowledge that real ATM transaction data often exhibit richer, continuous or count-based dynamics; in future extensions, Gaussian or Poisson CPDs could be incorporated once sufficient real-world data is available, but the Bernoulli framework provides a robust proof of concept for analyzing network resilience under node removal.

3.3 Application Domain

Each variable can be represented as a node in a graphical model, with the associated parameters attached to each node. Having represented the problem as a probabilistic graphical modeling problem, we will first represent the problem statement as a graphical model, then generate random Bayesian networks for the true distribution, and finally show how inference can be done on the model for knowledge discovery.

3.3.1 Modeling Influence Relationships

Suppose we have this set of variables X = {X1, ..., Xn} and we have a joint distribution that represents the relationship between them.
We can formally define a Bayesian network that represents our problem as a tuple B = (G, Θ) where:

• G is a directed acyclic graph (DAG) whose nodes represent the set of variables {X1, ..., Xn} from above.
• Θ is a set of parameters that specify the conditional probabilities θ_{Xi ∣ Pa(Xi)} for each node Xi given its parents Pa(Xi) in G.

Figure 3.1: An influence structure between stochastic processes. The red arrows indicate the influence between these processes while the solid arrows indicate the parameters of each node.

Let P1 in Figure 3.1 represent a node in the network, which is made up of n different nodes. If we assume that the future is conditionally independent of the past given the present, the system maintains the Markov property, allowing us to make the Markov assumption. If the structure and the parameters of the system do not change under any transformations, the system maintains the invariance property, allowing us to make the stationarity assumption. Our ground truth distribution was generated by following the process described in the experimental setup section. We first defined the graph structures, choosing between 50 and 100 variables as the nodes of our graph structure, a directed acyclic graph (DAG) as represented by Figure 3.1. We explored tree-like, sparse and dense structures for our ground truth distributions. We generated random Bernoulli distributions that best described our graph in Figure 3.1. These distributions specified the conditional probability distribution (CPD) of each node in the graphical structure. These then became our random structures with random parameters that described our ground truth distribution. Once we had the structure of the Bayesian network and the CPDs defined, we used forward sampling to generate random samples from the Bayesian network and created a dataset that we used for inference and learning. For this study, we explored the following.
• We explored the following structures: tree-like, dense and sparse structures.
• We explored the Bernoulli distribution for our Conditional Probability Distributions.

3.4 Mutilating the Network

In the experimental section, we mentioned that we have a set of variables X = {X1, ..., Xn} and a joint distribution PG(X ∣ Y) that represents the relationship between them, and that, if we remove a variable or a subset of variables, we learn a new joint probability distribution that can be represented as PĜ(X ∣ Y). We want to gauge how removing a variable or a subset of variables influences other nodes in the network. We will remove a random subset of the variables, starting with 4 variables and incrementing this number randomly, as we evaluate how much the distribution deviates from the original distribution. When we remove variables, we will also remove the edges that lead up to those variables, which are represented as nodes in the graphical structure. We will measure this deviation by making use of the Kullback-Leibler (KL) divergence, which measures how one probability distribution diverges from a second, expected probability distribution.

3.4.1 Learning

Now that we have mutilated our graphical model, we will need to learn the new structure and parameters from the generated data. There are two main types of learning in Probabilistic Graphical Models (PGMs), namely parameter estimation and structure learning. We will first learn the graphical structure of our data from the mutilated network and then proceed with estimating its parameters for our ground truth distribution. Secondly, we will learn the optimal network structure that best represents the dependencies between our variables, which in this case will be ATMs. Learning in PGMs is the cornerstone for constructing models that accurately capture the underlying probabilistic relationships in the data.
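The mutilation operation described in Section 3.4, removing a subset of nodes together with every edge incident to them, can be sketched as follows (plain Python, with the DAG stored as a parent list; the toy network is invented for illustration and is not the exact implementation used in this study):

```python
def mutilate(dag, removed):
    """Remove a set of nodes and every edge incident to them.
    The DAG maps each node to the list of its parents."""
    removed = set(removed)
    return {node: [p for p in parents if p not in removed]
            for node, parents in dag.items()
            if node not in removed}

# Toy network: A -> B, A -> C, B -> C, C -> D
dag = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["C"]}
print(mutilate(dag, {"B"}))   # {'A': [], 'C': ['A'], 'D': ['C']}
```

Removing "B" deletes both the node and its incident edges, so "C" keeps only "A" as a parent; structure learning and parameter estimation are then re-run on the reduced variable set.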
This allows for the creation of models that can make accurate predictions and inferences based on the available data.

3.4.1.1 Parameter Estimation

Parameter estimation refers to the process of determining the parameters of the conditional probability distributions (CPDs) associated with each node in a graph [Bishop 2006]. These parameters specify the probabilistic relationships between the variables, and they are crucial for making predictions and performing inference. If the structure of the PGM is known, as it is in our case, we can use Maximum Likelihood Estimation (MLE) or Bayesian estimation for parameter estimation. Two assumptions need to be met before we can perform parameter estimation:

• D = {ξ[1], ..., ξ[M]} is sampled from P∗, with P∗ being the true or target probability distribution of the variables in the model.
• Instances are independent and identically distributed (IID).

Maximum Likelihood Estimation (MLE), as stated, is one of the methods that can be used to estimate the parameters of a model based on observed data. The MLE estimate for the parameters is obtained by maximizing the likelihood function, which is the probability of observing the given data under the model. For the dataset D = {x^(1), x^(2), ..., x^(N)}, where each x^(i) is a set of observed values for the variables in the model, the likelihood function is given by:

L(Θ ∣ D) = ∏_{i=1}^{N} P(x^(i) ∣ Θ)

where Θ represents the parameters of the model. The MLE estimate for Θ is obtained by maximizing this likelihood function:

Θ̂_MLE = argmax_Θ L(Θ ∣ D)

Often, the log-likelihood function is more convenient to maximize, and it is expressed as follows:

Θ̂_MLE = argmax_Θ log L(Θ ∣ D)

Bayesian estimation is an alternative to MLE; it estimates the parameters by incorporating prior beliefs about them. Bayesian estimation combines the likelihood of the data with a prior distribution over the parameters to compute a posterior distribution.
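For a single Bernoulli variable, the two estimators above reduce to simple counting; a minimal sketch in plain Python (the data and the prior pseudo-counts are invented for illustration):

```python
# Hypothetical binary observations of a single Bernoulli variable
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
M = len(data)
ones = sum(data)

# MLE: maximizing the (log-)likelihood gives the relative frequency
theta_mle = ones / M                       # 7/10 = 0.7

# Bayesian estimation with a Beta(a, b) prior: the posterior is
# Beta(a + ones, b + zeros), whose mean pulls the estimate toward the prior
a, b = 2, 2                                # illustrative prior pseudo-counts
theta_bayes = (a + ones) / (a + b + M)     # 9/14, closer to 0.5 than the MLE
```

As M grows, the pseudo-counts a and b are swamped by the data counts, which is the diminishing-prior behavior described below.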
We can encode our knowledge about θ as a prior using a probability distribution. Here we assume that the observations are conditionally independent given the parameter θ:

P(x[1], ..., x[M], θ) = P(x[1], ..., x[M] ∣ θ) P(θ) = P(θ) ∏_{m=1}^{M} P(x[m] ∣ θ)

P(θ ∣ x[1], ..., x[M]) = P(x[1], ..., x[M] ∣ θ) P(θ) / P(x[1], ..., x[M])

Figure 3.2: A visual representation of a dataset with θ representing the parameters.

If the prior is a Beta distribution, then the posterior is a Beta distribution. As we obtain more data, the effect of the prior diminishes. Thus the Bayesian framework allows us to trade off a diminishing prior as more data becomes available. One common choice for a prior distribution is the Dirichlet prior. The Dirichlet distribution is a generalization of the Beta distribution. If the prior, P(θ), is Dirichlet then the posterior, P(θ ∣ D), is Dirichlet. If P(θ) is Dir(α1, ..., αk), then P(θ ∣ D) is Dir(α1 + M[1], ..., αk + M[k]), where M[k] is the number of occurrences of xk.

3.4.1.2 Structure Learning

In Bayesian networks, structure learning involves determining the optimal network structure that best represents the conditional dependencies between variables [Koller and Friedman 2009]. There are two main approaches for structure learning in Bayesian networks, namely constraint-based methods and score-based methods. The problem statement for structure learning can be formulated as below.

• We assume that the dataset D = {ξ[1], ..., ξ[M]} consists of independent and identically distributed samples drawn from the true distribution P∗(X).
• P∗(X) is induced by a Bayesian network G∗ over X.
• To what extent do the independencies that manifest in G∗ also manifest in D?

We want to find the Bayesian network structure, G, that best represents the dependencies that manifest in D. Constraint-based approaches are methods that rely on statistical tests to infer conditional independence relationships between variables.
These methods build an independence map (I-map) which represents the conditional independence properties of the distribution. We aim to capture the network structure that accurately represents the independencies in the domain, and the basic idea is to build the best minimal I-map. Score-based approaches are methods that assign a score to each possible network structure based on how well it fits the data. Common scoring functions include the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and the Bayesian Dirichlet equivalent (BDe) score. These methods search the space of possible structures to find the one that maximizes the chosen scoring function. Examples of search algorithms include greedy search, hill-climbing, and simulated annealing. In the case of our modeling problem for ATMs, by determining the optimal structure of a probabilistic graphical model that best fits the data, one can gain insights into the relationships and dependencies between these variables (ATM transaction volumes), which can be crucial both for understanding the underlying processes and for making accurate predictions.

3.5 Evaluation

In probabilistic graphical models, inference is the process of answering probabilistic queries about the variables in a model based on observed evidence [Heckerman 1995]. Exact inference in PGMs is computationally intensive and, in general, NP-hard. However, these are worst-case results, and inference with graphical models is mostly manageable for practical applications. We want to be able to compare our ground truth distribution, which we generated by following the method discussed in the experimental setup section, with the new distributions of the mutilated Bayesian networks. We will compare these distributions at the level of probabilistic queries for evaluation.
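As a hedged illustration of the score-based approach discussed earlier in this section, the sketch below compares the BIC of two candidate structures over two binary variables on synthetic data (plain Python; all parameters are invented, and this is a toy stand-in for a full structure search):

```python
import math
import random

rng = random.Random(1)
# Synthetic binary data where B genuinely depends on A
data = []
for _ in range(500):
    a = 1 if rng.random() < 0.5 else 0
    b = 1 if rng.random() < (0.9 if a else 0.2) else 0
    data.append((a, b))
M = len(data)

def bernoulli_loglik(xs):
    """Log-likelihood of binary observations under their MLE frequency."""
    p = sum(xs) / len(xs)
    return sum(math.log(p if x else 1 - p) for x in xs)

# Structure 1: A and B independent (2 free parameters)
ll_indep = (bernoulli_loglik([a for a, _ in data])
            + bernoulli_loglik([b for _, b in data]))
bic_indep = ll_indep - (2 / 2) * math.log(M)

# Structure 2: A -> B (1 parameter for A, plus one per value of A for B: 3 total)
ll_dep = bernoulli_loglik([a for a, _ in data])
for val in (0, 1):
    ll_dep += bernoulli_loglik([b for a, b in data if a == val])
bic_dep = ll_dep - (3 / 2) * math.log(M)

print(bic_dep > bic_indep)   # the structure with the real A -> B edge scores higher
```

The BIC trades off fit against a complexity penalty of (k/2) log M for k free parameters; a hill-climbing search simply evaluates such scores over single-edge additions, deletions and reversals.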
Suppose that the probability PB(x) = α is the probability of x in the ground truth distribution and the probability PB†(x) = β is the probability of x in the new joint probability distribution of the mutilated network. The probability of x in this case would have changed, and we can calculate the Mean Squared Error (MSE) between these two parameters over n such queries using the formula below:

MSE = (1/n) ∑_{i=1}^{n} (α_i − β_i)^2

To compare the learned distribution with the ground truth distribution we will be using a well-known formalism for doing this, the KL divergence. The Kullback-Leibler (KL) divergence is a measure used in probability theory and information theory to quantify how one probability distribution differs from a second, reference probability distribution. We will be measuring how many bits have changed from the learned distribution, and this will be done on the test data or the validation data. Given two probability distributions P and Q defined over the same probability space, the KL divergence from Q (the approximate distribution) to P (the true distribution) is defined as:

D_KL(P ∣∣ Q) = ∑_x P(x) log(P(x) / Q(x))

For continuous distributions, the sum is replaced by an integral:

D_KL(P ∣∣ Q) = ∫ p(x) log(p(x) / q(x)) dx

Since we are generating the ground truth distribution on our own, we are able to verify the learned model against the actual ground truth model. Unfortunately, we will not be able to do the same with the application data, the ATM data, because in that case we do not have access to the ground truth model. However, we can evaluate the model for the ATM data by using accuracy, recall and precision. There are several types of inference tasks in PGMs; we will look at variable elimination and approximate inference for performing probabilistic queries on our learned graphical model.
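The discrete form of the KL divergence above can be computed directly for the marginals returned by probabilistic queries; a minimal sketch using base-2 logarithms, so the result is in bits (the two marginals are invented for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions over the same outcomes."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {0: 0.5, 1: 0.5}    # marginal of some query under the ground truth network
q = {0: 0.7, 1: 0.3}    # same marginal recovered from the mutilated network
print(round(kl_divergence(p, q), 4))   # 0.1258
```

Note that D_KL(P ∣∣ Q) is asymmetric and zero only when the two distributions coincide, which is why it is reported per query against the ground truth rather than averaged symmetrically.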
Figure 3.1 represents an influence structure between stochastic processes; if we reduce the graph structure by removing a node and the edges leading to P2, we get the influence structure below.

Figure 3.3: An influence structure between stochastic processes with a reduced graph structure.

If one of the parameters of a node represents a stochastic process and we remove a node, as in Figure 3.3, from the network after we have learned a probabilistic graphical model, how would this influence the other nodes and the probability density of the learned structure? We can also test for probabilistic inference by having an algorithm that removes nodes from the graphical model. If we let PG(X ∣ Y) represent the graphical model in Figure 3.1 and let PĜ(X ∣ Y) represent the graphical model in Figure 3.3, which is the reduced graph, we know that PG(X ∣ Y) ≠ PĜ(X ∣ Y). Given the reduced graph Ĝ, we want to test if we can still do the same inference with Y as we did with the original graph G. In order to test for this we will utilize the two inference approaches that are explained below.

3.5.0.1 Variable Elimination

Exact inference refers to the computation of exact probabilities or marginal distributions of variables in the model given some evidence [Pearl 1988]. Variable elimination is an exact inference technique for eliminating variables from a joint distribution by iteratively summing out variables based on the factorization of the joint distribution. The main idea of variable elimination is to exploit the structure of the Bayesian network to simplify the computation of marginal probabilities. Below are some of the properties of the structure of a Bayesian network:

• Some sub-expressions of the joint distribution only depend on a small number of variables.
• By caching results we do not have to recompute them.
• Some important properties are:
1. P(A) = ∑_B P(A, B).
2.
∑_C P(A) P(B, C) = P(A) ∑_C P(B, C).

Below is an overview of the variable elimination algorithm:

1. Initialization: start with the factors corresponding to the conditional probability tables (CPTs) of the variables in the network.
2. Elimination: for each variable Xi in the order specified, do the following:
• Multiply all factors containing Xi to form a new factor Fi.
• Sum out Xi from Fi to get a new factor F̂i.
• Replace the factors containing Xi with F̂i in the factor list.
3. Termination: after eliminating all variables, the remaining factors contain the joint probability distribution of the remaining variables.
4. Normalization: to obtain the marginal probabilities, normalize the factors by dividing by the sum of their values.

To reduce the computational complexity of inference, we can make use of variable elimination as it eliminates redundant calculations. We can make use of variable elimination to perform inference for our problem in Figure 3.3.

3.5.0.2 (Alternate) Approximate Inference

In Probabilistic Graphical Models, approximate inference refers to a collection of techniques used to approximate the posterior distribution over variables of interest when exact inference is computationally infeasible [Wainwright and Jordan 2008]. These techniques attempt to provide a close approximation to the true posterior distribution while reducing computational complexity. We will utilize two approximate inference techniques, namely Gibbs sampling and Markov Chain Monte Carlo (MCMC). The Gibbs sampling method attempts to “fix” the sample by re-sampling some of the variables. It generates a sample with arbitrary values (except for the evidence), and then re-samples all variables, keeping the evidence fixed, in an arbitrary order. Unlike forward sampling, the Gibbs sampling technique takes into account the downstream evidence.
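A minimal sketch of this re-sampling idea on a toy two-node network A → B (plain Python; the CPD parameters are invented). With a single non-evidence variable each step draws directly from the exact conditional, whereas a full Gibbs sampler would cycle through many variables, each conditioned on its Markov blanket:

```python
import random

# Toy network A -> B with Bernoulli CPDs (invented parameters)
P_A1 = 0.3                   # P(A = 1)
P_B1 = {0: 0.2, 1: 0.9}      # P(B = 1 | A)

def gibbs_estimate(b_evidence, n_steps, rng):
    """Estimate P(A = 1 | B = b) by repeatedly re-sampling A from its full conditional."""
    a = rng.randint(0, 1)    # arbitrary starting assignment (re-sampled immediately here)
    count_a1 = 0
    for _ in range(n_steps):
        # Full conditional: P(A | B = b) is proportional to P(A) * P(B = b | A)
        w1 = P_A1 * (P_B1[1] if b_evidence else 1 - P_B1[1])
        w0 = (1 - P_A1) * (P_B1[0] if b_evidence else 1 - P_B1[0])
        a = 1 if rng.random() < w1 / (w0 + w1) else 0
        count_a1 += a
    return count_a1 / n_steps

# Exact answer by Bayes' rule: 0.3 * 0.9 / (0.3 * 0.9 + 0.7 * 0.2) = 27/41, about 0.659
estimate = gibbs_estimate(1, 20000, random.Random(42))
```

As the number of steps grows, the empirical frequency converges to the exact posterior 27/41, illustrating how the chain of samples approaches the desired distribution.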
As we repeat the sampling process, we get closer and closer to the desired posterior:

P_ϕ(X) = P(X ∣ e)

The Gibbs sampling technique is an example of a class of methods called Markov Chain Monte Carlo (MCMC). MCMC is an iterative process in which successive samples move closer and closer to the desired posterior distribution. A Markov chain is defined as a graph of states over which the sampling algorithm takes a random walk. The following applies to Markov chains:

• The nodes of this graph are possible assignments of a variable x ∈ Val(X).
• The edges are a transition model, T, which specifies for each pair of states x and x̂ the probability T(x → x̂).

3.6 Modeling the Influence of Transaction Volumes between ATMs

It turns out that the modeling problem formulated in Section 3.1 is an important problem in real-world scenarios, one of which is modeling the relationships between ATMs. Automated Teller Machines (ATMs) are ubiquitous in our society and are widely used for financial transactions. These devices play a crucial role in the financial system as they give customers the ability to withdraw and deposit money, offer convenience, and provide a variety of other benefits to banks and customers. These machines are often deployed at a large scale in order to serve a client base spread out over wide geographical locations. The size and the spread of the ATM fleet introduce a logistical and financial optimization problem for banks. Understanding the relationships between ATMs can be applied to various problems facing financial institutions. Analyzing the relationship between ATMs can assist with detecting patterns of fraudulent behavior, such as coordinated attacks on multiple ATMs. Data-driven decision making can enable banks to make informed decisions about whether to install a new ATM or remove an ATM at a given location based on the relationship between all the ATMs, forecast transactions and optimize the use of resources.
If an ATM or a subset of ATMs in a network of ATMs were to go offline due to ATM bombings, load shedding, planned maintenance and so on, how would this affect the rest of the ATMs in the same region? Would customers move to transact at the nearest ATM? What associations manifest in the data, and how would we go about detecting vulnerabilities in the network? Predicting the influence of the different ATMs in the network can be achieved by aggregating the features of each ATM. We will be modeling the dependencies and relationships between different ATMs and their transaction volumes by using a probabilistic graphical model. The variables of study will mainly be the ATM transaction volumes: each node in the Bayesian network will represent the transaction volumes of an individual ATM. There will be local probability distributions inside each of these nodes that model the ATM transaction volumes. The Conditional Probability Distribution of a variable is a function of the variable itself and its parents. This local probability distribution can be modeled using, for example, a tree CPD, a tabular CPD or a Gaussian CPD. In our case, where we are studying transaction volumes, which are continuous variables, we have chosen Gaussian CPDs. The dependencies between the nodes will be represented by edges; for example, the transaction volumes at an ATM might depend on the transaction volumes at nearby ATMs, the time of day, or external factors like holidays. Historical ATM transaction data will then be used to learn the Conditional Probability Distributions and the structure of the Bayesian network. The structure of the Bayesian network will be determined by using structure learning algorithms.
Structure learning is the process of determining the graph structure that best represents the dependencies among the variables in the model. Parameter learning will be used to estimate the CPDs for each node given the structure. To analyze the outcomes of the modeling process we will utilize variable elimination for exact inference. Once we have a structure that represents the dependencies between the nodes and their CPDs, we can mutilate the network by removing a node or a subset of nodes along with the edges leading up to these nodes. Removing nodes means that we specify the nodes to be dropped from the graphical structure, which modifies the structure of the model. Parameter estimation will then be used to learn the new parameters of the modified structure so that exact inference can be used to see how the removal of nodes affects the outcomes. Model evaluation will follow the same steps as outlined in section 3.4, and to measure the deviation between the original structure of the Bayesian network and the new structure where nodes are removed we will use the Kullback-Leibler (KL) divergence, which measures how one probability distribution diverges from a second, expected probability distribution. The figure below, Figure 3.4, is a visual rep