The Effects of Node Removal on Bayesian Network Resilience for ATM Network Transaction Vulnerabilities

Gcobisile Matafeni
709637

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, in fulfillment of the requirements for the degree of Master of Science.

Supervised by Prof. Ritesh Ajoodha and Dr. Seun Olukanmi

May 30, 2025

Abstract

We investigate the evaluation of influence relationships in probabilistic graphical models, focusing on the impact of node removal (mutilation) within Bayesian networks. The central problem addressed is understanding how the joint probability distribution and influence structure among interconnected variables evolve when a subset of nodes is removed, an issue relevant to many real-world systems experiencing disruptions. We model these dynamics using Bayesian learning to provide insights into network resilience and dependencies. To explore these effects, we generate synthetic Bayesian network structures that are tree-like, sparse, and dense, each representing a different real-world configuration found in machine learning, sensor networks, and financial modeling. Conditional Probability Distributions (CPDs) are assigned to nodes based on the Bernoulli distribution. The Kullback-Leibler (KL) divergence quantifies the deviations in influence structures post-removal, and structure recovery is evaluated using an exact inference technique. Our findings indicate that each network type exhibits a distinct response to node removal: tree-like structures stabilize quickly with increased data, sparse structures show higher sensitivity but recover efficiently, and dense structures offer robustness through redundancy, though they demand larger datasets. These findings have significant implications for optimizing complex systems, particularly those requiring resilient network architectures.
As a real-world application, we model ATM transaction networks to analyze how the removal of ATMs (due to vandalism, load shedding, or maintenance) impacts transaction flows. Our results show that high-traffic ATMs serve as critical nodes, significantly influencing neighboring ATMs when removed. By applying Bayesian structure learning, we demonstrate that optimal ATM network configurations can be identified to minimize disruption and improve financial service resilience. This study contributes to the growing field of probabilistic graphical models by introducing a novel approach to understanding influence dynamics in mutilated networks. It provides practical insights and lays a foundation for further research into complex systems where node integrity and network stability are critical for decision-making and operational efficiency.

Keywords: Bayesian network, probabilistic graphical models, mutilation, influence relationships, node removal, ATM network resilience, Kullback-Leibler divergence, structure learning

Declaration

I, Gcobisile Matafeni, hereby declare the contents of this master's dissertation to be my own work. This dissertation is submitted for the degree of Master of Science in Computer Science at the University of the Witwatersrand. This work has not been submitted to any other university, or for any other degree.

Signature: Gcobisile Matafeni

Acknowledgements

I would like to express my sincere gratitude to my supervisor, Prof. Ritesh Ajoodha, and my co-supervisor, Dr. Seun Olukanmi, for their unwavering support, invaluable guidance, and continuous motivation throughout my research journey. Their expert insights and constructive feedback have been instrumental in shaping the direction and quality of this dissertation. I am also deeply grateful to Rudzani Mulaudzi from the Explainable AI Lab, whose assistance with my coding challenges provided crucial breakthroughs when I felt stuck.
His willingness to share his expertise significantly enhanced my understanding and implementation of the experimental framework. My heartfelt thanks go to Ankur Ankan and Abinash Panda for their efforts in developing the Python pgmpy framework, which served as the foundation for the experimental implementation in this study. Their work not only facilitated my research but also enriched my practical experience with advanced probabilistic graphical models. Finally, I wish to acknowledge my family and friends, whose continuous encouragement, understanding, and support have been a constant source of strength. I am especially thankful to my family, to whom I dedicate this dissertation, for believing in my potential and for always being there for me.

Contents

Abstract
Declaration
Acknowledgements
List of Figures
List of Tables

1 Introduction
2 Background and Related Work
  2.1 Introduction
  2.2 Reasoning Patterns in Bayesian Modeling
  2.3 Forecasting Techniques for ATM Networks
  2.4 Optimizing Cash Flow and Resource Allocation
  2.5 Related Work
    2.5.1 Mapping Influence in Complex Systems
    2.5.2 Conclusion
3 Methodology
  3.1 Problem Formulation
  3.2 Research Questions and Experimental Setup
    3.2.1 Experimental Setup
  3.3 Application Domain
    3.3.1 Modeling Influence Relationships
  3.4 Mutilating the Network
    3.4.1 Learning
  3.5 Evaluation
  3.6 Modeling the Influence of Transaction Volumes between ATMs
    3.6.1 Data Preprocessing
    3.6.2 Data and Ethical Implications
    3.6.3 Practical Implementation and Challenges
  3.7 Conclusion
4 Experimental Results and Discussion
  4.1 Research Questions 1 and 2: What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network? & What are the Implications of Varying Structural Complexity on the Recovery of Bayesian Networks and their KL Divergence?
    4.1.1 Tree-like Structures
    4.1.2 Sparse Structures
    4.1.3 Dense Structures
    4.1.4 Comparative Analysis and Synthesis Across Network Models
  4.2 Research Question 3: How can Insights from Bayesian Networks Improve the Operational Efficiency and Planning of ATM Networks?
    4.2.1 Comparing Original and Reduced Networks
  4.3 Conclusion
5 Conclusion and Future Work
References

List of Figures

1.1 A bombed/vandalised Automated Teller Machine (The Citizen, 2017).
2.1 Different influence structures that depict the influence relationships of node Z with other nodes.
2.2 A Bayesian network influence structure.
2.3 A Bayesian network influence structure with a removed node (Z).
3.1 An influence structure between stochastic processes. The red arrows indicate the influence between these processes while the solid arrows indicate the parameters of each node.
3.2 A visual representation of a dataset with θ representing the parameters.
3.3 An influence structure between stochastic processes with a reduced graph structure.
3.4 A distribution of ATM networks in the Johannesburg region for a South African bank. The yellow circles with white stars in the center indicate an individual ATM; the red crosses on these ATMs indicate a hypothetical event of removing an ATM from a network.
4.1 10 randomly generated tree-like DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.2 A snapshot of the top 10 samples that are generated from one of the GTD DAGs.
4.3 A plot of the training data versus the KL divergence for GTDs with 150 nodes comparing the Tree Search and MWST algorithms.
4.4 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning.
4.5 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning with mutilation.
4.6 A plot of the training data versus the KL divergence for GTDs with 50 nodes using MWST for structure learning.
4.7 A plot of the training data versus the KL divergence for GTDs with 50 nodes using MWST for structure learning with mutilation.
4.8 A plot of the training data versus the KL divergence for GTDs with 100 nodes using MWST for structure learning.
4.9 A plot of the training data versus the KL divergence for GTDs with 100 nodes using MWST for structure learning with mutilation.
4.10 A plot of the training data versus the KL divergence for GTDs with 150 nodes using MWST for structure learning.
4.11 A plot of the training data versus the KL divergence for GTDs with 10 nodes using MWST for structure learning with mutilation.
4.12 10 randomly generated sparse DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.13 A snapshot of the top 10 samples that are generated from one of the GTD DAGs.
4.14 A plot of the training data versus the KL divergence for GTDs with 10 nodes comparing the BIC, KIC and BDScore for scoring while using the Hill Climb Search algorithm for structure learning.
4.15 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for scoring.
4.16 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.17 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for scoring.
4.18 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.19 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for scoring.
4.20 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.21 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for scoring.
4.22 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.23 10 randomly generated dense DAGs with random edge orientations and random Bernoulli CPDs assigned to each node.
4.24 A plot of the training data versus the KL divergence for GTDs with 10 nodes comparing the BIC, KIC and BDScore for scoring while using the Hill Climb Search algorithm for structure learning.
4.25 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for scoring.
4.26 A plot of the training data versus the KL divergence for GTDs with 10 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.27 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for scoring.
4.28 A plot of the training data versus the KL divergence for GTDs with 15 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.29 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for scoring.
4.30 A plot of the training data versus the KL divergence for GTDs with 20 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.31 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for scoring.
4.32 A plot of the training data versus the KL divergence for GTDs with 25 nodes using Hill Climb Search with BIC for structure learning with mutilation.
4.33 A plot of the total transaction volumes for 156 ATMs starting from January 2021 until November 2024.
4.34 A heatmap of transaction volumes for the first 30 days for the first 10 ATMs.
4.35 A box plot of transaction volumes for the first 5 ATMs.
4.36 A plot of the distribution of transaction volumes for ATM 1.
4.37 A snapshot of the quantized ATM transaction data.
4.38 A recovered Bayesian Network for the first 20 ATMs.
4.39 A recovered Bayesian Network for the first 20 ATMs overlaid on a GIS map with their longitude and latitude.
4.40 Evaluation metrics applied on 3 ATMs.
4.41 Reduced graph structure with the removal of ATM 15.
4.42 Reduced graph structure with the removal of ATM 15 overlaid on a GIS map.
4.43 Evaluation metrics on two ATMs after the graph is reduced.
4.44 A recovered Bayesian Network for the first 156 ATMs.
4.45 Snapshot of a recovered Bayesian Network for 156 ATMs.
4.46 Evaluation metrics applied on ATM 62, ATM 105, ATM 3 and ATM 20.
4.47 Reduced graph structure with the removal of ATM 15.
4.48 Reduced graph structure with the removal of ATM 15.
4.49 Evaluation metrics applied on ATM 62, ATM 105, ATM 3 and ATM 20.
5.1 An influence structure between stochastic processes with a reduced graph structure for a temporal model set-up.

List of Tables

2.1 A table comparing different forecasting techniques that were used by other authors.
2.2 A table comparing different optimization techniques that were used by other authors.
2.3 A table comparing different influence modeling techniques that were used by other authors.

Chapter 1

Introduction

Numerous scientific phenomena can be conceptualized as stochastic processes, which characterize the evolution of random variables over time or space. Employed for modeling and analyzing random occurrences, stochastic processes offer insights into the interactions of complex systems amidst uncertainty. By treating each phenomenon as a distinct stochastic process with its own variables, we can discern the relationships and influences between them. The relationships between these variables can be expressed with a joint probability distribution that models the influence among them. Given a set of these variables and a joint probability distribution that describes the influence between them, how would removing a variable or a subset of variables affect the influence structure and the new joint probability distribution of the remaining variables?

A grouping of Automated Teller Machines (ATMs) can be viewed through the lens of stochastic processes. ATMs, which are prevalent in society, serve as key tools for financial transactions. They offer customers the ability to withdraw and deposit money, providing convenience and various other benefits to both banks and customers. Due to their widespread deployment across geographically dispersed locations, ATMs present banks with logistical and financial optimization challenges. Analyzing the connections between ATMs can yield insights applicable to a range of issues encountered by financial institutions.
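The idea that a joint probability distribution "models the influence" between variables can be stated precisely. A Bayesian network over variables $X_1, \dots, X_n$, with a directed acyclic graph assigning each node a parent set $\mathrm{Pa}(X_i)$, factorizes the joint distribution as

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr),
```

so removing a variable deletes its factor, and the remaining factors (and, after re-learning, the structure itself) must absorb whatever dependencies that factor mediated. This is exactly the question posed above about the distribution of the remaining variables.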
If an ATM or a subset of ATMs in a network were to go offline due to ATM bombing, load shedding, planned maintenance, or similar events, how would this affect the rest of the ATMs in the same region? Would customers move to transact at the nearest ATM? What associations manifest in the data, and how would we go about detecting vulnerabilities in the network?

Figure 1.1: A bombed/vandalised Automated Teller Machine (The Citizen, 2017).

Previous studies examining influences between ATMs have primarily focused on forecasting transactions and optimizing various operational aspects, such as predicting daily cash demand, scheduling efficient cash replenishment, and strategically placing ATMs, in order to enhance overall network efficiency and customer service. Different techniques have been explored for predicting the daily replenishment of ATMs, namely regression approaches and Long Short-Term Memory networks. Simutus et al. [2007a] tried to solve the same problem by utilizing neural networks and support vector regression techniques, which performed considerably well at forecasting cash demand. Genevois et al. [2015] looked at the ATM location problem concurrently with cash management by utilizing a cash demand model for every ATM in a network.

Ajoodha [2018] used dynamic Bayesian networks to model the influence between stochastic processes by making use of score-based structure learning, devising an algorithm that recovers the influence relations between a set of stochastic processes. Ajoodha and Rosman [2020] provided a method for learning the influence between processes that are represented by hidden Markov models (HMMs) as opposed to dynamic Bayesian networks. In our study, we aim to explore how the influence among variables evolves when a variable or subset of variables is eliminated. This approach has not been previously explored in the existing literature that we have examined.
This study aims to explore the questions below by investigating the effect of node removal in Bayesian networks. The key research questions guiding this investigation are as follows:

1. What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network? This question examines how the probabilistic structure and information content of a Bayesian network change when certain nodes and their connections are removed. By quantifying the deviation in learned distributions using KL divergence, we evaluate the network's ability to recover its original properties post-mutilation.

2. What are the implications of varying structural complexity on the recovery of Bayesian networks and their KL divergence? To answer this, we analyze different types of Bayesian networks, namely tree-like, sparse, and dense structures, to determine how structural differences influence resilience to node removal. This comparison helps identify which network types are more robust to disruptions.

3. How can insights from Bayesian networks improve the operational efficiency and planning of ATM networks? Applying Bayesian modeling to ATM transaction networks allows us to study how transaction flows change when ATMs go offline. By identifying critical nodes and their influence, we provide recommendations for optimizing ATM placement and mitigating service disruptions.

Our principal contribution is to show how the joint probability distribution and influence structure of a Bayesian network change when a subset of nodes is removed. To accomplish this, we begin by creating a random Bayesian network, where the nodes are connected to form a directed acyclic graph and each node is assigned a Conditional Probability Distribution using a Bernoulli distribution. We then generate random samples to form a dataset, learn the probability patterns, and systematically remove nodes to re-learn the altered patterns.
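The steps of this pipeline can be sketched end-to-end on a toy example. Everything below is illustrative: the three-node chain X → Y → Z, its Bernoulli parameters, and the use of a product-of-marginals model to stand in for the re-learned network after the mediating node is removed are all assumptions made for the sketch, not the dissertation's actual experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain X -> Y -> Z with Bernoulli CPDs (illustrative parameters only).
P_X1 = 0.6                     # P(X = 1)
P_Y1 = {0: 0.2, 1: 0.8}        # P(Y = 1 | X = x)
P_Z1 = {0: 0.3, 1: 0.7}        # P(Z = 1 | Y = y)

def forward_sample(n):
    """Ancestral sampling in topological order X, Y, Z."""
    x = rng.random(n) < P_X1
    y = rng.random(n) < np.where(x, P_Y1[1], P_Y1[0])
    z = rng.random(n) < np.where(y, P_Z1[1], P_Z1[0])
    return x.astype(int), y.astype(int), z.astype(int)

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as arrays of cell probabilities."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

x, _, z = forward_sample(50_000)

# Empirical joint over the surviving variables (X, Z).
joint = np.array([[np.mean((x == i) & (z == j)) for j in (0, 1)] for i in (0, 1)])

# After mutilating (removing) Y, a model with no remaining edge between X and Z
# treats them as independent: the product of their marginals.
px = joint.sum(axis=1)         # P(X)
pz = joint.sum(axis=0)         # P(Z)
independent = np.outer(px, pz)

# A positive KL divergence measures the influence between X and Z that Y carried.
print(round(kl(joint.ravel(), independent.ravel()), 3))
```

The divergence comes out positive here because X and Z were dependent only through Y; on a network where the removed node mediates no influence, it would be near zero.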
This process allows us to quantify how network influence and structure are affected by node removal.

The remainder of the dissertation is structured as follows. Chapter 2 presents the background and related work on Bayesian learning, detailing the methods that have been used to solve similar problems. The research problem is stated in Chapter 3, along with the methodology used to address it. Chapter 4 focuses on the experimental results and discusses the limitations of our work. Chapter 5 concludes the dissertation.

Chapter 2

Background and Related Work

2.1 Introduction

The previous chapter introduced the general problem area and the specific problem area for this study. In this chapter we review work related to modeling the influence between stochastic processes using graphical models. Section 2.3 reviews how other authors have approached similar problems, focusing first on forecasting transaction volumes, secondly on optimizing the spread of ATMs, and lastly on how influence modeling has been used for stochastic processes.

2.2 Reasoning Patterns in Bayesian Modeling

In the context of graphical models, influence flows in a specific way, as demonstrated below on three nodes with indirect connections: X, Y, and Z. If we remove a node, this changes the influence relationships between this set of nodes. Take node Z as an example: the influence relationships to Z can flow from different directions, as demonstrated in Figure 2.1 below.

Figure 2.1: Different influence structures that depict the influence relationships of node Z with other nodes.

Given a Bayesian network structure as demonstrated in Figure 2.2, we know that if we remove node Z, it is well understood in graphical models that influence moves through the network via the conditional independence associations.
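The trail-activation rules described in this section can be checked numerically. As a minimal sketch with assumed Bernoulli parameters, the simulation below builds the classic v-structure A → C ← B and verifies that the two parents are independent while the common child is unobserved, and become dependent once it is observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Collider (v-structure) A -> C <- B with illustrative Bernoulli parents.
a = rng.random(n) < 0.5
b = rng.random(n) < 0.5
c = a ^ b  # child is a deterministic XOR of the parents, purely for a vivid effect

def dependence(x, y):
    """|P(y=1 | x=1) - P(y=1 | x=0)|: zero (up to noise) iff binary x and y are independent."""
    return abs(y[x].mean() - y[~x].mean())

# Trail blocked: with C unobserved, the parents are (near) independent.
print(dependence(a, b) < 0.02)
# Trail opened: conditioning on C = 1 couples A and B strongly.
print(dependence(a[c], b[c]) > 0.9)
```

Both checks print True: observing the child "explains away" one parent given the other, which is the one pattern where conditioning creates rather than removes dependence.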
Figure 2.2: A Bayesian network influence structure.

Figure 2.3 below depicts the new influence structure when node Z is removed from the Bayesian network above. We can guarantee independence between variables only when there is no active trail between them. Through direct and indirect connections, influence can still flow through a graphical model structure. The only exception is a v-structure, which blocks influence: the two parent nodes are conditionally independent of each other as long as the child node is not observed.

Figure 2.3: A Bayesian network influence structure with a removed node (Z).

2.3 Forecasting Techniques for ATM Networks

Accurate forecasting of ATM transactions is crucial for efficient financial planning and operational management. Various studies have explored time-series models, machine learning algorithms, and hybrid approaches to predict cash demand and transaction volumes. Traditional statistical models such as ARIMA and VARMAX have been widely used, while recent advancements leverage deep learning techniques such as Long Short-Term Memory (LSTM) networks and convolutional neural networks for improved accuracy. The integration of feature engineering and clustering methods has further enhanced forecasting precision, allowing financial institutions to optimize cash replenishment schedules and reduce operational costs.

The Auto-regressive Integrated Moving Average (ARIMA) is a forecasting technique suggested by Box and Jenkins [Jenkins 1970] that has enjoyed application in a variety of forecasting fields. An ARIMA model predicts each value of a time series as a linear combination of its past values and past errors [Kamini et al. 2014]. Given a time series model of the form ARIMA(p, d, q), Kamini et al.
[2014] proposed an ARIMA model, a Multi-Layer Feed-Forward neural network (MLFF), a Generalized Regression Neural Network (GRNN), and a Wavelet Neural Network (WNN) to forecast cash demand in ATMs. In their study, Kamini et al. [2014] used NN5 competition data containing daily time series of cash withdrawals from 111 ATMs in different parts of England over a period of 2 years. In the first method, they used traditional non-chaotic forecasting approaches, imputing missing values and deseasonalizing the data with a time lag of 7 days. In the second method, with non-traditional chaotic approaches, they reconstructed the phase space using chaotic parameters, namely embedding dimension and delay time, with the TISEAN tool. The study used the correlation dimension and false nearest neighbor methods to identify chaos in the data. A sequential split was performed on the data for each of the ATM centers, dividing it into 80% for training and 20% for testing. The best performing model under the first method was the GRNN, with an SMAPE of 23.16%. Under the second method, with the phase space reconstructed using a delay time of 2 and an embedding dimension of 2, the GRNN again yielded the best Symmetric Mean Absolute Percentage Error (SMAPE), at 14.713%.

Rafi et al. [2020] proposed a time series model for predicting cash demand in a network of ATMs for a specific financial institution in Pakistan. Instead of using the standard ARIMA approach as done by Kamini et al. [2014], Rafi et al. [2020] built a Vector Auto-Regressive Moving Average with Exogenous Variables (VARMAX) model using the transaction data of each ATM. The study used transaction data from 7 ATMs for the period June 2013 to December 2015 in a region with significant financial activity. The data was pre-processed for parsing into the model, and the Dickey-Fuller test was used to check for stationarity.
The split ratio was 70% for training and 30% for testing. The proposed model by Rafi et al. [2020] takes into account exogenous variables such as holidays, salary weeks, and weekdays. The data was made stationary using differencing techniques. The VARMAX model produced an average SMAPE of 31.4%. This is the same evaluation metric used by Kamini et al. [2014], but the result is significantly larger compared to theirs; it must be noted that a different forecasting model and dataset were used.

Another prominent forecasting technique is the Long Short-Term Memory network (LSTM), which has recently been shown to outperform state-of-the-art univariate time series forecasting methods [Bandara 2017]. Given a group of similar time series, Bandara [2017] forecast across time series databases using LSTMs. The LSTMs were applied to two benchmark data sets, namely the CIF2016 and NN5 data sets. The CIF2016 dataset consisted of monthly time series data composed of two subgroups: one related to the banking industry and the other an artificially generated series. The forecasts were performed on a total of 72 time series for different months-ahead horizons. The second dataset, NN5, contained 2 years of daily cash withdrawals at various automated teller machines (ATMs) located in the UK. A total of 111 time series for the ATMs was provided, with the requirement that forecasts be performed for a prediction horizon of 56 days ahead. The NN5 data comprised different real-world forecasting challenges, such as outliers, step-ahead forecasting, missing values, and multiple seasonalities. It must be noted that this is the same data set used by Kamini et al. [2014], but a different model was used to solve the same problem. The evaluation function used for both CIF2016 and NN5 was the Symmetric Mean Absolute Percentage Error (SMAPE).
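Since SMAPE is the common yardstick across these NN5 studies, a small reference implementation helps make the reported percentages concrete. This sketch uses the widespread (|A| + |F|)/2 denominator; published variants differ slightly, so cross-paper comparisons implicitly assume a shared definition. The withdrawal figures below are made up purely for illustration.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error, in percent.

    Uses the common (|A| + |F|) / 2 denominator; other variants exist,
    so reported scores are only comparable under the same definition.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return 100.0 * np.mean(np.abs(forecast - actual) / denom)

# Hypothetical daily withdrawal totals versus a naive flat forecast.
actual = np.array([120.0, 95.0, 110.0, 130.0])
forecast = np.array([100.0, 100.0, 100.0, 100.0])
print(round(smape(actual, forecast), 2))
```

For these made-up numbers the score is roughly 14.7%, in the same ballpark as the best NN5 results quoted above; a perfect forecast scores 0%.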
Bandara [2017] proposed a new LSTM method, called LSTM.Cluster, which yielded the following results: for CIF2016 the SMAPE was 10.5%, and for the NN5 dataset the SMAPE was 21.6%. In the study by Kamini et al. [2014], the GRNN used in the second method of their approach outperformed the LSTM.Cluster of Bandara [2017] by a margin of 6.887%.

Andrawis et al. [2011] presented models to forecast the same 111 time series explored by Kamini et al. [2014] and Bandara [2017]. Andrawis et al. [2011] used time series data spanning a period of 2 years with the aim of forecasting for a horizon of 56 days. The authors experimented with a combination of multiple forecasts and models for their study. Model pre-testing began with the important step of deseasonalizing the time series data. The models were then pre-screened by comparing computational intelligence and linear models. A total of 140 models were experimented with, each with different parameters, pre-screening, and post-processing methods. The models were trained on 97 weeks of data, and 8 weeks of data were used for testing. A total of 9 models achieved the best accuracy; contrasted with Kamini et al. [2014], this study used 4 more models for forecasting, which made it a robust study. The combination of models was selected based on their accuracy. To evaluate model accuracy, the Symmetric Mean Absolute Percentage Error (SMAPE) was used, which was the main evaluation metric for the competition.

Similarly, Jörg [2011] used the NN5 time series data with hybrid models to forecast for a 56-day horizon. Their contribution to the NN5 competition placed 6th out of a total of 19 teams. When the data was preprocessed, it exhibited a weekly seasonality, as evidenced by the autocorrelation function.
A total of 3 different models were trained: a neural network ensemble, an ensemble of nearest trajectory models, and lastly a model for the 7-day cycle. A weighted mean was calculated by combining all 3 models, with the weights inversely proportional to the forecast errors on the validation set. The evaluation metric used was the Symmetric Mean Absolute Percentage Error (SMAPE). The combined SMAPE for the validation set was 19.38%. Coyle et al. [2010] utilized self-organizing neural networks on the same NN5 forecasting time series data surveyed in the above literature. The self-organizing fuzzy neural network (SOFNN) is a specific type of fuzzy neural network [Leng and Prasad 2004]; it has five layers and can self-organize. Coyle et al. [2010] utilized structure learning for the learning process of the SOFNN as well as for parameter learning. The Symmetric Mean Absolute Percentage Error (SMAPE) was also used as a measure to evaluate the SOFNN predictive outcomes, and a mean SMAPE of 20.4% was achieved. Rajwani et al. [2018] used exploratory data analysis and LSTMs for ATM cash flow prediction. ATM transaction data was collected for ten ATMs distributed in the Karachi region. The data was collected over a period of 2 and a half years, amounting to 120 246 rows. A total of 60 features came with the data before pre-processing. After performing exploratory data analysis and a few transformations of the data, Rajwani et al. [2018] made a few interesting observations. Most transactions take place at the beginning and the end of the month, which can be attributed to the fact that people are most likely to be paid at the end or the start of the month. The split ratio for the experimentation was 70/30, with 70% for training and 30% for testing. This split ratio is similar to the ratio used by Rafi et al. [2020] in their study.
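The 70/30 and similar splits used in these studies are chronological rather than shuffled, since the test period must follow the training period in time. A minimal sketch (the function name and data are ours):

```python
def chronological_split(series, train_frac=0.7):
    """Split a time series without shuffling: the test set is the final
    portion, so it always lies after the training data in time."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

withdrawals = list(range(100))        # stand-in for 100 days of withdrawal totals
train, test = chronological_split(withdrawals, 0.7)
print(len(train), len(test))          # → 70 30
```

Shuffled splits would leak future information into training, which is why time series studies such as these always hold out the most recent observations.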
The LSTM model had a split ratio of 67/33, and 20 epochs were run. This yielded an MSE of 0.027 on the training dataset and an MSE of 0.028 on the testing dataset. The RMSE indicated that the predictions were 0.027 away from the actual target. Simutus et al. [2007a] explored the use of neural networks and support vector machines to forecast cash demand for ATMs. Simutus et al. [2007a] used simulated and real data for training the models and evaluating their performance. The data was gathered or simulated over a period of 2-3 years for training a reliable ANN and SVR on a total of 15 ATMs. Each ATM had a separate three-layer feed-forward neural network, with the input variables being coded values of weekday, day of the month, month of the year, holiday effect value, and average daily cash demand. The ANN was trained using the Levenberg-Marquardt optimization method, with the output variable being the ATM's cash demand for the next day. The data split ratio was 70% for training and 30% for testing. The MAPE of the ANN on the simulated data was 0.76%, while for the SVR it was 4.1%. Since there was a total of 15 ATM sites, the authors took the average MAPE for both models: the average MAPE for all 15 ATMs using the ANN model was 20.6%, while for the SVR it was 25.1%. Bhandari and Gill [2016] proposed an integrated forecasting technique which combines the back-propagation algorithm with a genetic algorithm. Whereas Simutus et al. [2007a] used ANNs and SVMs, Bhandari and Gill [2016] used a hybrid neural network for robustness. The proposed hybrid BP/GA model started with the collection of data, followed by normalization, feature extraction, training, and testing. A combination of 5 input variables was used: day number, week day, weekend day, salary effect, and holiday effect. Different population sizes were used with different numbers of hidden neurons and iterations.
The maximum population size achieved a Mean Absolute Percentage Error (MAPE) of 1.85%. In addressing the improvement of forecasting cash demand for ATMs, Arabani and Komleh [2018] used a Convolutional Neural Network on data provided by an Iranian banking institution. Daily ATM cash withdrawal data was used as the input dataset for forecasting. In addition to the ANN and SVR methods of Simutus et al. [2007a], Arabani and Komleh [2018] added a statistical method and CNNs. The statistical method starts with an F-test and proceeds until it produces a regression model. The remaining methods split the data for training and testing, and all four methods were evaluated using the Root Mean Squared Error (RMSE). The CNN produced the best results of all 4 methods, with an RMSE of 4.74% under cross-validation. Fallahfati et al. [2022] also studied the forecasting of ATM cash demand, but in their case they considered 2 scenarios: before and after the COVID-19 pandemic. Their aim was to accurately predict ATM cash demand by making use of statistical and machine learning models. The authors identified 3 categories of ATMs based on accessibility and environmental factors, namely residential districts, business districts, and shopping and recreation districts. A 3-year period of daily withdrawal data from 2017 to 2020 was used, split by the period in which the COVID-19 pandemic was observed. Parametric and non-parametric approaches were used for forecasting ATM cash demand: 5 parametric models and 4 non-parametric machine learning models. A total of 12 features were selected based on market dynamics in Tehran, Iran. To evaluate the accuracy of their predictions, the authors used the Mean Squared Error (MSE). The best performing model was the regression Random Forest (RF) approach, with an MSE of 0.0101. Simutus et al.
[2007a] explored the use of neural networks and support vector machines, and similarly, Ramírez and Acuña [2011] also made use of ANNs and SVMs. The authors utilized the NN5 competition data, with a focus on daily cash withdrawals for ATMs in England. A total of 691 data points from 30 ATMs were used for training, and 100 data points were used for testing purposes. The standard evaluation metric for the NN5 competition is the SMAPE, and the authors utilized the same metric to evaluate their models. The best performing model was the MLP, with a mean SMAPE of 21.37%. Gurgul and Suder [2016] studied the calendar and seasonal effects on daily withdrawals for ATMs managed by the Euronet network. The study was based on ATM data for two regions, namely Lesser Poland and Subcarpathia. A total of 222 ATMs were used for a period from January 2010 to December 2012. The authors were able to prove the existence of calendar effects on ATM withdrawals: withdrawals depended on the day of the week, the week of the month, and the month of the year. While the authors surveyed above mostly use machine learning approaches for forecasting ATM cash demand, Venkatesh et al. [2013] proposed the use of neural networks with the clustering of ATMs by grouping them according to features. The ATMs were clustered by making use of Levenshtein SAM distances. The authors made use of the NN5 cash withdrawal data for a period of 2 years. A total of four neural network architectures were used: Multi-Layer Feed Forward (MLFF), Wavelet Neural Networks (WNN), Group Method of Data Handling (GMDH), and the Generalised Regression Neural Network (GRNN). The GRNN achieved the best SMAPE of 23.16%, and the overall approach utilized by the authors outperformed the results of Andrawis et al. [2011]. Sarveswararao et al.
[2022] studied ATM cash demand for an Indian bank by modeling chaos and making use of deep learning. A total of 50 ATMs were used for the study; 28 of the ATMs were observed to exhibit chaos, and the remaining 22 did not. A statistical technique was combined with machine learning techniques, ranging up to deep learning methods such as Long Short-Term Memory (LSTM) neural networks. A total of 12 different techniques were utilized for forecasting ATM cash demand. The deep learning LSTM model outperformed all other models, achieving a mean SMAPE of 24.17%. As a predecessor to the work done by Simutus et al. [2007a], Simutus et al. [2007b] explored the use of a flexible neural network for forecasting ATM cash demand. The ANN was also trained by making use of the Levenberg-Marquardt optimization method, and the data was split at a ratio of 70% for training and 30% for testing. Simulation data was generated to depict the behavior of a typical ATM in the city of Kaunas, Lithuania. The Mean Absolute Percentage Error (MAPE) values varied between 15-20% for daily predictions and between 8-15% for weekly predictions. Suder et al. [2007] studied the effectiveness of forecasting methods for ATM withdrawals under different market conditions. Suder et al. [2007] also evaluated whether the pandemic-driven market conditions had an impact on the models for predicting withdrawals, similarly to what Fallahfati et al. [2022] explored in their study. The data was gathered from one of the largest ATM operators in Poland, covering a total of 61 ATMs with daily withdrawal data over a period of 4 years. A combination of statistical and machine learning methods was used for forecasting ATM cash withdrawals. The XGBoost technique outperformed all the methods tested, producing a combined mean SMAPE of 37.92% across the pre- and post-COVID-19 periods.
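Alongside SMAPE, the studies above report MAPE and RMSE. Both are straightforward to compute; a minimal sketch (function names and data are ours, and MAPE as written assumes no zero actual values):

```python
import math

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (undefined if any actual is 0)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Squared Error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

demand   = [120.0, 150.0, 90.0]
forecast = [126.0, 144.0, 99.0]
print(round(mape(demand, forecast), 2))   # errors of 5%, 4%, 10% → 6.33
print(round(rmse(demand, forecast), 2))   # → 7.14
```

Unlike SMAPE, MAPE is asymmetric and undefined when the actual value is zero, which matters for low-traffic ATMs with zero-withdrawal days; RMSE penalizes large errors more heavily and is reported in the original units.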
ATM demand forecasting focuses on the daily demand of each ATM, and not on the denominations dispensed [du Toit 2011]. A total of 18 ATMs were used to conduct the study on forecasting the demand for denominations. A client normally withdraws an amount at an ATM and receives different denominations that make up the amount. du Toit [2011] focused on forecasting the aggregated denominations instead of forecasting the demand for each denomination. The ATM data was in the form of a time series and exhibited multiple seasonal patterns of different lengths. The empirical daily dispensing data was collected over a period of 3 years, from January 2007 to November 2009. du Toit [2011] employed a variety of forecasting techniques, namely regression, decomposition, exponential smoothing, and Box-Jenkins methods. du Toit [2011] used the Mean Absolute Percentage Error (MAPE), and the results for each technique were tabulated for comparison. The author developed models for daily, weekly, and monthly forecasts. The daily forecasts saw an improvement of 21.67%, with the MAPE being 58.01%. The new weekly forecast had a MAPE of 10.55%, an improvement of 18.91% over the previous forecasting technique. Lastly, the MAPE for the monthly forecasts was 3.17%, an improvement of 42.36%. Keyter [2004] used regression analysis to forecast transaction volumes on ATM devices for a South African bank. A variety of variables were taken into account before regression was performed, because these variables influence the transaction volumes differently. A final sample of 86 ATM sites was taken from a population of 1800 ATM sites. Different combinations drawn from a total of 18 features were tested until the combination producing the highest R² was found. ATMs can be located at different types of sites, namely shopping centers, industrial areas, garages, convenience stores, and business nodes.
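The R² statistic that Keyter [2004] maximized over feature combinations measures the fraction of variance in transaction volumes explained by the regression. A minimal sketch (the function name and data are ours):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus the residual sum of squares
    over the total sum of squares about the mean of the actual values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

print(round(r_squared([10, 20, 30], [12, 20, 28]), 2))  # → 0.96
```

An R² of 0.85, as reported for the shopping-center sites, means the chosen feature combination explained 85% of the variance in transaction volumes.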
The results for the shopping centers used 3 features, and the average residual was 29% with R² = 0.85.

Table 2.1: A table comparing different forecasting techniques that were used by other authors.

Author | Data | Models | Accuracy
[Kamini et al. 2014] | 111 time series data | GRNN | SMAPE = 14.713
[Bandara 2017] | 111 time series data | LSTM.Cluster | SMAPE = 21.6
[Andrawis et al. 2011] | 111 time series data | Combined (all 9 models) | SMAPE = 18.95
[Rafi et al. 2020] | Transaction data | VARMAX | SMAPE = 31.4
[Keyter 2004] | Data for 86 ATMs | Regression analysis | R² = 0.85
[Rajwani et al. 2018] | Transaction data | LSTM | MSE = 0.028
[Simutus et al. 2007a] | Transaction data | ANN | MAPE = 20.6
[Simutus et al. 2007a] | Transaction data | SVR | MAPE = 25.1
[du Toit 2011] | Denomination data | Fuzzy ARIMA | MAPE = 3.17
[Jörg 2011] | 111 time series data | Hybrid models | SMAPE = 19.38
[Coyle et al. 2010] | 111 time series data | SOFNN | SMAPE = 20.4
[Bhandari and Gill 2016] | 111 time series data | Hybrid ANN | MAPE = 1.85
[Arabani and Komleh 2018] | Transaction data | CNN | RMSE = 4.74
[Fallahfati et al. 2022] | Transaction data | Regression RF | MSE = 0.0101
[Ramírez and Acuña 2011] | Transaction data | MLP | SMAPE = 21.37
[Venkatesh et al. 2013] | Transaction data | GRNN | SMAPE = 23.16
[Sarveswararao et al. 2022] | Transaction data | DL LSTM | SMAPE = 24.17
[Simutus et al. 2007b] | Transaction data | ANN | MAPE = 14-20
[Suder et al. 2007] | Transaction data | XGBoost | SMAPE = 37.92

2.4 Optimizing Cash Flow and Resource Allocation

Optimizing cash flow and resource allocation in ATM networks is essential because banks must balance the need to minimize costly operations such as cash transportation, insurance, and cash freezing costs with the imperative of ensuring that ATMs remain adequately stocked to prevent service disruptions and customer dissatisfaction.
This presents a multi-objective optimization problem (MOP), which can be defined as an optimization problem with multiple objectives measured with different performance functions, usually in conflict with each other, together with a set of restrictions [Rodriguez and Lozano 2008]. Li et al. [2020] proposed an end-to-end framework for solving multi-objective optimization problems using deep reinforcement learning (DRL), termed DRL-MOA. The MOP is decomposed into scalar optimization sub-problems, and each sub-problem is then modeled as a neural network. The Pareto-optimal solutions were obtained through the trained neural network models. The well-known multi-objective traveling salesman problem was solved using DRL-MOA by modeling the sub-problem as a Pointer Network. Armenise et al. [2012] investigated the optimization of ATM cash using a genetic algorithm in order to produce optimal load strategies. A pool of 30 ATMs was used as a test bench, with the features being location, position, cash capacity, and usage. Similarly to how we want to reduce financial costs, the paper focused on reducing the costs of unused stock in ATMs. The implemented algorithm made use of JENES, an open-source Java library for genetic algorithms. The standard deviation, average, and p-value were used as evaluation functions for the various combinations, and the results were tabulated for each configuration in the study. Cisternas-Vera et al. [2016] looked at the impact of mobile banking on reshaping the bank branch network for a US-based bank. Taking geo-coded transaction data, a dynamic structural model was developed that represented consumers' preferences for online and physical channels. The model was used to optimize the branch network in terms of capacity, amenities, location, and number of branches.
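The Pareto-optimal solutions that DRL-MOA and the other multi-objective approaches search for are simply the non-dominated trade-offs between conflicting objectives. A minimal sketch of the dominance test (the function names and the cost pairs are hypothetical):

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b (minimization): a is no worse
    on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated objective vectors."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical (transport cost, idle-cash cost) pairs for candidate strategies.
candidates = [(4, 9), (5, 5), (7, 3), (6, 6), (8, 8)]
print(pareto_front(candidates))  # → [(4, 9), (5, 5), (7, 3)]
```

Here (6, 6) and (8, 8) are dominated by (5, 5), while the remaining three points are incomparable trade-offs; a MOP solver returns such a front rather than a single optimum.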
The data comprised a sample of more than 500 000 accounts with more than 1.7 billion transactions across all channels from June 2007 to June 2015. A combination of discrete choice dynamic programming and linear programming was used to develop the model. The authors in the above studies solved the optimization problem using various techniques. Bilir and Doseyen [2018] focused on the optimization of ATM and branch cash operations using an integrated cash requirement forecasting and cash optimization model. The aim of the study was to optimize the cash supply chain for ATMs and branches of a mid-sized bank in Turkey. In optimizing the cash supply chain, the main objective of the model was to minimize idle cash levels without decreasing customer satisfaction. The optimization phase of the study formulated an optimal transfer schedule for branches and ATMs by making use of an integer formulation. The aim of the optimization model was to minimize total cost, which had two parts: transfer costs and the penalty fee for maintaining idle cash. Broda et al. [2014] presented an approach for optimizing fill-in strategies for ATMs for a bank in Serbia. Credit Agricole Bank aimed to minimize costs, listed as cash freezing costs, transportation costs, and insurance costs, while ensuring customer service was not impacted. The data used were daily transactions for a total of 40 ATMs in the region of Novi Sad. The authors' approach was to first model daily cash demand using a combination of time series analysis, linear regression, and the compound Poisson process. The authors then optimized the ATM filling strategies using a simple algorithm that performs an exhaustive search over the feasible solution space. Ekinci et al. [2014] attempted to optimize ATM cash replenishment strategies by utilizing group-demand forecasts. The data for a total of 152 ATMs in Istanbul was used, covering a period of 2 years and 6 months.
Based on the longitude and latitude of each ATM, the ATMs were grouped via an equation which generated a total of 27 groups. Cash withdrawals were averaged daily for each group, and forecasting techniques were then applied to each group. An optimal time between two replenishments was found by using an objective function with constraints. Nigmatuli and Chaganva [2022] created a cash management system for ATMs and information payment terminals. The data was sourced from Kaggle, which is open source, and represented cash withdrawals over a period of 6 years. The study was conducted on a total of 3 ATMs in Bangalore, India. A combination of forecasting models was used to obtain a sufficiently good level of correct forecasts: the triple exponential model, the singular spectrum analysis model, and a recurrent neural network with Long Short-Term Memory. A 2-week forecast horizon was used for each ATM, and discrete optimal control was used for optimization. Similarly, Lazaro et al. [2017] attempted to improve cash logistics for bank branches in their case study, using an approach similar to that of Nigmatuli and Chaganva [2022]. Lazaro et al. [2017] attempted to optimize cash logistics by first forecasting transactions using machine learning and then applying a robust optimization model. The data was sourced from a national bank for a period of 4 years. The generated cash demands were then fed into a defined integer programming problem for optimization. Dilijonas et al. studied the problems of management and optimization of ATM network systems. A flexible artificial neural network approach was used for cash demand forecasting. Each machine was assigned a separate 3-layer feed-forward neural network, and these were trained using the Levenberg-Marquardt optimization method. A combination of multi-agent technologies and ANNs was used for optimization.
The authors found that a combination of the technologies gave an advantage for managing cash optimization dynamically in complex systems. The model resulted in a 20-30% average reduction in cash. Serengil and Ozpinar [2019], much like Nigmatuli and Chaganva [2022] and Lazaro et al. [2017], utilized cash flow prediction in combination with an optimization technique. The data was sourced from a Turkish bank for a total of 6500 ATMs and spanned a period of years starting from 2013. An artificial neural network with a total of 15 features was used to forecast the next day's demands. The cash demand forecasts from the neural network model were then used to compute the optimal replenishment strategy via a cost function based on linear optimization. Thanh et al. [2023] viewed the ATM cash replenishment problem as a multi-objective logistics optimization problem. The authors extended the Vehicle Routing Problem to formulate a Multi-objective Optimization Problem (MOP). The authors stated the problem mathematically as a complete directed graph with vertices. Constraints were then applied to the model, and these needed to be satisfied in order to arrive at an optimal load strategy. The OR-Tools package was used for optimization by initializing variables based on the constraints of the MOP. The testing and evaluation of the model was done using data from a bank in Hanoi, Vietnam. Van Anholt and Vis [2010] proposed a novel dynamic approach that factors in both ATM cash demand forecasting and ATM replenishment in an integrated way. The authors proposed decomposition for the forecasting component of the integrative approach; in this case, decomposition was a modified linear regression technique, and based on the demonstrated results, the decomposition outperformed the artificial neural network approach. Ghodrati et al. [2013] studied the ATM cash management problem by utilizing a genetic algorithm.
They sourced their data from an Iranian bank, covering a period of 2 years for 164 ATMs. The ATMs were clustered into 3 groups: high, medium, and low. A genetic simulation optimization algorithm was implemented as a solution. Chotayakul et al. [2013] formulated the cash inventory problem for ATMs as a Mixed Integer Program (MIP). They developed it into the path formulation in deriving a near-optimal solution to the problem. The ATM network studied came from a bank based in Thailand. Bati and Gozupek [2017] devised an integer linear program (ILP) that was able to jointly optimize routing and cash management for a new generation of ATMs. Their aim was to minimize costs such as idle cash costs and costs associated with logistics. A polynomial-time heuristic algorithm was implemented using CPLEX. Kurdel and Sebestyenova [2013] viewed the routing optimization problem as a dynamic vehicle routing problem (VRP), which is similar to what was explored by Thanh et al. [2023]. Kurdel and Sebestyenova [2013] solved the formulated problem using a parallel genetic algorithm. Kiyaei and Kiaee [2021] studied ATM cash replenishment planning by making use of a deep Q-network for optimization. This model used a combination of deep learning and Q-learning. In their study, Kiyaei and Kiaee [2021] devised two components for solving the problem: the first component learns the dynamics of the cash demand, while the second uses Q-learning to learn the action-value function. Mubiru and Nalubowa Ssempijja [2024] attempted to model and optimize cash load strategies for ATMs under stochastic demand. A total of ten ATMs at commercial hubs in the city of Kampala were used for the study. Each ATM's data was based on hourly transactions over a period of 4 months. The authors developed an optimization model which used the Markov decision approach for cash loading strategies. Li et al.
[2009] tackled a unique problem of selecting the best site for a new ATM by using the Particle Swarm Optimization algorithm. Their study focused on a bank based in Tianjin, China, and they surveyed a total of 50 ATMs. In their preliminary results, Li et al. [2009] demonstrated that they can address the site selection problem while at the same time forecasting transactions for the same site.

Table 2.2: A table comparing different optimization techniques that were used by other authors.

Author | Data | Models
[Armenise et al. 2012] | Transaction data | Genetic algorithm
[Cisternas-Vera et al. 2016] | Transaction data | Linear and dynamic programming
[Bilir and Doseyen 2018] | Transaction data | Integrated model
[Broda et al. 2014] | Transaction data | Exhaustive search
[Ekinci et al. 2014] | Transaction data | Linear programming model
[Nigmatuli and Chaganva 2022] | Transaction data | Optimal control
[Lazaro et al. 2017] | Transaction data | Integer programming
[Dilijonas et al.] | Transaction data | Multi-agent technologies
[Serengil and Ozpinar 2019] | Transaction data | Cost function and linear optimization
[Thanh et al. 2023] | Transaction data | Pareto-based multi-objective learning algorithms
[Van Anholt and Vis 2010] | Transaction data | Decomposition model
[Ghodrati et al. 2013] | Transaction data | Genetic algorithm
[Chotayakul et al. 2013] | Transaction data | Mixed Integer Program (MIP)
[Bati and Gozupek 2017] | Transaction data | Heuristic CPLEX
[Kurdel and Sebestyenova 2013] | Transaction data | Parallel genetic algorithm
[Kiyaei and Kiaee 2021] | Transaction data | Q-learning
[Mubiru and Nalubowa Ssempijja 2024] | Transaction data | Markov decision approach
[Li et al. 2009] | Transaction data | Particle Swarm Optimization

2.5 Related Work

2.5.1 Mapping Influence in Complex Systems

Understanding influence relationships in probabilistic graphical models is essential for analyzing the behavior of interconnected systems.
Influence modeling within Bayesian networks has been applied across domains to capture dependencies and quantify structural changes. Research in this area primarily employs structure learning techniques, including score-based and constraint-based methods, to recover underlying influence patterns from data. These approaches have been used in diverse applications, such as modeling stochastic processes, evaluating resilience in complex networks, and optimizing decision-making frameworks. By leveraging probabilistic reasoning, influence modeling provides valuable insights into how system dynamics evolve under various conditions, including node removal and disruptions. Influence diagrams and belief networks are defined as directed graphical models for representing models of probabilistic reasoning and decision making [Shachter 2007]. Important relationships can be captured easily with the graphical representation of influence diagrams and belief networks. Influence diagrams are better suited to capture the conceptual relationship between inference and decision making [Biedermann and Taroni 2023]. Influence networks are defined as Bayesian networks whose probabilities are approximated by expert-provided influence constants [Zaidi et al. 2010]. Influence relations in graphical models refer to the dependencies or causal connections between variables represented within a graph [Koller and Friedman 2009]. Ajoodha [2018] attempted to recover the influence relations by using a complete algorithm for a collection of partially observable multi-dimensional stochastic processes. Score-based structure learning was used to identify the optimal structure by applying a scoring metric to evaluate structural configurations in relation to the data. Ajoodha [2018] expands on score-based structure learning concepts for tracking influence among stochastic processes.
Firstly, Ajoodha [2018] decomposed the distributions represented by stochastic processes into a series of temporal models. Then, an assembly and a scoring function were defined to assess the quality of potential influence networks. This became a combinatorial optimization problem, which required navigating the search space to find the optimal influence network; the best-fitting structure for the training data was then returned. Ajoodha and Rosman [2020] studied the influence between stochastic processes that are represented by hidden Markov models (HMMs). A structure learning approach was also used in this study, and the candidate influence structures were dynamic influence networks (DINs), which model the relationships between the stochastic processes. In a different study, Ajoodha and Rosman [2021] used constraint-based structure learning to recover the direct influence between hierarchical dynamic Bayesian networks. This study still used structure learning, but with the focus on hierarchical dynamic Bayesian networks. The learning tasks were compared using KL-divergence. In another study, Ajoodha and Rosman [2017] again made use of score-based structure learning in an attempt to recover the influence structure, but in this case naïve Bayes models were used instead of hidden Markov models or hierarchical dynamic Bayesian networks. A total of four parameter or structure learning tasks were explored to recover the ground truth distribution. Mohammadi and Wit [2014] concentrate on Bayesian structure learning in Gaussian graphical models, addressing both decomposable and non-decomposable cases. They introduced a new Bayesian framework for determining Gaussian graphical models and developed a birth-death Markov chain Monte Carlo (BDMCMC) algorithm to carry out both structure learning and parameter learning. To evaluate the performance of the BDMCMC algorithm, the KL-divergence was used on the precision matrix. Chen et al.
[2020] introduced a Bayesian network-based approach to model the nitrogen utilization efficiency of dairy cows. Historically, previous authors utilized statistical models to predict manure N excretion. Chen et al. [2020] used a dataset from the digestibility studies of the Agri-Food and Biosciences Institute in Northern Ireland. The Growth-Search constraint-based structure learning approach was used to recover the influence between the variables. The Bayesian network model was able to capture the relationships between the variables, and it also established a causal influence model. In another study, Zheng et al. [2016] used Bayesian networks to model methane emissions from milking dairy cows. Previously, different regression-based models were applied to examine the findings, which is similar to how Chen et al. [2020] extended from statistical models to Bayesian networks. The dataset was also collected from the Agri-Food and Biosciences Institute in Northern Ireland. The evaluation took place on a dataset of 934 milking dairy cows. Structure learning was done using Hill-Climbing, a score-based approach. Matejová et al. [2023] used influence modeling to study the factors associated with atherosclerosis. Cardiovascular diseases have many factors that are associated with atherosclerosis [Matejová et al. 2023]. A combination of statistical methods and different machine learning techniques was utilized for the study. The Bayesian networks in this instance were used to study the relationships between the different factors. Tang et al. [2020] evaluated the resilience of urban transportation systems by utilizing a Bayesian network model. The study focused on investigating the resilience of road transportation for 4 cities in China: Beijing, Tianjin, Shanghai, and Chongqing, for the period 1998 to 2017. A hierarchical Bayesian network model was proposed to evaluate the resilience of these systems.
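Several of the studies above (e.g. Ajoodha and Rosman [2021]; Mohammadi and Wit [2014]) compare learned and reference distributions using KL-divergence, the same measure used later in this dissertation. For discrete distributions it can be sketched as follows (the example distributions and the function name are hypothetical):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) for discrete distributions given as
    probability lists; measures, in nats, the loss from using Q in
    place of the reference distribution P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]          # distribution before node removal (hypothetical)
q = [0.8, 0.2]          # distribution after removal and re-learning
print(round(kl_divergence(p, q), 4))  # → 0.2231
```

A divergence of zero indicates identical distributions; larger values indicate that the learned influence structure has drifted further from the reference. Note that KL-divergence is asymmetric, so D(P || Q) generally differs from D(Q || P).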
To test the proposed model, Tang et al. [2020] used backward inference, forward inference, and sensitivity analysis, and the results demonstrated reasonable values. Hosseini et al. [2016] assessed system resilience by also utilizing Bayesian networks, similar to the study done by Tang et al. [2020]. Hosseini et al. [2016] focused on a case study of a sulfuric acid manufacturing plant that had 2 suppliers of sulfur. The developed Bayesian network model was able to capture various disruptions ranging from power outages and technical errors to hurricanes. Hosseini et al. [2016] validated their model by making use of sensitivity analysis, the same approach that was used by Tang et al. [2020] for evaluating their model. Belief propagation was performed for inference purposes. Jaradat et al. [2020] modeled the inter-dependencies between critical features by making use of a Bayesian network. Their case study focused on an inland waterway port and the surrounding supply chain network. A Bayesian network captured the interdependency between the factors. The results were probed by making use of belief propagation and sensitivity analysis. Hossain et al. [2019] conducted another study, for an interdependent electrical infrastructure system. This study also used a Bayesian framework to assess system resilience, as in Jaradat et al. [2020]. The framework addressed a range of possible disruptions and proposed solutions to mitigate their consequences. A case study of the Washington, DC electrical infrastructure system was used. To evaluate resilience, different measures were used: sensitivity analysis, forward propagation, information theory, and backward propagation. Liu et al. [2020] presented a robust dynamic Bayesian network method for supply chain disruption risk estimation. The authors proposed a new and general nonconvex programming model, and this was compared with the classic DBN approach on a case study. Kammouh et al.
[2020] evaluated the resilience of engineering systems by utilizing a probabilistic framework. The authors developed two approaches that could handle static and dynamic systems. A Bayesian Network (BN) was used for static engineering systems and Dynamic Bayesian Networks (DBN) were used for dynamic systems. The paper applied the methodology to two illustrative systems, one evaluating resilience for a transportation system and the other a case study for Brazil.

Table 2.3: A table comparing different influence modeling techniques that were used by other authors.

[Ajoodha 2018]: Score-based structure learning
[Ajoodha and Rosman 2020]: Hidden Markov Models
[Ajoodha and Rosman 2021]: Constraint-based structure learning (hierarchical dynamic Bayesian networks)
[Ajoodha and Rosman 2017]: Score-based structure learning
[Mohammadi and Wit 2014]: Birth-death Markov chain Monte Carlo (BDMCMC)
[Chen et al. 2020]: Growth-Search
[Zheng et al. 2016]: Bayesian modeling
[Matejová et al. 2023]: Bayesian modeling
[Tang et al. 2020]: Bayesian modeling
[Hosseini et al. 2016]: Bayesian modeling
[Jaradat et al. 2020]: Bayesian modeling
[Hossain et al. 2019]: Bayesian modeling
[Liu et al. 2020]: Dynamic Bayesian network
[Kammouh et al. 2020]: Bayesian network and dynamic Bayesian network

2.5.2 Conclusion

The related work explored in this chapter spans multiple domains, including forecasting ATM transaction volumes, optimizing cash flow and resource allocation, and mapping influence in complex systems. Each of these areas provides crucial insights that contribute to understanding ATM networks as probabilistic graphical models. Forecasting techniques have demonstrated various approaches to predicting cash demand and transaction volumes, offering methods that aid in resource planning and operational efficiency.
Time-series models, machine learning algorithms, and hybrid forecasting techniques have been extensively studied to improve accuracy and adaptability under varying market conditions. These methodologies set a foundation for understanding transaction patterns, which is essential when analyzing network resilience following disruptions. Optimization studies have focused on the logistical and financial challenges of ATM networks, emphasizing how resource allocation strategies can minimize operational costs while maintaining service availability. Various optimization techniques, including mixed-integer programming, heuristic algorithms, and simulation-based approaches, have been employed to enhance cash replenishment scheduling and ATM placement decisions. These strategies highlight the significance of balancing cash flow efficiency with customer demand to ensure uninterrupted service. The discussion on influence modeling extends these considerations by examining how interdependencies between network nodes impact overall system behavior. Studies in this area provide a framework for understanding the propagation of disruptions across networks and how probabilistic graphical models, such as Bayesian networks, can be employed to analyze these interactions. Research on network resilience, influence maximization, and cascading failures in complex systems further contextualizes the importance of modeling ATM networks as interconnected structures subject to perturbations. By integrating these themes, this chapter establishes a comprehensive foundation for the research problem addressed in this dissertation. The intersection of forecasting, optimization, and influence modeling provides a structured approach to investigating how ATM networks respond to node removal. The insights gained from prior work inform the methodology adopted in subsequent chapters, where the impact of node removal on Bayesian networks is analyzed.
This study contributes to the broader research landscape by applying Bayesian learning techniques to quantify structural and probabilistic changes in ATM networks, bridging the gap between theoretical modeling and real-world financial systems.

Chapter 3

Methodology

This chapter outlines the methodological approach for analyzing the impact of node removal on Bayesian networks representing ATM networks. Our study aims to develop a comprehensive framework for understanding how the removal of nodes, which represent ATMs, affects the overall network structure and its probabilistic dependencies. We begin by formulating the problem mathematically, modeling the ATM network as a directed acyclic graph (DAG) with conditional probability distributions (CPDs) assigned to each node. This formulation enables us to capture the complex interactions and dependencies between transaction volumes at different ATMs. We then present the research objectives, accompanied by a detailed experimental design to investigate the effects of node removal on the joint probability distribution and the recovered network structure. Our methodology incorporates structure learning and parameter estimation techniques to recover the network post-removal, and Kullback-Leibler (KL) divergence to quantify deviations from the original distribution. By modeling these dynamics, we aim to provide insights into the resilience and stability of ATM networks under varying disruption scenarios. This approach contributes to the broader understanding of influence relationships in probabilistic graphical models, particularly in applications requiring robust and adaptive network architectures.

3.1 Problem Formulation

Many problems in the field of science can be modeled as stochastic processes. Stochastic processes are used to describe the evolution of random variables over time or space.
Stochastic processes can be applied to model and analyze random phenomena, giving a clear understanding of how complex systems interact under uncertainty. We can learn the influence between these variables if we know that they interact with each other. Suppose we have a set of variables X = {X1, ..., Xn} and we have a joint distribution PG(X ∣ Y) that represents the relationship between them. A Bayesian network that represents our problem as a tuple B = (G, Θ) can be generated, where G is a directed acyclic graph (DAG) whose nodes represent the set of variables X = {X1, ..., Xn} and Θ represents the parameters of the probability distributions associated with each node in the graph. These parameters define how likely certain values are for a given node, conditioned on its parents in the graph. Suppose that a subset of the variables in this set X = {X1, ..., Xn} had to be mutilated; how would this affect the joint probability distribution PG(X ∣ Y) that represented the relationship between these variables? Firstly, we will attempt to learn the parameters of a joint probability distribution which describes the relationship between the variables X = {X1, ..., Xn}. We will then attempt to learn the parameters of a probability distribution which describes the relationship between the variables X = {X1, ..., Xn} when a subset of these variables is removed. A probabilistic graphical model can be used to represent the dependencies between different variables; with this graphical model we can then perform inference and learning for knowledge discovery.

3.2 Research Questions and Experimental Setup

Based on our problem formulation, we looked at three research questions:

Research Question 1: What is the impact on the KL divergence when nodes are removed from a recovered Bayesian network?

To address this question, we systematically mutilated Bayesian networks by removing nodes and the edges connected to them.
For each modified network, we performed structure learning using a score-based approach, estimated the CPDs with Bayesian estimation, and compared the recovered network to the original ground truth distribution. The KL divergence served as the primary metric for quantifying the deviation between the original and recovered distributions. By experimenting with varying percentages of node removal and analyzing the results, we identified patterns and thresholds where network recovery begins to degrade significantly.

Research Question 2: What are the implications of varying structural complexity on the recovery of Bayesian networks and their KL divergence?

We explored this question by generating synthetic Bayesian networks with distinct structural configurations, including tree-like, sparse, and dense structures. For each structure type, we evaluated the performance of structure learning algorithms under different sample sizes and levels of node removal. By comparing the KL divergence across these configurations, we assessed how structural complexity influences the network’s recoverability. This analysis highlighted the trade-offs between complexity and robustness, providing insights into the resilience of different network designs under disruptions.

Research Question 3: How can insights from Bayesian networks improve the operational efficiency and planning of ATM networks?

To tackle this question, we modeled ATM networks as Bayesian networks, where nodes represented individual ATMs, and edges captured dependencies between their transaction volumes. Using historical transaction data, we trained and validated these models, then simulated disruptions (e.g., ATM outages) by removing nodes. By analyzing the resultant changes in network influence and identifying critical nodes or clusters, we derived actionable insights for improving operational efficiency.
This includes optimizing cash replenishment schedules, predicting the impact of outages, and planning the deployment of new ATMs to enhance network resilience and customer satisfaction.

3.2.1 Experimental Setup

In this study, we aimed to investigate the effect of removing a subset of variables from the set X = {X1, ..., Xn} on the joint probability distribution of the graph, which can be represented as PG(X ∣ Y). We modeled the relationship between the variables using a Bayesian network as a tuple B = (G, Θ), where G is a directed acyclic graph (DAG) whose nodes represent the set of variables X = {X1, ..., Xn}. We generated a random ground truth distribution for this Bayesian network by creating a Bayesian structure and assigning conditional probability distributions (CPDs) to each node in the network. The structure of the Bayesian network was generated by randomly specifying the number of nodes and randomly connecting them with edges to form a directed acyclic graph (DAG). Each node in the network had a CPD assigned based on its parents in the graph. We generated the CPDs randomly using a distribution that best described the nature of our variables and met the assumptions of our model. We considered a variety of distributions, including the Bernoulli, Gaussian and Dirichlet distributions, and settled on the Bernoulli distribution because it best met the assumptions of our model and described the nature of our chosen variables. Once the structure of the Bayesian network and the CPDs were defined, we generated random samples from the Bayesian network to create a dataset. Suppose that the learned probability distribution from structure learning of our generated Bayesian network is represented by PG(X ∣ Y), and suppose that we remove a node or a subset of nodes from the graph and learn the new probability distribution as PĜ(X ∣ Y).
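The generation pipeline described above (random DAG construction, random Bernoulli CPDs, then sampling) can be sketched in plain Python. This is an illustrative sketch under simplified assumptions, not the exact implementation used in this study; the node count and parameters are invented for demonstration:

```python
import random

def random_dag(n_nodes, edge_prob, rng):
    """Random DAG: only allow edges i -> j with i < j, which guarantees acyclicity.
    The DAG is stored as a parent list: node -> list of parent nodes."""
    return {j: [i for i in range(j) if rng.random() < edge_prob]
            for j in range(n_nodes)}

def all_parent_configs(k):
    """All 2^k binary configurations of k parents."""
    return [tuple((i >> b) & 1 for b in range(k)) for i in range(2 ** k)]

def random_bernoulli_cpds(dag, rng):
    """One Bernoulli parameter P(node = 1 | parent config) per parent configuration."""
    return {node: {cfg: rng.random() for cfg in all_parent_configs(len(parents))}
            for node, parents in dag.items()}

def forward_sample(dag, cpds, rng):
    """Sample nodes in topological order (0..n-1 by construction)."""
    sample = {}
    for node in sorted(dag):
        cfg = tuple(sample[p] for p in dag[node])
        sample[node] = 1 if rng.random() < cpds[node][cfg] else 0
    return sample

rng = random.Random(0)
dag = random_dag(6, 0.3, rng)       # small toy network; the study used 50-100 nodes
cpds = random_bernoulli_cpds(dag, rng)
dataset = [forward_sample(dag, cpds, rng) for _ in range(1000)]
```

Restricting edges to run from lower-numbered to higher-numbered nodes guarantees acyclicity, and sampling nodes in that order is exactly forward (ancestral) sampling.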
We can make use of our generated dataset from the random samples to perform probabilistic queries that will help us gauge how removing a node or a subset of nodes influences other nodes in the network. We chose Bernoulli conditional probability distributions (CPDs) for each node because, after quantizing ATM transaction volumes into binary states (e.g., “high-volume” vs. “low-volume” activity), the presence or absence of significant transaction patterns is most naturally modeled as a two-state random variable. Bernoulli CPDs thereby allow us to capture essential fluctuations in ATM usage while maintaining tractable inference: exact likelihood calculations and structure-learning algorithms (hill-climbing, tree search) remain computationally efficient even for networks of hundreds of nodes. Furthermore, focusing on binary CPDs simplifies the interpretation of KL-divergence results (any change in node behavior manifests directly as shifts in binary probabilities) without confounding effects from multi-state parameterizations. We acknowledge that real ATM transaction data often exhibit richer, continuous or count-based dynamics; in future extensions, Gaussian or Poisson CPDs could be incorporated once sufficient real-world data is available, but the Bernoulli framework provides a robust proof of concept for analyzing network resilience under node removal.

3.3 Application Domain

Each variable can be represented as a node in a graphical model, with the associated parameters attached to each node. Having represented the problem as a probabilistic graphical modeling problem, we will first represent the problem statement as a graphical model, then generate random Bayesian networks for the true distribution, and finally show how inference can be done on the model for knowledge discovery.

3.3.1 Modeling Influence Relationships

Suppose we have this set of variables X = {X1, ..., Xn} and we have a joint distribution that represents the relationship between them.
We can formally define a Bayesian network that represents our problem as a tuple B = (G, Θ) where:

• G is a directed acyclic graph (DAG) whose nodes represent the set of variables {X1, ..., Xn} from above.
• Θ is a set of parameters that specify the conditional probabilities θ_{Xi ∣ Pa(Xi)} for each node Xi given its parents Pa(Xi) in G.

Figure 3.1: An influence structure between stochastic processes. The red arrows indicate the influence between these processes while the solid arrows indicate the parameters of each node.

Let P1 in Figure 3.1 represent a node in the network, which is made up of n different nodes. If we assume that the future is conditionally independent of the past given the present, the system maintains the Markov property, allowing us to make the Markov assumption. If the structure and the parameters of the system do not change under any transformations, the system maintains the invariance property, allowing us to make the stationarity assumption. Our ground truth distribution was generated by following the process described in the experimental setup section. We first defined the graph structures, choosing between 50 and 100 variables as the nodes of our graph structure, a directed acyclic graph (DAG) as represented by Figure 3.1. We explored tree-like, sparse and dense structures for our ground truth distributions. We generated random Bernoulli distributions that best described our graph in Figure 3.1. These distributions specified the conditional probability distribution (CPD) of each node in the graphical structure. These then became our random structures with random parameters that described our ground truth distribution. Once we had the structure of the Bayesian network and the CPDs defined, we used forward sampling to generate random samples from the Bayesian network and created a dataset that we used for inference and learning. For this study, we explored the following.
• We explored the following structures: tree-like, dense and sparse structures.
• We explored the Bernoulli distribution for our Conditional Probability Distributions.

3.4 Mutilating the Network

In the experimental section, we mentioned that we have a set of variables X = {X1, ..., Xn} and a joint distribution PG(X ∣ Y) that represents the relationship between them, and that, if we remove a variable or a subset of variables, we learn a new joint probability distribution that can be represented as PĜ(X ∣ Y). We want to gauge how removing a variable or a subset of variables influences other nodes in the network. We will remove a random subset of the variables, starting with 4 variables and incrementing this number randomly, as we evaluate how much the distribution deviates from the original distribution. When we remove variables, we will also remove the edges that lead up to those variables, which are represented as nodes in the graphical structure. We will measure this deviation by making use of the Kullback-Leibler (KL) divergence, which measures how one probability distribution diverges from a second, expected probability distribution.

3.4.1 Learning

Now that we have mutilated our graphical model, we will need to learn the new structure and parameters from the generated data. There are two main types of learning in Probabilistic Graphical Models (PGMs), namely parameter estimation and structure learning. We will first learn the graphical structure of our data from the mutilated network and then proceed with estimating its parameters for our ground truth distribution. Secondly, we will learn the optimal network structure that best represents the dependencies between our variables, which in this case will be ATMs. Learning in PGMs is the cornerstone for constructing models that accurately capture the underlying probabilistic relationships in the data.
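The mutilation operation described in Section 3.4, removing a subset of nodes together with every edge incident to them, can be sketched as follows (plain Python, with the DAG stored as a parent list; the toy network is invented for illustration and is not the exact implementation used in this study):

```python
def mutilate(dag, removed):
    """Remove a set of nodes and every edge incident to them.
    The DAG maps each node to the list of its parents."""
    removed = set(removed)
    return {node: [p for p in parents if p not in removed]
            for node, parents in dag.items()
            if node not in removed}

# Toy network: A -> B, A -> C, B -> C, C -> D
dag = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["C"]}
print(mutilate(dag, {"B"}))   # {'A': [], 'C': ['A'], 'D': ['C']}
```

Removing "B" deletes both the node and its incident edges, so "C" keeps only "A" as a parent; structure learning and parameter estimation are then re-run on the reduced variable set.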
This allows for the creation of models that can make accurate predictions and inferences based on the available data.

3.4.1.1 Parameter Estimation

Parameter estimation refers to the process of determining the parameters of the conditional probability distributions (CPDs) associated with each node in a graph [Bishop 2006]. These parameters specify the probabilistic relationships between the variables, and they are crucial for making predictions and performing inference. If the structure of the PGM is known, as it is in our case, we can use Maximum Likelihood Estimation (MLE) or Bayesian estimation for parameter estimation. Two assumptions need to be met before we can perform parameter estimation:

• D = {ξ[1], ..., ξ[M]} is sampled from P∗, with P∗ being the true or target probability distribution of the variables in the model.
• Instances are independent and identically distributed (IID).

Maximum Likelihood Estimation (MLE), as stated, is one of the methods that can be used to estimate the parameters of a model based on observed data. The MLE estimate for the parameters is obtained by maximizing the likelihood function, which is the probability of observing the given data under the model. For the dataset D = {x^(1), x^(2), ..., x^(N)}, where each x^(i) is a set of observed values for the variables in the model, the likelihood function is given by:

L(Θ ∣ D) = ∏_{i=1}^{N} P(x^(i) ∣ Θ)

where Θ represents the parameters of the model. The MLE estimate for Θ is obtained by maximizing this likelihood function:

Θ̂_MLE = argmax_Θ L(Θ ∣ D)

Often, the log-likelihood function is more convenient to maximize, and it is expressed as follows:

Θ̂_MLE = argmax_Θ log L(Θ ∣ D)

Bayesian estimation is an alternative to MLE; it estimates the parameters by incorporating prior beliefs about them. Bayesian estimation combines the likelihood of the data with a prior distribution over the parameters to compute a posterior distribution.
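For a single Bernoulli variable, the two estimators above reduce to simple counting; a minimal sketch in plain Python (the data and the prior pseudo-counts are invented for illustration):

```python
# Hypothetical binary observations of a single Bernoulli variable
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
M = len(data)
ones = sum(data)

# MLE: maximizing the (log-)likelihood gives the relative frequency
theta_mle = ones / M                       # 7/10 = 0.7

# Bayesian estimation with a Beta(a, b) prior: the posterior is
# Beta(a + ones, b + zeros), whose mean pulls the estimate toward the prior
a, b = 2, 2                                # illustrative prior pseudo-counts
theta_bayes = (a + ones) / (a + b + M)     # 9/14, closer to 0.5 than the MLE
```

As M grows, the pseudo-counts a and b are swamped by the data counts, which is the diminishing-prior behavior described below.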
We can encode our knowledge about θ as a prior using a probability distribution. Here we assume that the observations are conditionally independent given the parameter θ:

P(x[1], ..., x[M], θ) = P(x[1], ..., x[M] ∣ θ) P(θ) = P(θ) ∏_{m=1}^{M} P(x[m] ∣ θ)

P(θ ∣ x[1], ..., x[M]) = P(x[1], ..., x[M] ∣ θ) P(θ) / P(x[1], ..., x[M])

Figure 3.2: A visual representation of a dataset with θ representing the parameters.

If the prior is a Beta distribution, then the posterior is a Beta distribution. As we obtain more data, the effect of the prior diminishes. Thus the Bayesian framework allows us to trade off a diminishing prior as more data becomes available. One common choice for a prior distribution is the Dirichlet prior. The Dirichlet distribution is a generalization of the Beta distribution. If the prior, P(θ), is Dirichlet then the posterior, P(θ ∣ D), is Dirichlet. If P(θ) is Dir(α1, ..., αk), then P(θ ∣ D) is Dir(α1 + M[1], ..., αk + M[k]), where M[k] is the number of occurrences of xk.

3.4.1.2 Structure Learning

In Bayesian networks, structure learning involves determining the optimal network structure that best represents the conditional dependencies between variables [Koller and Friedman 2009]. There are two main approaches for structure learning in Bayesian networks, namely constraint-based methods and score-based methods. The problem statement for structure learning can be formulated as below.

• We assume that the dataset D = {ξ[1], ..., ξ[M]} consists of independent and identically distributed samples drawn from the true distribution P∗(X).
• P∗(X) is induced by a Bayesian network G∗ over X.
• To what extent do the independencies that manifest in G∗ also manifest in D?

We want to find the Bayesian network structure, G, that best represents the dependencies that manifest in D. Constraint-based approaches are methods that rely on statistical tests to infer conditional independence relationships between variables.
These methods build an independence map (I-map) which represents the conditional independence properties of the distribution. We aim to capture the network structure that accurately represents the independencies in the domain, and the basic idea is to build the best minimal I-map. Score-based approaches are methods that assign a score to each possible network structure based on how well it fits the data. Common scoring functions include the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and the Bayesian Dirichlet equivalent (BDe) score. These methods search the space of possible structures to find the one that maximizes the chosen scoring function. Examples of search algorithms include greedy search, hill-climbing, and simulated annealing. In the case of our modeling problem for ATMs, by determining the optimal structure of a probabilistic graphical model that best fits the data, one can gain insights into the relationships and dependencies between these variables (ATM transaction volumes), which can be crucial both for understanding the underlying processes and for making accurate predictions.

3.5 Evaluation

In probabilistic graphical models, inference is the process of answering probabilistic queries about the variables in a model based on observed evidence [Heckerman 1995]. Exact inference in PGMs is computationally intensive and, in general, NP-hard. However, these are worst-case results, and inference with graphical models is mostly manageable for practical applications. We want to be able to compare our ground truth distribution, which we generated by following the method discussed in the experimental setup section, with the new distributions of the mutilated Bayesian networks. We will compare these distributions at the level of probabilistic queries for evaluation.
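As a hedged illustration of the score-based approach discussed earlier in this section, the sketch below compares the BIC of two candidate structures over two binary variables on synthetic data (plain Python; all parameters are invented, and this is a toy stand-in for a full structure search):

```python
import math
import random

rng = random.Random(1)
# Synthetic binary data where B genuinely depends on A
data = []
for _ in range(500):
    a = 1 if rng.random() < 0.5 else 0
    b = 1 if rng.random() < (0.9 if a else 0.2) else 0
    data.append((a, b))
M = len(data)

def bernoulli_loglik(xs):
    """Log-likelihood of binary observations under their MLE frequency."""
    p = sum(xs) / len(xs)
    return sum(math.log(p if x else 1 - p) for x in xs)

# Structure 1: A and B independent (2 free parameters)
ll_indep = (bernoulli_loglik([a for a, _ in data])
            + bernoulli_loglik([b for _, b in data]))
bic_indep = ll_indep - (2 / 2) * math.log(M)

# Structure 2: A -> B (1 parameter for A, plus one per value of A for B: 3 total)
ll_dep = bernoulli_loglik([a for a, _ in data])
for val in (0, 1):
    ll_dep += bernoulli_loglik([b for a, b in data if a == val])
bic_dep = ll_dep - (3 / 2) * math.log(M)

print(bic_dep > bic_indep)   # the structure with the real A -> B edge scores higher
```

The BIC trades off fit against a complexity penalty of (k/2) log M for k free parameters; a hill-climbing search simply evaluates such scores over single-edge additions, deletions and reversals.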
Suppose that the probability PB(x) = α is the probability of x in the ground truth distribution and the probability PB†(x) = β is the probability of x in the new joint probability distribution of the mutilated network. The probability of x in this case would have changed, and we can calculate the Mean Squared Error (MSE) between these two parameters over n such queries using the formula below:

MSE = (1/n) ∑_{i=1}^{n} (α_i − β_i)^2

To compare the learned distribution with the ground truth distribution we will be using a well-known formalism for doing this, the KL divergence. The Kullback-Leibler (KL) divergence is a measure used in probability theory and information theory to quantify how one probability distribution differs from a second, reference probability distribution. We will be measuring how many bits have changed from the learned distribution, and this will be done on the test data or the validation data. Given two probability distributions P and Q defined over the same probability space, the KL divergence from Q (the approximate distribution) to P (the true distribution) is defined as:

D_KL(P ∣∣ Q) = ∑_x P(x) log(P(x) / Q(x))

For continuous distributions, the sum is replaced by an integral:

D_KL(P ∣∣ Q) = ∫ p(x) log(p(x) / q(x)) dx

Since we are generating the ground truth distribution on our own, we are able to verify the learned model against the actual ground truth model. Unfortunately, we will not be able to do the same with the application data, the ATM data, because in that case we do not have access to the ground truth model. However, we can evaluate the model for the ATM data by using accuracy, recall and precision. There are several types of inference tasks in PGMs; we will look at variable elimination and approximate inference for performing probabilistic queries on our learned graphical model.
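The discrete form of the KL divergence above can be computed directly for the marginals returned by probabilistic queries; a minimal sketch using base-2 logarithms, so the result is in bits (the two marginals are invented for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions over the same outcomes."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {0: 0.5, 1: 0.5}    # marginal of some query under the ground truth network
q = {0: 0.7, 1: 0.3}    # same marginal recovered from the mutilated network
print(round(kl_divergence(p, q), 4))   # 0.1258
```

Note that D_KL(P ∣∣ Q) is asymmetric and zero only when the two distributions coincide, which is why it is reported per query against the ground truth rather than averaged symmetrically.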
Figure 3.1 represents an influence structure between stochastic processes; if we reduce the graph structure by removing a node and the edges leading to P2, we get the influence structure below.

Figure 3.3: An influence structure between stochastic processes with a reduced graph structure.

If one of the parameters of a node represents a stochastic process and we remove a node, as in Figure 3.3, from the network after we have learned a probabilistic graphical model, how would this influence the other nodes and the probability density of the learned structure? We can also test for probabilistic inference by having an algorithm that removes nodes from the graphical model. If we let PG(X ∣ Y) represent the graphical model in Figure 3.1 and let PĜ(X ∣ Y) represent the graphical model in Figure 3.3, which is the reduced graph, we know that PG(X ∣ Y) ≠ PĜ(X ∣ Y). Given the reduced graph Ĝ, we want to test if we can still do the same inference with Y as we did with the original graph G. In order to test for this we will utilize the two inference approaches that are explained below.

3.5.0.1 Variable Elimination

Exact inference refers to the computation of exact probabilities or marginal distributions of variables in the model given some evidence [Pearl 1988]. Variable elimination is an exact inference technique for eliminating variables from a joint distribution by iteratively summing out variables based on the factorization of the joint distribution. The main idea of variable elimination is to exploit the structure of the Bayesian network to simplify the computation of marginal probabilities. Below are some of the properties of the structure of a Bayesian network:

• Some sub-expressions of the joint distribution only depend on a small number of variables.
• By caching results we do not have to recompute them.
• Some important properties are:
1. P(A) = ∑_B P(A, B).
2.
∑_C P(A) P(B, C) = P(A) ∑_C P(B, C).

Below is an overview of the variable elimination algorithm:

1. Initialization: start with the factors corresponding to the conditional probability tables (CPTs) of the variables in the network.
2. Elimination: for each variable Xi in the order specified, do the following:
• Multiply all factors containing Xi to form a new factor Fi.
• Sum out Xi from Fi to get a new factor F̂i.
• Replace the factors containing Xi with F̂i in the factor list.
3. Termination: after eliminating all variables, the remaining factors contain the joint probability distribution of the remaining variables.
4. Normalization: to obtain the marginal probabilities, normalize the factors by dividing by the sum of their values.

To reduce the computational complexity of inference, we can make use of variable elimination as it eliminates redundant calculations. We can make use of variable elimination to perform inference for our problem in Figure 3.3.

3.5.0.2 (Alternate) Approximate Inference

In Probabilistic Graphical Models, approximate inference refers to a collection of techniques used to approximate the posterior distribution over variables of interest when exact inference is computationally infeasible [Wainwright and Jordan 2008]. These techniques attempt to provide a close approximation to the true posterior distribution while reducing computational complexity. We will utilize two approximate inference techniques, namely Gibbs sampling and Markov Chain Monte Carlo (MCMC). The Gibbs sampling method attempts to “fix” the sample by re-sampling some of the variables. It generates a sample with arbitrary values (except for the evidence), and then re-samples all variables, keeping the evidence fixed, in an arbitrary order. Unlike forward sampling, the Gibbs sampling technique takes into account the downstream evidence.
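A minimal sketch of this re-sampling idea on a toy two-node network A → B (plain Python; the CPD parameters are invented). With a single non-evidence variable each step draws directly from the exact conditional, whereas a full Gibbs sampler would cycle through many variables, each conditioned on its Markov blanket:

```python
import random

# Toy network A -> B with Bernoulli CPDs (invented parameters)
P_A1 = 0.3                   # P(A = 1)
P_B1 = {0: 0.2, 1: 0.9}      # P(B = 1 | A)

def gibbs_estimate(b_evidence, n_steps, rng):
    """Estimate P(A = 1 | B = b) by repeatedly re-sampling A from its full conditional."""
    a = rng.randint(0, 1)    # arbitrary starting assignment (re-sampled immediately here)
    count_a1 = 0
    for _ in range(n_steps):
        # Full conditional: P(A | B = b) is proportional to P(A) * P(B = b | A)
        w1 = P_A1 * (P_B1[1] if b_evidence else 1 - P_B1[1])
        w0 = (1 - P_A1) * (P_B1[0] if b_evidence else 1 - P_B1[0])
        a = 1 if rng.random() < w1 / (w0 + w1) else 0
        count_a1 += a
    return count_a1 / n_steps

# Exact answer by Bayes' rule: 0.3 * 0.9 / (0.3 * 0.9 + 0.7 * 0.2) = 27/41, about 0.659
estimate = gibbs_estimate(1, 20000, random.Random(42))
```

As the number of steps grows, the empirical frequency converges to the exact posterior 27/41, illustrating how the chain of samples approaches the desired distribution.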
As we repeat the sampling process, we get closer and closer to the desired posterior:

P_ϕ(X) = P(X ∣ e)

The Gibbs sampling technique is an example of a class of methods called Markov Chain Monte Carlo (MCMC). MCMC is an iterative process in which successive samples move closer and closer to the desired posterior distribution. A Markov chain is defined as a graph of states over which the sampling algorithm takes a random walk. The following applies to Markov chains:

• The nodes of this graph are possible assignments of a variable x ∈ Val(X).
• The edges are a transition model, T, which specifies for each pair of states x and x̂ the probability T(x → x̂).

3.6 Modeling the Influence of Transaction Volumes between ATMs

It turns out that the modeling problem formulated in Section 3.1 is an important problem in real-world scenarios, one of which is modeling the relationships between ATMs. Automated Teller Machines (ATMs) are ubiquitous in our society and are widely used for financial transactions. These devices play a crucial role in the financial system as they give customers the ability to withdraw and deposit money, offer convenience, and provide a variety of other benefits to banks and customers. These machines are often deployed at a large scale in order to serve a client base spread out over wide geographical locations. The size and the spread of the ATM fleet introduce a logistical and financial optimization problem for banks. Understanding the relationships between ATMs can be applied to various problems facing financial institutions. Analyzing the relationship between ATMs can assist with detecting patterns of fraudulent behavior, such as coordinated attacks on multiple ATMs. Data-driven decision making can enable banks to make informed decisions about whether to install a new ATM or remove an ATM at a given location based on the relationship between all the ATMs, forecast transactions and optimize the use of resources.
If an ATM or a subset of ATMs in a network of ATMs were to go offline due to ATM bombings, load shedding, planned maintenance and so on, how would this affect the rest of the ATMs in the same region? Would customers move to transact at the nearest ATM? What associations manifest in the data, and how would we go about detecting vulnerabilities in the network? Predicting the influence of the different ATMs in the network can be achieved by aggregating the features of each ATM. We will be modeling the dependencies and relationships between different ATMs and their transaction volumes by using a probabilistic graphical model. The variables of study will mainly be the ATM transaction volumes: each node in the Bayesian network will represent the transaction volumes of an individual ATM. There will be local probability distributions inside each of these nodes that model the ATM transaction volumes. The Conditional Probability Distribution of a variable is a function of the variable itself and its parents. This local probability distribution can be modeled using, for example, a tree CPD, a tabular CPD or a Gaussian CPD. In our case, where we are studying transaction volumes, which are continuous variables, we have chosen Gaussian CPDs. The dependencies between the nodes will be represented by edges; for example, the transaction volumes at an ATM might depend on the transaction volumes at nearby ATMs, the time of day, or external factors like holidays. Historical ATM transaction data will then be used to learn the Conditional Probability Distributions and the structure of the Bayesian network. The structure of the Bayesian network will be determined by using structure learning algorithms.
Structure learning is the process of determining the graph structure that best represents the dependencies among the variables in the model. Parameter learning will be used to estimate the CPDs for each node given the structure. To analyze the outcomes of the modeling process we will utilize variable elimination for exact inference. Once we have a structure that represents the dependencies between the nodes and their CPDs, we can mutilate the network by removing a node or a subset of nodes along with the edges leading up to these nodes. Removing nodes means that we specify the nodes to be dropped from the graphical structure, which modifies the structure of the model. Parameter estimation will then be used to learn the new parameters of the modified structure so that exact inference can be used to see how the removal of nodes affects the outcomes. Model evaluation will follow the same steps as outlined in section 3.4, and to measure the deviation between the original structure of the Bayesian network and the new structure where nodes are removed we will use the Kullback-Leibler (KL) divergence, which measures how one probability distribution diverges from a second, expected probability distribution. The figure below, Figure 3.4, is a visual rep