UNIVERSITY OF THE WITWATERSRAND

MASTERS THESIS

Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa.

Student: Innocent Lino ERONE
Student Number: 1075688
Supervisor(s): Mr. Michael T. MAPUNDU, Dr. Trevor Graham BELL

A Research Report Submitted to the Faculty of Health Sciences in partial fulfilment of the requirements for the degree of Master of Science in Epidemiology - Public Health Informatics

26 October, 2021
http://www.wits.ac.za

Declaration of Authorship

I, Innocent Lino ERONE, declare that this thesis titled, "Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa." and the work presented in it are my own. I confirm that:

• This work was done wholly while in candidature for a research degree at the University of the Witwatersrand.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date: 26 October, 2021

UNIVERSITY OF THE WITWATERSRAND

Abstract

Faculty of Health Sciences
School of Public Health
Division of Epidemiology and Biostatistics
Master of Science in Epidemiology - Public Health Informatics

Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa.

by Innocent Lino ERONE

Introduction:
Disease profiles are changing, and environmental variability continues to alter the morphological appearance of species, necessitating enhancements to the diagnostic methods used to detect disease. The deterministic approaches applied in the current diagnostic methods for Malaria and COVID-19 present challenges of low sensitivity and specificity. In this study, we described the data structures and disease profiles for Malaria and COVID-19 surveillance data at the National Health Laboratory Services (NHLS), South Africa. We also explored the application of supervised Machine Learning (ML) to classify and predict clinical outcomes for Malaria and COVID-19.

Methods:
The COVID-19 surveillance data comprised 35,202 observations from a single (unit) dataset. The Malaria data was made up of three files: a demographics file, a laboratory-results file and a travel-and-treatment-history file, from which 40,094 linked observations were deduced. These datasets were divided into two portions: 75% for model specification and 25% designated for out-of-sample testing. We compared three supervised ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Random Forests (RF), together with their novelty-detection variants, Isolation Forest (iForest) and One-Class Support Vector Machines (OCSVM), to predict clinical outcomes for Malaria and COVID-19. To account for severe label imbalance, the majority class was under-sampled to obtain an equal class balance in the target.
Novelty detection approaches with iForest and OCSVM were also used to classify and predict Malaria and COVID-19 clinical outcomes.

Results:
The Malaria surveillance data was characterized by large proportions of missing data for demographic, syndromic and environmental characteristics. Though more complete than the Malaria data, the COVID-19 surveillance data did not follow tidy-data principles. In evaluating classifier predictive power on out-of-sample data with equal representation of clinical outcomes, RF yielded the best predictive power, with an Area Under Curve (AUC) score of 98% on the Malaria out-of-sample data when accounting for the distribution weight of the clinical outcome. Though not comparable to the scores from the Malaria data, RF still scored better than the SVM and KNN classifiers in the out-of-sample evaluation on the COVID-19 data. Generally, lower classifier performance was observed across all models when subjected to the COVID-19 out-of-sample data, where the KNN classifier registered the highest number of false-positive results. There were significantly higher numbers of False-Negative predictions with the SVM classifier compared to RF and KNN. However, RF performed slightly better in predicting True-Negative observations. By categorizing data with minority clinical-outcome representation as outliers, OCSVM predicted more negative observations compared to iForest.

Conclusions:
This study showed the impact of data quality in disease surveillance with respect to predictive modeling for the Malaria and COVID-19 medical conditions. The data were characterized by large proportions of incompleteness. Individual demographic characteristics and reported and recorded signs and symptoms, among other attributes that hold vital information for syndromic disease surveillance, were lacking. While supervised ML classifiers performed well on Malaria out-of-sample data, the same methods produced suboptimal results on similar COVID-19 surveillance data. Future studies could explore unsupervised ML approaches on the same surveillance data.

Acknowledgements

Firstly, I would like to thank my academic supervisors Mr. Michael T. MAPUNDU and Dr. Trevor Graham BELL for your immense support and insight throughout the project. Your stimulating discussions informed the direction of this research. Furthermore, I wish to express my gratitude towards Brenda Nansereko, whose thorough peer review helped me write a better thesis. Special thanks to the African Union Center for Disease Control; your support is incomparable. Finally, I would like to acknowledge the academic research team at the National Institute for Communicable Diseases - South Africa: you made this research possible!

Innocent Lino ERONE
26 October, 2021

Contents

Declaration of Authorship
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Epidemiology
1.2 Statement of the Problem
1.3 Research Question
1.4 Thesis Structure
2 Theoretical Background
2.1 Overview of NMC Surveillance in South Africa
2.2 Disease Manifestation and Management
2.2.1 Overview
2.2.2 Case Identification
2.2.3 Eradication Efforts
2.3 Classification
2.4 Inferential Classifiers: Rule-sets
2.5 Machine Learning Classifiers
2.5.1 Unsupervised Learning
2.5.2 Semi-supervised Learning
2.5.3 Supervised Learning
2.6 Estimating Classifier Performance
2.6.1 Optimization
2.6.2 Model Selection
3 Materials and Methods
3.1 Approach
3.2 Study Site
3.3 Study Population and Data Sources
3.4 Computational Environment
3.5 Conceptual Framework
3.6 Preprocessing
3.6.1 Curation
3.6.2 Data Definition
3.6.3 Feature Selection and Engineering
3.7 Model Specification
3.7.1 Splitting
3.7.2 Classification Strategies
3.7.3 Hyper-parameter selection
3.7.4 Support Vector Machines
3.7.5 The k-Nearest Neighbor Method
3.7.6 Decision Tree Learning: Random Forests
3.8 Novelty Detection Approaches
3.8.1 Isolation Forest
3.8.2 OneClass SVM
3.9 Learning Criteria
3.9.1 Contingency Table Metrics
3.9.2 Area Under Curve (AUC)
3.10 Ethics
4 Results
4.1 Descriptive Statistics
4.1.1 Malaria Analytical Data
4.1.2 COVID-19 Analytical Data
4.2 Predicting Probable Cases
4.2.1 Predictions using Balanced Datasets
4.2.2 Predictions using Weighted Datasets: Imbalanced Learning
4.3 Novelty Detection results
5 Discussion
5.1 Malaria and COVID-19 Surveillance Data Profiles
5.2 Classification and Prediction of Clinical Outcomes for Malaria and COVID-19
5.3 Qualitative Evaluation of Results
5.4 Limitations
6 Conclusion and Future Directions
Bibliography
7 Supplementary Tables and Graphs
7.1 Missing Value Report - Malaria Data
7.2 Missing Value Report - COVID-19 Data
7.3 Correlation Matrix COVID-19 - Malaria
8 Plagiarism Declaration
9 TurnItIn Report
10 HREC Research Clearance Certificate
11 NHLS Research Clearance
12 Research Ethics Training Certificate
13 Programming and Analysis Codes

List of Figures

2.1 NMC Reporting Cascade
2.2 SA Malaria Risk Map December 2018. Image credit: DoH SA
3.1 A conceptual framework for Supervised Machine Learning; adapted from various internet sources
3.2 Kernel Density Estimate plot for age at testing (years)
3.3 Preprocessing flow - Malaria dataset
3.4 Preprocessing flow - COVID-19 dataset
3.5 SVM classifier, a case of linearly separable data
3.6 KNN classifier
3.7 Decision Tree classifier branch
3.8 Error Matrix
4.1 Distribution of Malaria clinical outcome
4.2 Age distribution at Test Date
4.3 Average monthly tests by age (years)
4.4 Malaria tests done per Season Calendar
4.5 Malaria tests done per province
4.6 COVID-19 clinical outcome (raw dataset)
4.7 COVID-19 age distribution of the population
4.8 Frequency distribution of recorded symptoms on a log scale
4.9 Classifier performance in ROC space
4.10 Classifier performance in Precision-Recall space
7.1 Correlation Matrix for COVID-19 analytical dataset

List of Tables

3.1 Computational Environment
3.2 Malaria Dataset definition
3.3 COVID-19 Dataset definition
3.4 Evaluation measures for the Confusion Matrix
4.1 Descriptive Summary of Malaria Dataset
4.2 Descriptive Summary of COVID-19 Dataset
4.3 Performance Metrics on Balanced data (percentage scores on out-of-sample data)
4.4 Confusion Matrices for Malaria and COVID-19: Balanced Data
4.5 Performance Metrics on Weighted data (percentage scores on out-of-sample data)
4.6 Confusion Matrices for Malaria and COVID-19: Weighted Data
4.7 Performance metrics using Unary classification on Malaria data (percentage scores on out-of-sample data)
4.8 Confusion Matrix from Unary classification: Malaria data
7.1 Proportion of missing data - Malaria raw dataset
7.2 Proportion of missing data - COVID-19 raw dataset

List of Abbreviations

AUC Area Under Curve
CV Cross Validation
DoH Department of Health
ICD International Classification of Disease
iForest Isolation Forest
KNN K-Nearest Neighbor
MCC Matthews Correlation Coefficient
ML Machine Learning
NHLS National Health Laboratory Services
NICD National Institute for Communicable Diseases
NMC Notifiable Medical Conditions
OCSVM One-Class Support Vector Machines
PCA Principal Component Analysis
PPV Positive Predictive Value
PR Precision-Recall
RDT Rapid Diagnostic Test
RF Random Forests
RIM Rule Interestingness Measures
ROC Receiver Operator Characteristic
SVC Support Vector Classifier
SVM Support Vector Machine
TPR True Positive Rate
WHO World Health Organization

1 Introduction

1.1 Epidemiology

Although there is a global decline in incident cases (from 71 to 57 cases per 1,000 population at risk between 2010 and 2018), Malaria remains among the most common diseases in Africa and globally [1], with the World Health Organization (WHO) estimating 405,000 deaths from 228 million clinical episodes in 2018 alone [2]. There have been global efforts to accelerate the elimination of Malaria through improved diagnostic testing and treatment, especially in the WHO low- and medium-income countries, which have reduced Malaria incidence rates [3]. However, the rates of decrease in Malaria incidence and mortality are still low in countries with low-resourced health systems and limited ability for system improvements [3]. The Global Technical Strategy for Malaria (2016 to 2030), adopted by the World Health Assembly in 2015, aims to reduce Malaria-attributable cases and deaths by ninety percent by 2030 through integrating active surveillance with interventions [1]. Surveillance systems are effective in the elimination of parasites: they track Malaria transmission and pathways, which focuses diagnosis, treatment and prevention resources [4].

COVID-19 is a highly transmissible disease that was first reported in Wuhan, China, in December 2019. The disease is caused by a zoonotic virus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [5]. Coronaviruses belong to the Coronaviridae family under the Coronavirinae subfamily, and they have been known to cause several other infections in humans since the 1960s [6]. Globally, the disease has imposed a great public health burden in many countries due to its high transmission rate, with the WHO reporting over 113,000,000 cumulative confirmed cases and over 2,500,000 cumulative deaths as of March 2021 [7, 8].
Recent statistics indicate Africa is one of the least affected continents, with over 2.8 million cumulative confirmed cases and over 70,000 deaths, contributing 3% of the global cumulative COVID-19 related deaths [7], with South Africa the COVID-19 epicenter of Africa [9].

As part of the road map to eliminate diseases such as Malaria and COVID-19, countries need to ensure improved testing and follow-up on infection rates [4]. As Basu and Sahi [10] argue, early diagnosis and treatment reduce mortality and morbidity rates. Over the past decades, there has been an evolution in diagnostic testing techniques for Malaria. Though widely adopted, diagnostic systems such as light microscopy and Rapid Diagnostic Tests (RDTs) are dependent on biomarkers, and RDTs and microscopy are reported to have low sensitivity (the ability of a test to identify people with a disease, usually expressed as a proportion) and specificity (the ability of a test to correctly identify people without a disease; the proportion of negatives correctly identified) for Malaria [11]. Research also highlights further significant challenges in the diagnosis of Malaria: acting as parasite reservoirs [11], asymptomatic individuals fuel resurgence of the disease years after reported treatment.

The current diagnostic methods for COVID-19 are laboratory-based and rely on biomarkers. Nevertheless, challenges associated with these diagnostic techniques exist [12], including shortages of test kits and long waiting times for results, among others. Much as research continues to support the association of epidemiological disease profiles with environmental variability, the current COVID-19 diagnostic techniques do not account for factors such as demographic information in diagnostic procedures. On the other hand, diagnostic techniques that assess the patient's signs (any objective evidence of disease) and symptoms (subjective evidence of disease) are reported to have poor diagnostic properties, especially among asymptomatic patients [13].

Subjective diagnosis of disease from symptoms is a vital component of disease surveillance. Therefore, to aid clinical diagnosis, there is research and innovation in the use of self-learning approaches to cope with these changing patterns. Commonly known as Machine Learning (ML), self-learning has been used to predict previously complex conditions like cardiovascular diseases [14] and obstructive pulmonary disease [15], among others. These stochastic methodologies use a wide array of features to identify hidden patterns in the data to predict disease outcomes. For example, by incorporating low-level features such as texture into digitized human blood smears from slides, Khan et al. [16] used K-means clustering to identify Malaria parasites with 95% accuracy. In the same way, using computer vision, Molina et al. [17] used unsupervised ML to identify Malaria parasites; the approach yielded a sensitivity score of 100% with specificity at 90%. Similarly, in a 2018 systematic review, Poostchi et al. [18] suggested that a well-defined predictor should incorporate several factors, from the characteristics of the microscope, the type of staining and slide preparation to image analysis, in Malaria predictive approaches.
Disease profiles are constantly changing, and environmental variability is altering the accepted morphological appearances of species [19]. Therefore, for successful control and eventual elimination of Malaria and COVID-19, more sensitive detection methodologies that incorporate symptomatic information with laboratory markers are needed. As of the year 2020, Malaria and COVID-19 accounted for the highest volumes of surveillance data at the South Africa National Institute for Communicable Diseases (NICD), a national public health institute located in Johannesburg, South Africa. Nevertheless, the institution uses deterministic approaches to predict classes/labels for Notifiable Medical Conditions (NMC). This strategy implements pre-determined rule-sets over selected laboratory scores, an approach whose logic is brittle and which rapidly becomes complex as features are added. Therefore, in this research, we explore stochastic discriminative approaches as an alternative to deterministic methods for predicting disease labels for Malaria and COVID-19. We constructed classifiers to accurately discriminate Positive and Negative Malaria and COVID-19 cases from demographic, symptom and laboratory data. These classifiers could then be used for discriminative analysis to segregate probable cases of Malaria and COVID-19 from new data.

1.2 Statement of the Problem

Growing data dimensionality is a real threat to deterministic (rule-set) classifiers, where a rule-set is a set of human-crafted conditions that trigger a decision or choice; in computer science, such knowledge is presented and handled as logical rules implemented by an inference engine. Yet to enhance clinical diagnosis, it is necessary to look at a broad spectrum of data points/features, not only laboratory markers but also non-laboratory markers such as symptoms. As with most legacy systems, self-learning (the ability to recognize patterns, learn from data, and become more intelligent over time) is absent in rule-set classifiers. In light of changing rule-sets, continuous learning and domain expertise become mandatory to keep such classifiers relevant. To cope with changes in data structures, self-learning approaches in ML become necessary. We chose COVID-19 and Malaria for this research because of the high volume of data readily available to experiment with these ML models.

1.3 Research Question

The questions of interest are:

1. Are supervised ML techniques better than rule-set approaches in the classification of Malaria and COVID-19 at the NICD?
2. What can be said of the current deterministic approaches that categorize Malaria and COVID-19 in respect to current surveillance data profiles?

To answer these questions, in this research, we explore stochastic self-learning classifiers, using supervised ML prediction techniques to predict disease status from Malaria and COVID-19 surveillance data from the NICD. We also perform a comparative analysis of the current deterministic approach and how it performs against novel ML classification approaches. Therefore, the work of this project aims to:

1. Describe the current surveillance data structures and profiles for Malaria and COVID-19 in South Africa.
2. Identify optimal ML algorithms that can be utilized to classify and predict Malaria and COVID-19 clinical outcomes from the available data structures.
3. Evaluate the performance of selected ML algorithms against the conventional rule-set methods used to categorize disease status for Malaria and COVID-19.
1.4 Thesis Structure

The rest of this report is organized as follows: In Chapter 2, we explore general concepts of the Malaria and COVID-19 NMCs, from case identification and clinical manifestations to treatment. We also explore the current NMC surveillance approach at the NICD. In Chapter 3, we describe the various materials and methods used in this research, with empirical data (results) presented in Chapter 4. Lastly, a discussion of the research findings is presented in Chapter 5, along with conclusions from the research in Chapter 6.

2 Theoretical Background

In this section, we first look briefly at Malaria in the context of disease manifestation and management while highlighting ideal parameters to aid clinical diagnosis. Secondly, we give an overview of the NMC surveillance process as conducted by the NICD. Herein, we also cover specific key concepts and constructs that inform the direction of this research. Some of the questions addressed include: what approaches are available to address the objectives stated in Chapter 1, what classifiers are available for this task, and how optimal classifiers can be determined. To answer these questions, we begin by exploring classification approaches relating to this research and the alternatives available. The last section explains the performance metrics that are available to aid model selection.

2.1 Overview of NMC Surveillance in South Africa

Globally, Monitoring and Evaluation (M&E) programs are used to collect health-related data [20]. These data are then used to track progress towards targets and to assess the impact of current health interventions and the WHO goals of morbidity control and elimination [21, 22]. NMC surveillance is a vital process in providing the information necessary to timely and accurately detect public health threats. The National Department of Health (DoH) of South Africa defines NMC as diseases that are of public health importance [23] because of the risks they pose. As illustrated in Figure 2.1, this reporting follows an upward cascade starting at the Health-establishment level, then the Sub-District or District level, and then the national system. It is a legal obligation for all health practitioners to report diseases classified as NMC to the DoH.

At the NICD, NMC reporting timelines vary depending on severity. The National Guidelines for the Treatment of Malaria in South Africa 2019 require that all Category-1 conditions be reported within 24 hours of first diagnosis, irrespective of laboratory confirmation [24]; Category-2 conditions within 7 days of receipt of laboratory confirmation; Category-3 conditions within 7 days of diagnosis [23]; and Category-4 conditions up to one month after diagnosis.

Figure 2.1: NMC Reporting Cascade

2.2 Disease Manifestation and Management

2.2.1 Overview

In South Africa, Malaria is regarded as a Category-1 NMC and therefore must be reported within 24 hours of first diagnosis, irrespective of laboratory confirmation [23]. With Light Microscopy using Giemsa-stained thick/thin blood smears as the yardstick [25, 26], several diagnostic methods have been adopted to support these global efforts to reduce and eventually eliminate Malaria [27, 28].
However, not all standards are ideal, especially in Malaria-endemic areas, and affordable Point-of-Care diagnostics (Rapid Diagnostic Tests) have been reported to have differing sensitivity and specificity [29, 30, 31]. It is argued that in South Africa, malaria is mainly transmitted along the border areas, with parts of three of South Africa's nine provinces (Limpopo, Mpumalanga and KwaZulu-Natal) endemic for malaria [32]. Figure 2.2 illustrates the disease severity across South Africa.

Figure 2.2: SA Malaria Risk Map December 2018. Image credit: DoH SA

Diagnosis of COVID-19, a Category-1 NMC, is based on biomarkers related to the organisms that cause disease. The United States Centers for Disease Control and Prevention recommends two types of tests: a viral test that detects current infection and an antibody test that detects previous infection. The approved assays used for testing detect either COVID-19 nucleic acid or antigen in upper or lower respiratory specimens (oral or nasal swabs) to determine whether an individual has COVID-19 or not [33].

2.2.2 Case Identification

Almost all Malaria deaths are caused by Plasmodium falciparum [34], with pregnant women, older persons, children under 5 years and those with co-morbidities at greater risk. Symptomatically, uncomplicated Malaria is known to cause fevers and chills, headache and general body weakness in those infected by the parasite. Left unattended, the disease may rapidly progress into severe Malaria, with patients exhibiting one or more conditions including very low blood glucose levels, low haemoglobin (less than 50 g/L, i.e. 5 g/dL), pulmonary oedema, renal failure, breathing distress, relaxed blood pressure (less than 70 mmHg in adults and 50 mmHg in children), convulsions and sometimes multisystem failure, among others.

The most common symptoms at the onset of COVID-19 are fever, cough, and myalgia or fatigue, while the less common symptoms are sputum production, headache, hemoptysis and diarrhoea [35]. However, studies continue to indicate that many patients with COVID-19 either do not manifest any symptoms or register only mild symptoms of the disease, and these cases spread the virus to other non-infected persons [36]. These asymptomatic COVID-19 cases increase complexities in active surveillance, screening and classification, a factor impeding efficient prevention and control of the disease. Studies have indicated a relationship between the risk of infection and comorbidities: there is an increased risk of COVID-19 infection especially in persons with pre-existing conditions [37, 38], and it is reported that diseases such as hypertension, diabetes and respiratory disease are more prevalent among fatal cases [39].

2.2.3 Eradication Efforts

Eradication of Malaria requires a multi-disciplinary effort, from the active treatment of asymptomatic cases [40] to socio-economic improvement [41]. With the Malaria vaccine (RTS,S/AS01) in the trial phase [42], artesunate-based medications are still the WHO-recommended standard treatment for both uncomplicated and severe Malaria in humans. Without proper management, falciparum Malaria is known to persist in some individuals several years after they leave Malaria-endemic areas [43]. Therefore, any individual who has a fever and has been to a Malaria-endemic area is at risk. Galatas, Bassat, and Mayor [40] argue that persistent symptomless cases fuel transmission.
Therefore, to minimize missed diagnosis of sub-clinical Malaria, a high index of suspicion is required [44]. Incorporating patient demographic information with reported symptoms and laboratory markers is essential for more accurate results. For example, Luo et al. [45] showed that incorporating patient demographics and laboratory results provided a powerful discriminant for ferritin. In a 2017 study using 8 features from the Kenyan Malaria Indicator Survey data, Rajpurkar, Polamreddi, and Balakrishnan [46] proposed a deep learning agent to predict the likelihood of one testing positive for Malaria using individual demographic characteristics. In the same way, in a 2020 study by Lee, Choi, and Shin [47], six ML models were compared using patient clinical information to predict Malaria. In that study, by incorporating one's nationality as a demographic characteristic alongside recorded symptoms, the Random Forest yielded the best scores, with an accuracy of 90.3% (AUC = 73.2%).

In South Africa, COVID-19 eradication efforts so far are geared towards preventive approaches to stop the further spread of the virus. As of April 2021, these measures have been boosted with a vaccination roll-out starting with the most at-risk populations. As Huang et al. [35] assert, global efforts to control and eventually eliminate COVID-19 still lack early detection methods. These approaches should include improved methods for prediction and classification of the disease to reduce transmission and improve patient survival rates.

Different ML algorithms have been applied in the prediction and classification of COVID-19. For example, in a 2020 study, Hamed, Sobhy, and Nassar [48] employed a KNN-variant algorithm to determine COVID-19 disease classification using incomplete heterogeneous data. The experiments showed the KNN-variant algorithm outperformed both the modified KNN and the standard KNN on the accuracy, precision, recall and F1-score performance metrics. Moreover, in a similar study, Iwendi et al. [49] reported the Boosted RF algorithm as an optimal predictor for COVID-19 where data was imbalanced (i.e. the class distribution was unequal).

2.3 Classification

In this study, we theoretically define classification as relating to the possible outcome of events occurring in a finite space, i.e. belonging to a specific category. However, this concept may not be interpreted the same way as the widely adopted WHO International Classification of Disease (ICD), which is essentially a list of Causes-of-Death to inform mortality and morbidity statistics [50].

James et al. [51] define classification as predicting a qualitative (categorical) response for an observation. Given an instance, classification algorithms induce predictive rules based on features and patterns in the data to predict classes, with predicted labels assuming a minimum of two levels [52]. These classifiers employ statistical and computational models to segregate datasets into categories. As an example, an algorithm that distinguishes kidney functionality in patients as "Severe/Moderate/Abnormal/Normal" based on estimated glomerular filtration rate can be regarded as a quaternary classifier; the outcome must belong to exactly one level, i.e. severe, moderate, abnormal or normal. There exist far more complex classifiers, for example document-categorization algorithms that sift through thousands of topics and group them into themes.
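To make the quaternary example above concrete, the following is a minimal sketch of such a classifier. The eGFR cut-off values used here are purely illustrative assumptions, not clinical guidance and not values taken from this study.

```python
def classify_kidney_function(egfr: float) -> str:
    """Map an estimated glomerular filtration rate (mL/min/1.73 m^2)
    to exactly one of four levels; the cut-offs are illustrative only."""
    if egfr >= 90:
        return "Normal"
    elif egfr >= 60:
        return "Abnormal"
    elif egfr >= 30:
        return "Moderate"
    else:
        return "Severe"

print(classify_kidney_function(75.0))  # -> Abnormal
```

Whatever the thresholds, the defining property is that every input maps to exactly one of the finite set of levels.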
Although more than one class may be considered, for simplicity this research focuses on Binary Classification, i.e. presence or absence of disease. In the sections that follow, we explore both Rule-based and ML classifiers.

2.4 Inferential Classifiers: Rule-sets

Rule-set classifiers rely on a set of predetermined inferential rules to determine classes. These inference rules are set into the system to churn data into discrete prediction outcomes [53], i.e. absence or presence of disease. Theoretically, in rule-based methodologies there is no limit on the number of rules applied. However, as with any other strategy, these classifiers do not come without challenges. The approaches are characterized by inconsistencies, difficulty in maintaining business rules and long load times, among other drawbacks [54]. Because of this subjective nature, there is always a trade-off between complexity in decision logic and accuracy in prediction outcomes.

Rule-set classifiers adopt inductive logic programming where each rule consists of a prior condition, sometimes called an antecedent, and a consequent/resultant. These classifiers take the form

if LEFT then RIGHT (2.1)

The rule dictates that if the "LEFT" hand side of the rule is satisfied, it should imply the "RIGHT" hand side, which in this case is the class label we are predicting. In practice, rule-based classifiers take into account all the rules to determine their performance. To estimate rule quality, we use Rule Interestingness Measures (RIM) to distinguish between rules. This is an area still under research, with no standard notations available yet [55]. Moreover, Piatetsky-Shapiro [56] proposes three criteria every RIM should satisfy:

1. The measure should be zero if $N_{Both} = (N_{Left} \times N_{Right}) / N_{Total}$
2. The measure should increase monotonically with $N_{Both}$
3. The measure should decrease monotonically with each of $N_{Left}$ and $N_{Right}$

where
$N_{Left}$: count of instances matching LEFT
$N_{Right}$: count of instances matching RIGHT
$N_{Both}$: count of instances matching both LEFT and RIGHT
$N_{Total}$: total number of instances
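A minimal sketch of one classical measure that satisfies all three criteria is Piatetsky-Shapiro's own rule-interest (leverage) statistic. The function below is an illustration of the criteria, not a measure used in this study.

```python
def piatetsky_shapiro(n_both: int, n_left: int, n_right: int,
                      n_total: int) -> float:
    """Rule-interest (leverage): zero under independence
    (N_Both = N_Left * N_Right / N_Total), grows with N_Both, and
    shrinks as N_Left or N_Right grow -- the three RIM criteria."""
    return n_both - (n_left * n_right) / n_total

# A rule matched by 40 of 100 records, where LEFT matches 50 and RIGHT 60:
print(piatetsky_shapiro(40, 50, 60, 100))  # 10.0 -> positively interesting
```

A value above zero indicates the rule fires more often than chance alone would predict; a value at or below zero marks the rule as uninteresting.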
2.5 Machine Learning Classifiers

Because of their ability to adapt, learn and continuously improve, ML algorithms are increasingly being used to make predictions in critical contexts [57] where the main goal is to maximize generalization, i.e. the ability to classify new data [53] previously unexposed to the classifier. These algorithms can pass over data, learn from it and apply the newfound knowledge to make intelligent decisions. This is achieved by creating mathematical functions of differing complexities that relate input to desired output. On a broad scale, the algorithms are organized into a taxonomy based on the desired outcome [58]. In this section, we briefly describe the three general categories.

2.5.1 Unsupervised Learning

Sometimes we are not fully aware of what features (X) should inform modeling solutions to classify a target/outcome (y). Our goal is then to explore the data and discover interesting patterns and properties [59] in the data, as opposed to prediction. These learning methods are termed Unsupervised Learning. Techniques such as Principal Components Analysis (a tool used for data visualization or data pre-processing) and Clustering (a broad class of methods for discovering unknown subgroups in data) are typically used to provide labels (clusters) or values (rankings) [60] before supervised techniques are applied. Unsupervised learning is a largely subjective process and, for this reason, assessing performance from these approaches can be hard. As James et al. [51] argue, there is no universally adopted mechanism to validate results against an independent dataset.

2.5.2 Semi-supervised Learning

A variation of Unsupervised Learning, Semi-supervised Learning is sometimes the appropriate choice, especially when a dataset contains only a small portion of labeled data. Using ensemble methods (sets of classifiers whose individual decisions are combined in some way to classify new examples), the algorithms generate annotations for the unlabelled data in quantities large enough to appropriately train the models. In principle, the bootstrapping process employs a supervised learning approach to classify these unseen data. To evaluate these classifiers, it is worthwhile having genuinely annotated data for evaluation [60].

2.5.3 Supervised Learning

These learning algorithms are ideal for discrete outcomes, i.e. the underlying output variable can only assume one of two states, such as diseased or not-diseased (binary classification). Algorithms such as decision tree induction, SVM, KNN and RF [61], among others, provide mechanisms to learn from annotated data and make predictions on new data. All these algorithms exhibit unique strengths and depend largely on the data quality and the task at hand.

In supervised learning, we assume a functional relationship exists between input and output. Let $\{x, y\}$ be a set of attributes where $y$ is the class label of instance $x$; then the attributes for disease ($D$) classification will be a set of predictor variables together with the clinical diagnosis consisting of Positive cases ($D^+$) and disease Negative cases ($D^-$). In other words, the algorithm assumes the form $D^+ \cap D^- = \emptyset$, where the output has been labeled a priori [62], i.e. there is some knowledge of the data.

2.6 Estimating Classifier Performance

To measure how well a model performs, it has to be evaluated on specified metrics. This is done by subjecting the algorithm to data previously unused in the training process or by employing other proven schemes. A common approach is to split the data into chunks, with 80% for training and 20% for testing. This is done to determine how accurately our predicted classes match the known labels in the evaluation set [60].
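As a minimal sketch of the hold-out evaluation just described, the snippet below splits synthetic data 80/20, fits a classifier on the training portion only, and scores the held-out portion. The data and the choice of classifier are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rs = np.random.RandomState(0)
X = rs.normal(size=(1000, 8))               # synthetic feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary labels

# Hold out 20% of the rows; the classifier never sees them in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

The accuracy printed at the end measures how often the predicted classes match the known labels in the evaluation set.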
2.6.1 Optimization

However, as with most ML tasks, splitting data is hampered by the number of observations available to apportion for training, testing and evaluation. As a result, some of the data are likely to be used during both model training and testing. This situation is sometimes called contamination and is likely to result in invalid estimates.

On the other hand, not all features are important for training models, and sometimes less-important attributes may be used to fit classifiers [51]. This allows models to learn spurious patterns (noise) in the data, leading to high variance. This behavior is called model over-fitting and usually happens where a model performs well on training data but suboptimally on out-of-sample data. Conversely, poor performance during training may yield better results during out-of-sample testing; in this case, the model is said to underfit the data. Model misfit (under-fitting and over-fitting) is often a problem in predictive analytics and requires attention.

One way to address model misfit is to account for the unequal distribution of classes. In this approach, a classifier is fit with the distribution weights of the target specified as a hyper-parameter. An alternative is to resample the data to obtain equal representation in the target [63].

Another approach involves robust schemes like Cross-Validation (CV) with K-Fold and Grid Search strategies. CV provides a method for evaluating how well a fitted model generalizes to new data. With K-fold CV, the data (X, y) is randomly split into K disjoint subsets, where K is a positive integer greater than 2 (10 is usually appropriate). The classifier is then iteratively trained using every single bin as testing data, with the rest (K - 1 subsets) as training data, after which the average performance is determined. In the Grid Search technique, the set of all possible combinations of settings specified in a parameter grid is iteratively passed to a model using the CV strategy. After this iterative process, the settings that yielded the highest scores from the validation are returned for model specification and generalizability. Given that the technique can be computationally intensive, it is highly dependent on the performance metrics from the K-fold CV optimizer.

Notation: Assume a labeled dataset (X, y) with an input matrix X of dimension n × m and an output vector y of dimension n × 1. We fit a statistical model p which, given the i-th sample from X, can predict the i-th element in y. The goal is to fit p such that for a new input X_i we are still able to predict y_i.
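A minimal sketch of the K-fold CV and Grid Search combination follows. The grid values, data and model are illustrative assumptions; the study's actual grids are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

rs = np.random.RandomState(0)
X = rs.normal(size=(500, 6))                 # synthetic features
y = (X[:, 0] - X[:, 2] > 0).astype(int)      # synthetic labels

# Every combination in the grid is scored with stratified 5-fold CV;
# the best-scoring combination is then refit on all of the data.
param_grid = {"n_estimators": [100, 300], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=StratifiedKFold(n_splits=5), scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cross-validated score, not the training score, drives the selection, which is what guards the returned settings against over-fitting.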
2.6.2 Model Selection

For this research, we used two common techniques in predictive analytics: the Confusion Matrix, also called the Error Table (a special kind of contingency table with two rows and two columns that reports the numbers of false positives, false negatives, true positives and true negatives; see https://en.wikipedia.org/wiki/Confusion_matrix), and the AUC. In addition to the confusion matrices and the AUC, we also visually represented classifier performance using the Receiver Operator Characteristic and Precision-Recall (PR) curves. These are further explained in Section 3.9.

3 Materials and Methods

In this chapter, we describe the study population from which data were drawn, along with the ethical considerations guiding the research. We also systematically address the procedures taken in adopting specific supervised ML algorithms to address the objectives stated in Chapter 1, from model specification to evaluation criteria.

3.1 Approach

This is a non-population-based retrospective study that analyses secondary Malaria data collected over a 5-year period (January 2015 to December 2019) and COVID-19 data collected over one year (March 2020 to March 2021) by the National Health Laboratory Services (NHLS), South Africa. The study utilizes pre-processed data generated or accumulated through the suspected-case notification systems informing NMC surveillance at point-of-care/health facilities, as well as laboratory results originating from sample testing.

3.2 Study Site

This research was conducted at the NICD, a division of the National Health Laboratory Services (the largest diagnostic pathology service in South Africa) located at Sandringham, Johannesburg, South Africa. Located on the southern tip of the African continent around 29°00'S and 24°00'E, South Africa experiences a varied climate throughout the year, with the colder months between June and August and the warmer months from December to February (Geography and climate | South African Government; retrieved October 19, 2021, from https://www.gov.za/about-sa/geography-and-climate). Covering a total area of 1,219,602 km², the country's landscape ranges from the lowvelds and bushvelds of Limpopo and Mpumalanga, the highvelds of Gauteng and the Free State, the Eastern Highlands of KwaZulu-Natal and parts of the Eastern Cape, and the great Karoo of the Western Cape to the bushland, Namaqualand and Griqualand of the Western, North West and Northern Cape [64].

3.3 Study Population and Data Sources

The data comprised clinical diagnoses, recorded vital signs and symptoms, reported risk factors, laboratory results and demographic information from suspected cases in both migrant and static populations as they presented at point-of-care facilities. Owing to topographical and socio-economic differences, we selected data from all districts and sub-districts, accumulated from all nine provinces and stored in the Surveillance Data Warehouse at the NICD.

Malaria:
The Malaria data used for this research was made up of three Comma Separated Values (CSV) files. Described below, these files were linked together using episode_no as the key field.

• MalariaDemographics - consisting of clinical notification data (cases identified through the NMC app) with 222,805 unique observations from 10 variables
• MalariaResults - a repeated-measures file with 766,074 observations from laboratory tests
• MalariaExtra1 - with 40,094 observations and 20 attributes (excluding the key field), containing observed symptoms along with treatment information, records of travel (including dates) and contact history

COVID-19:
This was a unit dataset (COVID-19.csv) with 35,202 observations from 25 variables, excluding the episode number (key field). This data file contained patient demographic information, recorded signs and symptoms, and reported comorbidities.

3.4 Computational Environment

While there exist alternative Integrated Development Environments that yield the same results, the pros and cons associated with them are a subjective topic. We chose to set up our computational environment using both licensed commercial and open-sourced BSD-licensed tools, hosted on the Microsoft Windows Operating System. Table 3.1 below is a full listing of the resources used.

Table 3.1: Computational Environment
PANDAS 0.24.2 - High-performance data structures and analysis tools
SKLEARN 0.24.1 - Tools for predictive data analysis; includes class libraries for ML models (https://scikit-learn.org/stable)
SEABORN 0.9.0 - Python data visualization library based on the MATPLOTLIB graphics library
NUMPY 1.16.2 - Numerical library to facilitate the data management process
PYTHON 3.7.4 - Interpreted, object-oriented programming language, based on the Anaconda Integrated Development Environment (IDE) (www.python.org)
STATA 15.1 - Statistical package for analysis (IC Edition), with annual updates (www.stata.com)
COMPUTER - Intel® Core™ i5 2.3GHz Processor, 16Gb memory, 64-bit Microsoft Windows 10 Pro Operating System

3.5 Conceptual Framework

We conformed to the agile software development methodology and adapted the generally accepted framework for supervised ML illustrated in Figure 3.1 below. In focus, we tackled aspects of the iterative process that drives the development of ML models.
Key considerations included the volume and nature of the data used, distributions in attributes, industry standards and approaches, and the assumptions behind decisions taken, among others.

Figure 3.1: A conceptual framework for Supervised Machine Learning; adapted from various internet sources

3.6 Preprocessing

During the data-extraction phase, the key vectors that will define the dataset and tune the algorithm are deduced, cleaned and standardized. Because the Malaria data were received as three separate files uniquely identified by episode numbers, we concatenated them using the Inner Join strategy (a join operation in relational algebra, combining entities in a relational environment) to obtain a single entity. An insight into the pre-processing is summarised below.

3.6.1 Curation

In an ideal world, data is clean and ready for analysis. However, this is not always the case: real-world data are messy. Adopting normalization approaches as employed in relational databases, Wickham [65] proposes the tidy-data model, where every variable forms a column, each observation forms a row and each type of observational unit forms an entity. There are interesting proposals in the literature regarding data tidying; however, the proposed methodologies may not be applicable in all situations, as datasets usually differ.

In this research, data were received as flat tables, with features clearly defined by columns and rows denoting observations. Checks for missing values and transcription errors were done and, where possible, corrected from referenced features. String values were encoded by mapping feature schemas (categorical classifications of a variable) and, where possible, data were re-coded with appropriate data types enforced. Correlation matrices were used to identify probable patterns and relationships between attributes.

To detect peculiarities (out-of-range values) in the data, exploratory data analysis was done using distribution plots, applying the notation below. For a feature $K$ that follows a skewed distribution, the thresholds $T_K$ were set at $1.5 \times IQR$ beyond the quartiles, where IQR (the Interquartile Range, also called the Midspread) is a statistical measure of dispersion: a value is said to be peculiar if it falls below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$. Consequently, these instances were dropped from the dataset. An example from the Malaria data is shown in Figure 3.2 below, where we noted 1,544 probable erroneously recorded ages, i.e. above 140 years. These implausible data did not fit within normal limits and were consequently not considered for analysis.

Figure 3.2: Kernel Density Estimate plot for age at testing (years)

3.6.2 Data Definition

We filtered and subjected only records with information across the two datasets to the data cleansing and feature engineering process described in subsection 3.6.3. An observation was considered a candidate for use if and only if a laboratory test record was successfully linked in the demographic-information dataset via an episode number. From this iterative process, we deduced feature vectors to inform approaches at particular steps in ML. As Jutte, Roos, and Brownell [66] assert, the process requires extensive resources to assemble key indicators. In our research, two data files were used: Malaria and COVID-19. A sketch of the joining and outlier-filtering steps is given below.
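The following is a minimal sketch of the inner join on episode numbers and the IQR-based age filter, assuming the three Malaria extracts named in Section 3.3 are available on disk as CSV files under those names; the file paths are an assumption, and the column names follow Table 3.2.

```python
import pandas as pd

# Hypothetical file paths; the three extracts are described in Section 3.3.
demo = pd.read_csv("MalariaDemographics.csv")
labs = pd.read_csv("MalariaResults.csv")
extra = pd.read_csv("MalariaExtra1.csv")

# Inner join on the episode number: keep only episodes present in all files.
df = (demo.merge(labs, on="episode_no", how="inner")
          .merge(extra, on="episode_no", how="inner"))

# Flag peculiar values with fences 1.5 * IQR beyond the quartiles,
# e.g. implausible ages such as the >140-year records noted above.
q1, q3 = df["age_tested_years"].quantile([0.25, 0.75])
iqr = q3 - q1
plausible = df["age_tested_years"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[plausible]
```

The inner join enforces the candidacy rule stated above: an observation survives only if its episode number links a laboratory record to a demographic record.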
Malaria:
This dataset consisted of laboratory markers from laboratory measures as well as demographic attributes informed by case-notification data. These data included test methods, clinical symptoms, specimen measures (parasite and cell counts), test dates, and triage information such as episode number and admission status, among others. The aggregated dataset contained 37 features. Of the 216,408 observations, more than half the predictor variables had over 80 percent missing information. These predictors could neither be imputed nor used for model specification and were consequently dropped from the final analytical dataset, as shown in Figure 3.3 below. A missing-values report is shown in Table 7.1, annexed in Appendix 7. Table 3.2 below is a high-level description of the Malaria analytical data used, with a descriptive summary presented in Section 4.1.

Figure 3.3: Preprocessing flow - Malaria dataset.

Table 3.2: Malaria Dataset definition
1 Target (String) - A laboratory-confirmed malaria test result
2 Gender (String) - Participant recorded gender
3 in_patient (String) - Participant hospitalization status
4 age_tested_years (Integer) - Participant recorded age (in years) at time of malaria test
5 red_cell_count (Float) - Red Blood Cell (RBC) count from laboratory
6 weather (String) - Calendar season deduced from the South Africa meteorological calendar
7 district_name (String) - District where test was done
8 province (String) - Province reporting malaria test result

COVID-19:
On the contrary, in deducing the COVID-19 analytical dataset, we did not drop missing data. Using regular-expression text-processing techniques, we inferred 14 features (13 binary and 1 continuous) from 24 candidate attributes in the raw data. A missing-value report is annexed in Table 7.2 of Appendix 7. To inform probable symptoms, these features were categorized to fall into one of the following groups: fever/chills, cough, sore throat, shortness of breath, diarrhoea, muscle/joint pains, malaise, fatigue/lethargy, influenza, and vomiting/nausea.

Figure 3.4: Preprocessing flow - COVID-19 dataset.

Because of inconsistencies in how this information was captured, a pooled indicator ComorbidityYN was created to indicate 'Yes' (1) if any comorbidity was registered or 'No' (0) if not.
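The symptom-inference step can be sketched as follows. The free-text column name and the regular-expression patterns are illustrative guesses; only the symptom groups themselves come from the description above.

```python
import re
import pandas as pd

# Symptom groups from Section 3.6.2; the patterns are illustrative guesses.
SYMPTOM_PATTERNS = {
    "Fever/Chills": r"fever|chills|pyrexia",
    "Cough": r"cough",
    "Sore Throat": r"sore\s*throat",
    "Shortness of Breath": r"short(?:ness)?\s*of\s*breath|dyspn",
}

def infer_symptoms(notes: pd.Series) -> pd.DataFrame:
    """Map free-text clinical statements onto binary symptom indicators."""
    out = pd.DataFrame(index=notes.index)
    for feature, pattern in SYMPTOM_PATTERNS.items():
        out[feature] = (notes.fillna("")
                             .str.contains(pattern, flags=re.IGNORECASE,
                                           regex=True)
                             .astype(int))
    return out

notes = pd.Series(["Fever and dry cough", "SORE THROAT", None])
print(infer_symptoms(notes))
```

Case-insensitive matching and the explicit handling of missing text mirror the inconsistencies in how this information was captured.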
Below is a description of the COVID-19 dataset (Table 3.3), with descriptive summary statistics provided in Section 4.1.

Table 3.3: COVID-19 Dataset definition.
1 Target (String) - A confirmed PCR COVID-19 test result
2 Age (Integer) - Participant recorded age (in years) at time of COVID-19 PCR test
3 Gender (String) - Participant recorded gender
4 Fever/Chills (Boolean) - Deduced symptom from statements inferring absence/presence of fever
5 Cough (Boolean) - Deduced symptom from statements inferring absence/presence of cough
6 Sore Throat (Boolean) - Deduced symptom from statements inferring absence/presence of sore throat
7 Shortness of Breath (Boolean) - Deduced symptom from statements inferring absence/presence of shortness of breath or difficulty in breathing
8 Diarrhoea (Boolean) - Deduced symptom from statements inferring absence/presence of diarrhoea
9 Joint/Muscle Pains (Boolean) - Deduced symptom from statements inferring absence/presence of joint and muscle pains
10 Malaise (Boolean) - Deduced symptom from statements inferring absence/presence of malaise
11 Fatigue/Lethargy (Boolean) - Deduced symptom from statements inferring absence/presence of fatigue or lethargy
12 Influenza (Boolean) - Deduced symptom from statements inferring absence/presence of influenza, common cold and sneezes
13 Vomiting/Nausea (Boolean) - Deduced symptom from statements inferring absence/presence of vomiting or nausea
14 ComorbidityYN (Boolean) - Deduced from statements indicating absence/presence of any underlying comorbidity

3.6.3 Feature Selection and Engineering

In predictive analysis, not all features in a dataset are important for classification and prediction, yet there is no one-size-fits-all method for this task. One approach is to use unsupervised statistical techniques like Principal Component Analysis (PCA). In this research, we employed domain knowledge, a manual dimension-reduction technique, to carefully select principal features for the task. This search problem aimed at minimizing collinearity and model misfit by removing correlated features and noise.

Ensemble Selection, proposed by Niculescu-Mizil et al. [67], was performed on the categorical attributes Sub-District and Province, and we limited OneHotEncoding to the top-10 levels of these categorical attributes. By mapping categorical data onto a binary scale, the OneHotEncoding process converts each categorical value into distinct attributes consisting of '1' or '0', denoting the presence or absence of a level. The demographic variables Hospitalization Status ('Y', 'N') and Gender ('M', 'F') were label-encoded to 1 and 0, denoting a positive and negative response respectively. For both datasets, the target (dependent/identifier) variable was coded as binary, with '1' and '0' denoting an observed positive and negative clinical outcome respectively.

Computational and time complexities were minimized by standardizing all features on a continuum to fit between zero and one using the MinMaxScaler implementation in the SKLEARN library. This is denoted by:

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$ (3.1)

where
$X_{scaled}$: the new transformed vector
$X$: the vector instance to transform
$X_{min}$: the minimum value of X in the vector domain
$X_{max}$: the maximum value of X in the vector domain

To minimize data leakage, the training and out-of-sample datasets were engineered independently through SKLEARN fit-transform methods. Because of skewness in the distributions, missing values in categorical features were imputed with the most frequent observations. Because of the adequate trade-off between precision of imputation and preserving the structure of the data [68], features on a continuum were imputed using the Nearest Neighbor strategy. In the COVID-19 analytical dataset, we probed for possible patterns in the missing data but did not find any; we summarily concluded that the data were missing completely at random. We therefore imputed missing values for age using the Nearest Neighbor imputation strategy and for gender using the most-frequent strategy [69].
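The following is a minimal sketch of the leakage-safe fit-transform pattern described above: imputers and scalers are fit on the training portion only and then applied to the out-of-sample portion. The tiny DataFrames and their values are made up for illustration; the column names follow Table 3.2.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# "train" and "test" stand in for the 75/25 split of Section 3.7.1.
num_cols = ["age_tested_years", "red_cell_count"]
train = pd.DataFrame({"age_tested_years": [34.0, 7.0, None, 61.0],
                      "red_cell_count": [4.8, 4.1, 4.7, 5.0],
                      "province": ["Limpopo", "Gauteng", "Limpopo", None]})
test = pd.DataFrame({"age_tested_years": [25.0, None],
                     "red_cell_count": [4.6, 4.9],
                     "province": ["Mpumalanga", "Limpopo"]})

# Continuous features: Nearest Neighbor imputation, fit on training data
# only so that nothing leaks from the out-of-sample split.
imputer = KNNImputer(n_neighbors=2)
train_num = imputer.fit_transform(train[num_cols])
test_num = imputer.transform(test[num_cols])

# Continuous features are then scaled onto [0, 1] per Equation 3.1.
scaler = MinMaxScaler()
train_num = scaler.fit_transform(train_num)
test_num = scaler.transform(test_num)

# Categorical feature: most-frequent imputation, then one-hot encoding.
mode = train["province"].mode()[0]
train_prov = pd.get_dummies(train["province"].fillna(mode),
                            prefix="province")
```

Note that transform (never fit_transform) is called on the test split, which is exactly what keeps the two datasets engineered independently.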
After the data preparation phase, the classification models and their unary derivatives were subjected to an iterative split-train-predict process to obtain generalized scores. The process is briefly explained below.

3.7.1 Splitting

For this research, generally accepted ML data-partitioning schemes were applied. The datasets were divided into two portions, stratified on the Target variable (clinical outcome), with one quarter held out as out-of-sample data and the remaining three quarters used for model specification.

3.7.2 Classification Strategies

In the classification and prediction of COVID-19 and Malaria clinical outcomes, we used three strategies. Firstly, we under-sampled the majority class to obtain equal representation of positive and negative clinical outcomes. Secondly, we kept the data as-is but accounted for the distribution imbalance when classifying and predicting clinical outcomes. The third, a novelty-detection technique, involved treating the minority category as outliers and then predicting labels presumed to belong to this class using out-of-sample data.

3.7.3 Hyper-parameter Selection

To define a generalized model that could be deployed on out-of-sample data, we first defined a set of candidate parameters for each classifier. Using a stratified parameter grid-search approach, the models were repeatedly fit and refactored over all possible parameter combinations under a 5-fold cross-validation scheme. The parameter combination yielding the best F1-score was identified and subsequently re-fit to define a generalized model. This approach was repeated for model specification on under-sampled (balanced) data and for models accounting for distribution weights.
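The splitting, under-sampling, and grid-search steps can be sketched as follows; a minimal illustration assuming scikit-learn, where the feature matrix X, target y, and parameter grid are hypothetical placeholders rather than the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Hypothetical feature matrix and imbalanced binary clinical outcome.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = (rng.random(1000) < 0.1).astype(int)

# 3.7.1: stratified 75/25 split on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 3.7.2 (first strategy): under-sample the majority class to equal balance.
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_bal, y_bal = X_train[idx], y_train[idx]

# 3.7.3: grid search over candidate parameters, 5-fold CV, F1 as criterion;
# GridSearchCV re-fits the best combination on the full balanced set.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_bal, y_bal)
best_model = grid.best_estimator_
```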
3.7.4 Support Vector Machines

SVM models are trained to distinguish and segregate all instances of one class from the rest. Using the concept of margins (the distance between a hyperplane and the closest data points), SVM aims to find the optimal hyperplane that best separates the data groups [53]. In an ideal scenario, the data are separable, i.e. positive cases can be distinguished from negative cases; a separating hyperplane can then be drawn by identifying the support vectors (the data points lying on the lines that define the margins). SVM predicts labels by finding the parameter function that maximizes the margin. The classifier naturally avoids over-fitting and bias by choosing the least complex function that yields minimal training error, a technique called regularization. Figure 3.5 gives a high-level illustration of the SVM classification technique.

Notation: Assume a set of N training examples, each belonging to one of two classes, say T and F, taking on labels +1 and -1 respectively, with each data point x_i having K attributes. The training data then take the form (x_i, y_i), where i = 1, ..., N, y_i ∈ {+1, -1} and x_i ∈ R^K, as illustrated in Figure 3.5. This implies y_i = +1 if x_i ∈ T and y_i = -1 if x_i ∈ F. For linearly separable data, the hyperplane assumes the function

\[ w^T x + b = 0 \tag{3.2} \]

where
w = the normal vector perpendicular to the plane
b = the bias, determining the plane's location relative to the origin.

Figure 3.5: SVM classifier - a case of linearly separable data

SVM then searches for the separating hyperplane that maximizes 1/||w||. New data points x can subsequently be classified using the decision rule

\[ f(x) = \mathrm{sign}(w^T x + b) \tag{3.3} \]

When data points fall on the wrong side of the hyperplane, SVM introduces a slack variable ε_i into the constraints and assigns a penalty to such points, relaxing the constraint to

\[ \forall i: \; y_i(w^T x_i + b) - 1 + \varepsilon_i \geq 0 \tag{3.4} \]

However, there are instances when the data can only be separated by a curved decision boundary, i.e. not linearly. In such cases, SVM assumes a soft margin that tolerates wrongly classified points while introducing a penalty C for this misclassification, retaining the linear separation technique. The SVM then solves

\[ \min_{w,\,b,\,\varepsilon} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \varepsilon_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \varepsilon_i, \;\; \varepsilon_i \geq 0 \tag{3.5} \]

Being a classification task, we adopted the Support Vector Classifier (SVC), a variation of the SVM, to classify and predict Malaria and COVID-19 clinical outcomes. Furthermore, because of the time complexity of the algorithm, the kernel hyper-parameter was set to 'linear'.

3.7.5 The k-Nearest Neighbor Method

This is a non-parametric classification technique (i.e. one that makes statistical inferences without regard to any underlying distribution) that relies on readily available data to predict classes for new data. Novel labels are predicted from properties shared with nearby data points; that is, an object is classified by the popular vote of its K neighbors. KNN depends on the distance function used to measure similarity [57] between data instances, i.e. the proximity of the K data points. With K = 1, each new data point is classified according to the properties of its single nearest neighbor, whereas overly large values of K introduce misclassification errors (false positives and false negatives); for example, a Malaria-positive case may be labeled negative if its neighborhood contains a majority of negative cases. The optimal K is therefore the one that minimizes the classification error. As illustrated in Figure 3.6, instance 'N' would be classified by the defined characteristics of either its 3 or its 7 nearest neighbors.

Figure 3.6: KNN classifier

With KNN, proximity is estimated by the Euclidean distance function

\[ D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{3.6} \]

where x = (x_1, ..., x_m), y = (y_1, ..., y_m), and m is the number of attributes of the two points x and y.
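As a minimal sketch of the two classifiers described above, assuming scikit-learn (the data arrays here are hypothetical stand-ins for the scaled analytical features):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical scaled training data and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 5)), rng.integers(0, 2, 200)
X_new = rng.random((3, 5))

# Linear-kernel SVC, as used in this study to limit time complexity.
svc = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# KNN: each new point is labeled by the majority vote of its k neighbors
# under Euclidean distance (Equation 3.6, scikit-learn's default metric).
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print(svc.predict(X_new), knn.predict(X_new))
```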
3.7.6 Decision Tree Learning: Random Forests

Decision trees classify data into discrete classes using a set of rules. These rules iteratively split the data on key attributes, i.e. the characteristics that best separate the data, until further splits are no longer informative. A decision tree cascades its rules, each of which may either invoke another rule or lead to a decision. Recursive algorithms such as Iterative Dichotomizer 3 (ID3) are used to construct decision trees. In the analogy of a forest, numerous decision trees collectively form a Random Forest. During learning, at each node the algorithm selects the attribute whose information gain best discriminates the labels; this information is then passed to the child nodes in cascade until a decision is reached. As a form of dimensionality reduction, information gain can also be used for feature selection, with each candidate feature evaluated in the context of the target. Figure 3.7 illustrates the RF classifier.

Notation: Assume a dataset S with attribute A. For a value v of A, let S_v ⊂ S denote the subset of S for which A = v, and let Values(A) be the set of all possible values of A. Information gain (a measure of the change in entropy achieved by a split) can then be expressed as

\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{3.7} \]

where Entropy is the measure of uncertainty of a random variable.

Figure 3.7: Decision Tree classifier branch

3.8 Novelty Detection Approaches

The ML classification models discussed so far share one assumption: that the outcome classes are, to some degree, equally represented, i.e. the data are balanced. This is not always the case. Epidemiological studies have shown that certain morbidities occur less frequently than others, and the surveillance and/or clinical datasets tracking such conditions are far more likely to be biased with respect to these less frequent occurrences. In ML, scarcity of a particular class label in a prediction dataset is what defines anomalies [71]. Anomalies have two distinct quantitative properties: they consist of fewer instances in the dataset, and they have peculiar characteristics (data values) compared with the majority (normal) instances. In principle, novelty detection focuses on identifying abnormal patterns within large amounts of normal data [72]. As Sun, Wong, and Kamel [73] argue, classification rules that predict small classes tend to be rare or undiscovered; as a result, out-of-sample data belonging to small classes are more likely to be misclassified than data belonging to the majority class. Under those circumstances, modeling deviations from the 'normal' provides an alternative whereby the minority class is treated as anomalous, a technique sometimes called novelty detection or unary/one-class classification. Whereas kernel optimization methods, statistical approaches, and Neural Networks [74], among other strategies, are available for anomaly detection, in this research we focused on two: iForest and OCSVM.

3.8.1 Isolation Forest

Proposed by Liu, Ting, and Zhou [75], the iForest algorithm recursively and randomly partitions instances until all instances are completely isolated. The approach uses a binary search tree algorithm to construct isolation trees (iTrees) from randomly selected attributes, which collectively form an iForest. Assume X = {x_1, ..., x_n} is a sample of the data with n training examples; an iTree is constructed by recursively splitting X on a randomly selected attribute q with split value p until (a) the tree reaches a height limit, (b) |X| = 1, or (c) all data in X have the same values. This approach creates shorter paths for anomalous points and is independent of distance or density measures.

3.8.2 One-Class SVM

In OCSVM, data presumed to originate from the normal class are used to train the support vector model, after which the model is tested on contaminated data (data in which the wholesome containment of the majority class is altered by the presence of a minority class) to ascertain performance metrics for the segregation. Although the SVM was originally developed for two-class classification tasks, extensions and enhancements such as Support Vector Data Description [76] and Local-Density OCSVM [77] have been proposed, with empirical results suggesting better performance than the original OCSVM.
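A minimal sketch of the two novelty-detection approaches, assuming scikit-learn, training only on the presumed-normal (majority) class; the arrays below are hypothetical rather than the study's surveillance data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 4))  # presumed-normal majority class
X_contaminated = np.vstack([rng.normal(0, 1, size=(90, 4)),   # mostly normal
                            rng.normal(4, 1, size=(10, 4))])  # plus outliers

# iForest: random recursive partitioning; anomalies yield shorter paths.
iforest = IsolationForest(random_state=0).fit(X_normal)

# OCSVM: learns a boundary around the normal class; nu bounds the fraction
# of training points that may be treated as outliers.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(X_normal)

# Both predict +1 for inliers ("normal") and -1 for outliers (the minority
# class modeled as anomalies in this study).
print(iforest.predict(X_contaminated))
print(ocsvm.predict(X_contaminated))
```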
3.9 Learning Criteria

3.9.1 Contingency Table Metrics

To fully understand model evaluation with supervised binary classifiers, we contextualize all possible results in the 2x2 table illustrated in Figure 3.8 below. Sometimes referred to as an error matrix or confusion matrix, it yields statistics that describe classifier performance across two possible cases: true classifications and misclassifications. We enumerated these as TP: True Positives - correctly predicted positive labels; TN: True Negatives - correctly predicted negative labels; FP: False Positives - negative observations incorrectly predicted as positive (Type-I error); and FN: False Negatives - positive observations incorrectly predicted as negative (Type-II error).

Figure 3.8: Error Matrix

Firstly we examine accuracy, an estimate of how well an algorithm discriminates unseen instances. Often expressed as a percentage, it is computed by dividing the count of correctly classified instances by the total number of predictions. Although accuracy was used to estimate model performance on out-of-sample data, the metric is prone to distribution bias since it depends on the relative class balance of the outcome. A more robust alternative is the Matthews Correlation Coefficient (MCC), which takes into account all four quantities of the confusion matrix [78, 79].

To quantify the predictive capacity of a classifier, we estimate precision: the fraction of predicted positive cases that are truly positive, computed by dividing the number of correctly predicted positive outcomes by the total number of positive predictions. This is sometimes referred to as the Positive Predictive Value (PPV).

Moreover, in clinical diagnosis it is far more tolerable to commit Type-I errors than Type-II errors: False-Negative results are far more dangerous than False Positives. We therefore adopt a measure that rewards identifying all positive instances [60]: sensitivity/recall, sometimes referred to as the True Positive Rate (a measure of a classifier's completeness), the proportion of relevant/positive results correctly classified by the algorithm. High recall values indicate low Type-II error, i.e. low FN counts.

Better classifiers should therefore have both high precision and high recall. To ease interpretation, the F1-score, the weighted harmonic mean of the algorithm's precision and recall, was used to estimate classification performance. An alternative performance measure used is Specificity/True Negative Rate (TNR): the proportion of negative observations correctly predicted as negative out of all negative observations. However, as with any predictive model, wrong predictions (misclassifications) are a reality, and we need a measure to quantify them. The misclassification rate, the proportion of falsely classified observations out of all classifications, quantifies this error. Let ŷ_i be the prediction for data point i with label y_i; then the error rate is defined as

\[ \mathrm{misc}_n = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y}_i) \tag{3.8} \]

Table 3.4 below summarizes the performance metrics used in this research.

Table 3.4: Evaluation measures for the Confusion Matrix

Metric              Expression
Accuracy            (TP + TN) / (TP + FP + TN + FN)
MCC                 ((TP * TN) - (FP * FN)) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Precision/PPV       TP / (TP + FP)
Sensitivity/Recall  TP / (TP + FN)
F1-Score            2 * (Precision * Recall) / (Precision + Recall)
Specificity         TN / (FP + TN)
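These metrics are available directly in scikit-learn; a minimal sketch with hypothetical label arrays in place of the study's predictions:

```python
from sklearn.metrics import (confusion_matrix, matthews_corrcoef,
                             precision_score, recall_score, f1_score)

# Hypothetical true labels and classifier predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (fp + tn)           # TNR, per Table 3.4
error_rate = (fp + fn) / len(y_true)   # Equation 3.8

print(f"MCC={matthews_corrcoef(y_true, y_pred):.3f}",
      f"precision={precision_score(y_true, y_pred):.3f}",
      f"recall={recall_score(y_true, y_pred):.3f}",
      f"F1={f1_score(y_true, y_pred):.3f}",
      f"specificity={specificity:.3f}",
      f"error={error_rate:.3f}")
```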
3.9.2 Area Under Curve (AUC)

In addition to the confusion matrix, we need a way to visualize, organize, and select classifiers by comparing their performance against classifiers with no skill, which we define as models that predict by chance, i.e. whose predictions are not informed by any prior patterns in the data. With a balanced binary outcome, we obtain post-estimation statistics using Receiver Operating Characteristic (ROC) curves: graphical representations of the trade-off between false-negative and false-positive rates at every possible cut-off, useful for visualizing recall/sensitivity against specificity. Better classifiers present curves closer to the top-left corner of the ROC space. This visual interpretation is, however, subjective, and quantifying it provides a more meaningful rationale: the AUC, ranging from 0 to 1, quantifies the discrimination capacity of the model, with an AUC of 0.5 suggesting no discrimination.

Under class imbalance (a difference in the numbers of positive and negative instances), as with novelty detection, ROC curves no longer give reliable estimates of model performance. Precision-Recall (PR) curves, analogous to ROC curves, are then used to estimate model performance; as illustrated in Figure 4.10, the goal is to be in the upper-right-hand corner [80, 81]. PR curves summarize the trade-off between the TPR and the PPV for a predictive model over different probability thresholds [82].

3.10 Ethics

The ethical and methodological aspects of this research were approved by the University of Witwatersrand Human Research Ethics Committee (M200509) and the NHLS Academic Affairs and Research Office (28 September, 2020). No human subjects were involved. Surveillance data from the NMC data warehouse were extracted, de-identified, and made available in a compressed and encrypted WinRAR (a shareware file archiver and data-compression program, https://www.win-rar.com) format. All computational experiments were conducted on a BitLocker-encrypted personal laptop accessible only to the researcher.

4 Results

In this chapter, we present the results of the COVID-19 and Malaria classification models built from the NMC surveillance data. We start by describing the population, then report findings from preprocessing, and lastly present prediction results according to the classification strategies defined in Section 3.7. We also briefly report on patterns we found interesting in the procedures taken.

4.1 Descriptive Statistics

4.1.1 Malaria Analytical Data

This section presents descriptive statistics for the Malaria out-of-sample analytical dataset. Firstly, we examine the distribution of clinical outcomes before and after data preprocessing.

Figure 4.1: Distribution of Malaria clinical outcome - (a) before preprocessing; (b) after preprocessing

In the raw (unprocessed) Malaria dataset, the positive-to-negative clinical diagnosis ratio was 100:268 across 216,408 observations. As illustrated in Figure 4.1 above, preprocessing distorted this distribution: of the 40,557 observations considered for analysis, clinically diagnosed Malaria-positive cases accounted for 94.1% (n=38,162) and negatives for 5.9% (n=2,395), a positive-to-negative ratio of 159:10. In the Malaria analytical dataset, we considered only complete cases across 7 features, of which 58.35% were male and the rest female. It was also observed that population age at test followed a bimodal distribution: up to about 20 years of age, there were more Malaria-positive than negative cases.
However, as illustrated in Figure 4.2, between the ages of 20 and 45 years there were more Malaria-negative than positive cases.

Figure 4.2: Age distribution at Test Date

We therefore adopted an approach similar to WHO reporting standards by creating age categories according to the WHO [2] Malaria risk-population age groups. The 15-49-year-old population contributed the largest number of observations (approximately 58%, n=23,512), with the older population, i.e. those over seventy years of age, accounting for the least (1%, n=591). Looking at the testing periods, 17.6% (n=7,145) of tests were done in January, with the fewest done in July. We then investigated the relationship between age at sample test and the calendar month in which the test was done. From Figure 4.3, we observe that those up to the age of 60 were most likely to have a Malaria test between April and August, while there was no clear period in which the older population (above 60 years of age) was likely to have a Malaria test.

Figure 4.3: Average monthly tests by age (years)

Months were extracted from the sample collection dates and categorized into four bands according to the South African weather seasons. From Figure 4.4, we observed a fairly even number of samples tested during the autumn months (March, April, May; n=14,443) and the summer months (December, January, February; n=14,440). Descriptive summaries are presented in Table 4.1 below.

Figure 4.4: Malaria tests done per Season Calendar

Table 4.1: Descriptive Summary of Malaria Dataset (n = 40,557)

Characteristic                 Sub-group         Distribution: n (%)
Clinical Test Result (Target)  Positive          38162 (94.09%)
                               Negative           2395 (5.91%)
Gender                         Male              23666 (58.35%)
                               Female            16891 (41.65%)
Age group                      Under 5            6703 (16.53%)
                               5-14               5736 (14.14%)
                               15-49             23512 (57.97%)
                               50-69              4015 (9.9%)
                               70+                 591 (1.46%)
Hospitalization Status         In-patient        21378 (52.71%)
                               Out-patient       19179 (47.29%)
Red Blood Cell count           Median (IQR)      4.37 (3.7, 4.91)
Calendar Season at test date   Autumn            14443 (35.61%)
                               Summer            14440 (35.6%)
                               Spring             8036 (19.81%)
                               Winter             3638 (8.97%)
Province reporting result      Limpopo           19683 (48.53%)
                               Mpumalanga         8425 (20.77%)
                               Gauteng            6649 (16.39%)
                               Kwazulu-Natal      2339 (5.77%)
                               North West         1176 (2.9%)
                               Western Cape       1096 (2.7%)
                               Eastern Cape        519 (1.28%)
                               Free State          515 (1.27%)
                               Northern Cape       155 (0.38%)
District where test was done   Mopani             9624 (23.73%)
                               Ehlanzeni          6847 (16.88%)
                               Vhembe             6529 (16.1%)
                               Ekurhuleni Metro   3554 (8.76%)
                               Capricorn          1426 (3.52%)
                               West Rand          1357 (3.35%)
                               Waterberg          1161 (2.86%)
                               Nkangala           1035 (2.55%)
                               Ethekwini Metro    1006 (2.48%)
                               Sekhukhune          943 (2.33%)
*IQR: Interquartile Range

The Malaria analytical dataset comprised four locality attributes: province, district, sub-district, and health facility. We observed little variability in sub-district and health facility, which were therefore unlikely to add predictive power to the classifiers; only province and district were considered as features.

The feature recording the district where the test was done had 52 levels. The ten districts with the fewest reported tests were Joe Gqabi and John Taolo Gaetsewe with 13 observations each; Harry Gwala with 12; Uthukela, Amajuba, Zf Mgcawu, Umzinyathi, and Namakwa with 11, 9, 8, 7, and 3 observations respectively; and Xhariep and Central Karoo with 2 observations each.
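The age-band and season derivations described above can be sketched as follows, assuming pandas; the DataFrame, bin edges, and season mapping below are illustrative placeholders consistent with Table 4.1, not the study's exact preprocessing code.

```python
import pandas as pd

# Hypothetical frame with the age and test-date columns from Table 3.2.
df = pd.DataFrame({"age_tested_years": [3, 27, 64, 81],
                   "test_date": pd.to_datetime(["2016-01-10", "2017-07-02",
                                                "2018-04-21", "2019-12-05"])})

# WHO-style malaria risk-population age bands used in Table 4.1.
df["age_group"] = pd.cut(df["age_tested_years"],
                         bins=[0, 5, 15, 50, 70, 120],
                         labels=["under5", "5-14", "15-49", "50-69", "70+"],
                         right=False)

# South African meteorological seasons deduced from the test month.
seasons = {12: "Summer", 1: "Summer", 2: "Summer",
           3: "Autumn", 4: "Autumn", 5: "Autumn",
           6: "Winter", 7: "Winter", 8: "Winter",
           9: "Spring", 10: "Spring", 11: "Spring"}
df["weather"] = df["test_date"].dt.month.map(seasons)
```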
Illustrated in Figure 4.5, the Limpopo and Mpumalanga provinces accounted for approximately 70% of the analytical dataset (n=28,108), with fewer than 1,000 tests done in the Northern Cape province. Because of the extended number of levels in the district attribute, only the ten most frequently reported districts were considered and used for model specification.

Figure 4.5: Malaria tests done per province

4.1.2 COVID-19 Analytical Data

The distribution of clinically diagnosed COVID-19 outcomes gave a positive-to-negative ratio of 162:100. Because no observations were dropped from the raw COVID-19 dataset, the before- and after-preprocessing distributions of the clinical outcome are identical.

Figure 4.6: COVID-19 clinical outcome (raw dataset)

Of the 14 features considered for the analytical dataset, two contained missing values: 2.1% (n=739) in age and 0.1% (n=35) in gender. As illustrated in Figure 4.7, between the ages of 10 and 90 years the COVID-19-negative population was slightly older than the positive population: 58 years (IQR 32.0, 53.0) vs 43 years (IQR 31.0, 54.0). Suspected COVID-19 cases at the time of sample collection were on average 31.5 years old (IQR: 31.0, 53.0).

Figure 4.7: COVID-19 age distribution of the population

It was also noted that not all suspected cases registered symptoms, as would be expected for such a highly infectious condition. Out of approximately 35,000 tests done, sore throat was reported in about 25% (n=8,807) of the population. Although cough, fever, and malaise have been reported in COVID-19 cases, these symptoms were recorded far less frequently in the COVID-19 surveillance data, as shown in Table 4.2. Figure 4.8 illustrates the logarithmic distribution of recorded symptoms in the COVID-19 analytical dataset, with descriptive summary statistics of the unstratified COVID-19 analytical dataset in Table 4.2 below.

4.2 Predicting Probable Cases

To determine probable cases of Malaria and COVID-19, we first investigated the relationships between all numerical features and the target using both correlation analysis and chi-square tests. From the correlation analysis of the Malaria analytical dataset, we observed largely weak relationships between the features and the target. However, the inter-feature correlation between gender and red-cell counts was positive (r = 0.18).

Table 4.2: Descriptive Summary of COVID-19 Dataset (n = 35,202)

Characteristic                 Sub-group           Distribution: n (%)
Clinical Test Result (Target)  Positive            21795 (61.91%)
                               Negative            13407 (38.09%)
Gender                         Male                11559 (32.87%)
                               Female              23608 (67.13%)
Age groups                     Below 60 years      27676 (86.1%)
                               60 years and above   4788 (13.9%)
Fever/Chills/Pyrexia           Absent              35175 (99.92%)
                               Present                27 (0.08%)
Cough                          Absent              35165 (99.89%)
                               Present                37 (0.11%)
Sore Throat                    Absent              26395 (74.98%)
                               Present              8807 (25.02%)
Shortness of Breath            Absent              35192 (99.97%)
                               Present                10 (0.03%)
Diarrhoea                      Absent              35199 (99.99%)
                               Present                 3 (0.01%)
Muscle or Joint aches          Absent              35195 (99.98%)
                               Present                 7 (0.02%)
Malaise                        Absent              35197 (99.99%)
                               Present                 2 (0.01%)
Fatigue or Lethargy            Absent              35201 (99.99%)
                               Present                 1 (0.01%)
Flu                            Absent              35200 (99.99%)
                               Present                 2 (0.01%)
Vomiting or Nausea             Absent              35201 (99.99%)
                               Present                 2 (0.01%)
Any Comorbidity                Absent              25687 (72.97%)
                               Present              9515 (27.03%)
Recorded comorbidities included HIV/AIDS, Tuberculosis, Hypertension, Diabetes, Asthma, Obesity, Cancer, and Chronic Obstructive Pulmonary Disease (COPD).
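The feature-target screening described above can be sketched as follows, assuming pandas and SciPy; the small frame df below is a hypothetical stand-in for the analytical dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical analytical frame with a binary target and mixed features.
df = pd.DataFrame({"target": [1, 0, 1, 1, 0, 1],
                   "red_cell_count": [4.1, 4.9, 3.8, 4.4, 5.0, 4.2],
                   "gender": ["M", "F", "M", "M", "F", "F"]})

# Pearson correlation between a numerical feature and the target.
print(df[["target", "red_cell_count"]].corr())

# Chi-square test of independence for a categorical feature vs the target.
table = pd.crosstab(df["gender"], df["target"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")
```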
Figure 4.8: Frequency distribution of recorded symptoms on a log scale

4.2.1 Predictions using Balanced Datasets

From the results presented in Table 4.3, all three classifiers (SVC, RF, and KNN) scored equally on accuracy (94%) with Malaria out-of-sample data. However, accuracies were lower when predicting clinical outcomes from COVID-19 out-of-sample data: the KNN scored lowest on accuracy (59%) whereas the SVC attained the highest predictive accuracy, though these differences were marginal. On both Malaria and COVID-19 out-of-sample data, the SVC yielded the highest sensitivity (100%) compared with KNN and RF.

Table 4.3: Performance Metrics on Balanced data (scores on out-of-sample data)

                         Malaria                  COVID-19
Performance Metric    SVC     RF      KNN      SVC     RF      KNN
Accuracy              0.941   0.94    0.942    0.62    0.611   0.59
MCC                   0       0.242   0.231    0.026   0.025   0.039
Sensitivity/Recall    1       0.99    0.994    1       0.944   0.826
Precision (PPV)       0.941   0.949   0.947    0.62    0.622   0.628
F1-Measure            0.97    0.969   0.97     0.765   0.75    0.714

Classifier-predicted values are presented in Table 4.4 below. At 97%, the SVC, RF, and KNN classifiers scored a higher F1-measure on Malaria out-of-sample data than on COVID-19 (76%, 75%, and 71% respectively). The classifiers generally scored high in predicting positive outcomes with the Malaria out-of-sample data, averaging 94%. By contrast, PPV scores on COVID-19 out-of-sample data were generally lower than on Malaria data: 62.0% for SVC, 62.4% for RF, and 62.8% for KNN.

Table 4.4: Confusion Matrices for Malaria and COVID-19: Balanced Data

                   Malaria                           COVID-19
Model/Classifier   TN    FN    TP     FP    ER       TN    FN    TP     FP    ER
SVC                0     0     9541   599   0.059    9     3     6536   4013  0.38
RF                 89    98    9443   510   0.06     275   366   6173   3747  0.389
KNN                70    61    9480   529   0.058    827   1139  5400   3195  0.41
ER: Error Rate. The number of correct and incorrect predictions by SVC, RF, and KNN, stratified by category.

Using AUC as the performance metric, we observed generally better performance in predicting clinical outcomes from Malaria out-of-sample data than from COVID-19, with the SVC, KNN, and RF classifiers predicting about 20% more accurately on Malaria data. These results are presented in Figure 4.9 below.

Figure 4.9: Classifier performance in ROC space - (a) Malaria (n=1389 per class); (b) COVID-19 (n=8938 per class)

4.2.2 Predictions using Weighted Datasets: Imbalanced Learning

In this approach, the models were refitted on the same sample data, this time accounting for the distribution weights of clinical outcomes during model specification. From the results presented in Table 4.5, the SVM yielded a recall score of 100% on both datasets; in other words, the model did not predict a single TN outcome from the 599 negative Malaria observations. The same poor prediction was observed on the COVID-19 data, where the SVC classifier accurately predicted only 9 of the 4,021 clinically negative observations. We also observe that the SVC classifier predicted the highest number of clinically positive observations (n=9,541).

Table 4.5: Performance Metrics on Weighted data (scores on out-of-sample data)

                         Malaria                  COVID-19
Performance Metric    SVC     RF      KNN      SVC     RF      KNN
MCC                   0       0.227   0.239    0.026   0.084   0.032
Sensitivity/Recall    1       0.986   0.992    1       0.486   0.786
Precision (PPV)       0.941   0.949   0.948    0.62    0.663   0.628
F1-Measure            0.97    0.967   0.969    0.765   0.568   0.698

The RF model for COVID-19 had the lowest sensitivity, at 48.6%. Accounting for target distribution weights, all three classifiers (SVC, RF, and KNN) scored higher PPV on Malaria than on COVID-19 (see Table 4.5).
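The weighting strategy can be sketched as follows; a minimal illustration assuming scikit-learn's class_weight option, not the study's exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 'balanced' re-weights each class inversely to its frequency, so the
# minority clinical outcome contributes proportionally more to the fit.
svc = SVC(kernel="linear", class_weight="balanced")
rf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Fit on the imbalanced training data as-is (no under-sampling), e.g.:
# svc.fit(X_train, y_train); rf.fit(X_train, y_train)
# Note: KNeighborsClassifier has no class_weight option; distance-weighted
# voting (weights="distance") is one alternative for KNN.
```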
Results from the error matrices are presented in Table 4.6 below.

Table 4.6: Confusion Matrices for Malaria and COVID-19: Weighted Data

                   Malaria                           COVID-19
Model/Classifier   TN    FN    TP     FP    ER       TN     FN     TP     FP    ER
SVC                0     0     9541   599   0.059    9      3      6536   4013  0.38
RF                 94    134   9407   505   0.063    2371   3291   3248   1651  0.468
KNN                81    80    9461   518   0.059    927    1399   5140   3050  0.42
ER: Error Rate. The number of correct and incorrect predictions by SVC, RF, and KNN, stratified by category.

With the F1-measure as the primary metric, the SVM outperformed the RF and KNN classifiers when modeled on COVID-19 data (76%, 57%, and 70% respectively). The scores remained similar (97%) when the models were subjected to the Malaria data. Taking into account the imbalance in the class distribution, we present the AUC results over the Precision-Recall space in Figure 4.10.

Figure 4.10: Classifier performance in Precision-Recall space - (a) Malaria Positive:Negative (1590:100); (b) COVID-19 Positive:Negative (162:100)

We observed higher predictive performance on Malaria out-of-sample data compared with COVID-19. The RF AUC was higher with Malaria out-of-sample testing (98%) versus 67% with COVID-19. Though marginally lower than the RF, the AUC for the SVC was still higher with the Malaria data. The same was observed with the KNN classifier, which yielded 98% on Malaria data while correctly predicting 65.5% of COVID-19 outcomes on out-of-sample data.

4.3 Novelty Detection Results

For novelty detection, we first examined the distribution of the target in both the training and out-of-sample data and observed severe skewness in the Malaria data: the positive-to-negative ratio was 16:1 in both the training and out-of-sample Malaria sets. Negative observations (the minority) were categorized as outliers in order to predict this category in the out-of-sample data. Table 4.7 below presents results from the unary classification approaches.

Table 4.7: Performance metrics using Unary classification on Malaria data (scores on out-of-sample data)

Performance Metric   OCSVM   iForest
MCC                  -0.05   0.037
Specificity/TNR      0.232   0.09
F1-Measure           0.071   0.092

We evaluated classifier performance in predicting the 599 out-of-sample Malaria-negative observations; the results are presented in the confusion matrix in Table 4.8 below. Worth noting, the OCSVM predicted more negative observations than the iForest (139 versus 54), hence its higher specificity score.

Table 4.8: Confusion Matrix from Unary classification: Malaria data

Model/Classifier   TN    FN     TP     FP
OneClassSVM        139   3162   6376   460
iForest            54    517    9024   545

5 Discussion

In this chapter, we discuss the empirical findings against the research objectives proposed in Chapter 1, presenting both a quantitative and a qualitative discussion of the empirical data reported in Chapter 4. The last section briefly discusses limitations encountered during the research.

5.1 Malaria and COVID-19 Surveillance Data Profiles

Between January 2015 and December 2019, Malaria prevalence in the surveillance data stood at 27% (n=58,692). Among those clinically diagnosed with Malaria, 57.2% (n=33,198) were from Limpopo province, followed by Mpumalanga (17%, n=9,623); Northern Cape had the fewest Malaria cases. Two possible explanations support this result. First, the climatic conditions in these areas favor Malaria transmission. Secondly, being border provinces, Mpumalanga and Limpopo have larger immigrant populations from the neighboring, Malaria-endemic countries of Mozambique and Zimbabwe.
These findings are in agreement with statistics reported in the Guidelines for the Treatment of Malaria in South Africa [24].

From the COVID-19 summary statistics reported in Table 4.2, fever was prevalent in less than 1% (n=27) of the population: 8 cases in the COVID-19-positive population and 19 in the negative population. The same was observed for cough (n=37) and fatigue/lethargy. The most prevalent symptom among those with clinical suspicion of COVID-19 was sore throat (25%, n=8,807); among the COVID-19-positive cases, sore throat was prevalent in 25% (n=5,450) of the population. On investigating the relationship between COVID-19 clinical outcomes and the presence/absence of sore throat, we found no statistical evidence in the data to suggest an association (p-value = 0.94).

In disease surveillance, attention is generally accorded to patients with a clinically positive rather than a negative outcome. This variation in capturing data creates gaps in the data over time. Because of this, we considered only complete cases for the Malaria analytical dataset. Several observations were dropped, the majority of which had a negative clinical outcome, reducing the dataset to 40,557 observations. Reporting stratified statistics from this biased sample would therefore over-estimate Malaria prevalence.

We investigated the association of risk factors with clinical outcomes. Although people of all ages are at risk, the older population, i.e. those above sixty years of age, is more susceptible to COVID-19 infection [83]. In the analytical dataset, however, the older population accounted for only 16% (n=3,422) of the 21,795 COVID-19-positive cases, with the majority under sixty years of age. A 2020 systematic review by Yang et al. [84] suggests an association between age and comorbidities among COVID-19 patients, identifying both as risk factors; the authors reported comorbidities to be more prevalent in high-risk populations, i.e. older patients reporting the presence of at least one comorbidity.

In this study, among those who tested positive for COVID-19, comorbidities were registered for 27.2% (n=5,935) of the population. This prevalence was similar among those who tested negative (26.7%, n=3,580). Using Mantel-Haenszel estimates, we observed that those who registered at least one comorbidity had the same risk of testing positive for COVID-19 as those who did not (Odds Ratio, OR = 1.02; 95% Confidence Interval, CI: 0.98, 1.08). However, adjusting for the effect of age, the older population had a 20% higher risk of testing positive for COVID-19 than the younger population (OR = 1.20; 95% CI: 1.06, 1.36). Noteworthy, of the 35,202 clinically suspected COVID-19 cases in this research, 6.4% (n=2,256) were high-risk, i.e. sixty years or older with at least one registered comorbidity. Although we found strong statistical evidence of an association between gender and COVID-19 clinical outcomes (p-value < .01), we did not find epidemiological evidence to support this finding.
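The Mantel-Haenszel adjustment can be sketched with statsmodels; the 2x2 counts below are hypothetical placeholders, not the study's actual stratified tables.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Hypothetical 2x2 tables (rows: comorbidity yes/no; columns: test
# positive/negative), stratified by age group (<60 vs 60+).
tables = [
    np.array([[4100, 2900], [14000, 9800]]),  # under 60 years
    np.array([[1835, 680], [1860, 1027]]),    # 60 years and above
]

st = StratifiedTable(tables)
or_pooled = st.oddsratio_pooled                  # Mantel-Haenszel pooled OR
ci_low, ci_high = st.oddsratio_pooled_confint()  # 95% CI by default
print(f"MH OR = {or_pooled:.2f} (95% CI: {ci_low:.2f}, {ci_high:.2f})")
```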
5.2 Classification and Prediction of Clinical Outcomes for Malaria and COVID-19

We applied three different approaches to predict probable cases of Malaria and COVID-19. The first two involved, respectively, resampling the data and modeling the data as-is while accounting for the distribution weights of the target; in both, our primary measure of performance was the AUC. In the third, a novelty approach, specificity was the primary evaluation metric.

Three models were fit to data in which the target had equal representation. With 1,389 observations in each class, the SVM, RF, and KNN models performed better on Malaria than on COVID-19 out-of-sample data. Although the per-class sample sizes were not comparable (1,389 for Malaria and 8,939 for COVID-19), this difference did not seem to alter the predictive power of the models. As illustrated in Figure 4.9, the RF classifier recorded the highest predictive power, at 80%, compared with the SVM and KNN (75.4% and 78.8% respectively). We also observed that, with COVID-19 out-of-sample data, the models performed no better than guessing, i.e. models with no skill.

As illustrated by the Precision-Recall and ROC curves, we observed an overall improvement in classification and prediction when classifiers accounted for the distribution weights of the target, though scores with COVID-19 data remained lower than with Malaria data. On the COVID-19 data, significant improvements were noticed for the RF, KNN, and SVC classifiers: the RF AUC improved from 56.1% to 67.3%, the KNN from 53.7% to 65.5%, and the SVC from 50.7% to 62.9%. This roughly twelve-percentage-point gain may be attributed to the increased number of observations the models were trained on, allowing them to learn more from the data and improve predictions. We also observed that the prediction error rate was generally higher with the RF classifier, though the difference from the KNN and SVM was marginal.

Given the target distributions presented in Section 4.1, novelty approaches were employed only on the Malaria data. Comparing the classification and prediction of negative observations, the OCSVM performed better than the iForest: of the 599 out-of-sample negative observations, the OCSVM predicted 139 correctly, attaining a specificity score of 0.23 versus 0.09 for the iForest (TN: 54). These results are unsatisfactory, as foreshadowed by the weak correlations reported in Section 4.2.

5.3 Qualitative Evaluation of Results

While this study employed disease predictors singly, a viable option proposed in the literature is to consider features in combination. For example, in the predictive diagnosis of Malaria from symptoms, individuals who report a fever and have had a previous Malaria episode in their household are more likely to yield a positive Malaria result. This approach would likely increase the predictive power of models compared with models that consider symptom information in isolation.

Whereas a clinical outcome was available for all observations, symptomatic information was less informative in the COVID-19 analytical dataset and completely absent in the Malaria dataset. Conversely, while the Malaria dataset contained laboratory markers (red blood cell counts), this information was completely lacking in the COVID-19 data. This dissimilarity in data structures was pronounced in the results, where models run on out-of-sample data performed overwhelmingly better on Malaria data than on COVID-19.
Research has shown that red blood cell counts are an indicator of infection in humans, which may explain the variation in prediction results. When implementing supervised ML predictive models, it is necessary to identify beforehand the relationships between features and targets as well as inter-feature correlations. In both the Malaria and COVID-19 analytical datasets, we identified weak correlations between features and targets, and the scant prevalence of recorded symptoms made it difficult to deduce informative correlations between clinical outcomes and the selected features. For this reason, we could not determine distinct predictors in the COVID-19 dataset, and only to a limited extent in the Malaria dataset. One approach to mitigating this pitfall is to employ comprehensive data-quality assessments at data-collection points: embedding automated integrity and validation checks in data-collection instruments enforces the tidy-data model proposed by Wickham [65] and described in Chapter 3.

In this research, we described the current surveillance data structures and profiles for Malaria and COVID-19 at the NHLS, South Africa. While clinical outcomes were observed for all observations, other attributes in the surveillance datasets were inconsistent and contained gaps. For example, between 2015 and 2019, disease symptoms, treatment information, travel and contact history, and case notes were unrecorded in up to 100% of observations in the Malaria surveillance data. This greatly limited how much information we were able to use for classification and prediction. Similarly, among the demographic attributes observed in the COVID-19 surveillance data, location information was lacking. Therefore, stratified analysis to determine cases per province to inform the DoH on resource allocation for C