UNIVERSITY OF THE WITWATERSRAND

MASTERS THESIS

Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa.

Student: Innocent Lino ERONE
Student Number: 1075688
Supervisor(s): Mr. Michael T. MAPUNDU, Dr. Trevor Graham BELL

A Research Report Submitted to the Faculty of Health Sciences in partial fulfilment of the requirements for the degree of Master of Science in Epidemiology - Public Health Informatics

26 October, 2021
http://www.wits.ac.za

Declaration of Authorship

I, Innocent Lino ERONE, declare that this thesis titled, "Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa." and the work presented in it are my own. I confirm that:

• This work was done wholly while in candidature for a research degree at the University of the Witwatersrand.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date: 26 October, 2021

UNIVERSITY OF THE WITWATERSRAND

Abstract

Faculty of Health Sciences
School of Public Health
Division of Epidemiology and Biostatistics
Master of Science in Epidemiology - Public Health Informatics

Applying Machine Learning To Classify Disease Status For Selected Notifiable Medical Conditions In South Africa.

by Innocent Lino ERONE

Introduction:
Disease profiles are changing, and environmental variability continues to alter the morphological appearance of species, necessitating enhancements to the diagnostic methods used to detect disease. The deterministic approaches applied in the current diagnostic methods for Malaria and COVID-19 present challenges of low sensitivity and specificity. In this study, we described the data structures and disease profiles for Malaria and COVID-19 surveillance data at the National Health Laboratory Services (NHLS), South Africa. We also explored the application of supervised Machine Learning (ML) to classify and predict clinical outcomes for Malaria and COVID-19.

Methods:
The COVID-19 surveillance data comprised 35,202 observations from a single (unit) dataset. The Malaria data was made up of three files: a demographics file, a laboratory-results file and a travel-and-treatment-history file, from which 40,094 linked observations were deduced. These datasets were divided into two portions: 75% for model specification and 25% designated for out-of-sample testing. We compared three supervised ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Random Forests (RF), together with their novelty-detection variants, Isolation Forest (iForest) and One-Class Support Vector Machines (OCSVM), to predict clinical outcomes for Malaria and COVID-19. To account for severe label imbalance, the majority class was under-sampled to obtain an equal class balance in the target.
Novelty detection approaches with iForest and OCSVM were also used to classify and predict Malaria and COVID-19 clinical outcomes.

Results:
The Malaria surveillance data was characterized by large proportions of missing data for demographic, syndromic and environmental characteristics. Though more complete than the Malaria data, the COVID-19 surveillance data did not follow tidy-data principles. In evaluating classifier predictive power on out-of-sample data with equal representation of clinical outcomes, RF yielded the best predictive power, with an Area Under Curve (AUC) score of 98% on the Malaria out-of-sample data when accounting for the distribution weight of the clinical outcome. Though not comparable to the scores from the Malaria data, RF still scored better than the SVM and KNN classifiers in the out-of-sample evaluation on the COVID-19 data. Generally, lower classifier performance was observed across all models when subjected to the COVID-19 out-of-sample data, where the KNN classifier registered the highest number of false-positive results. There were significantly higher numbers of False-Negative predictions with the SVM classifier compared to RF and KNN. However, RF performed slightly better in predicting True-Negative observations. By categorizing data with minority clinical-outcome representation as outliers, OCSVM predicted more negative observations compared to iForest.

Conclusions:
This study showed the impact of data quality in disease surveillance with respect to predictive modeling for the Malaria and COVID-19 medical conditions. The data were characterized by large proportions of incompleteness. Individual demographic characteristics and reported and recorded signs and symptoms, among other attributes that hold vital information for syndromic disease surveillance, were lacking. While supervised ML classifiers performed well on Malaria out-of-sample data, the same methods produced suboptimal results on similar COVID-19 surveillance data. Future studies could explore unsupervised ML approaches on the same surveillance data.

Acknowledgements

Firstly, I would like to thank my academic supervisors Mr. Michael T. MAPUNDU and Dr. Trevor Graham BELL for your immense support and insight throughout the project. Your stimulating discussions informed the direction of this research. Furthermore, I wish to express my gratitude towards Brenda Nansereko, whose thorough peer review helped me write a better thesis. Special thanks to the African Union Center for Disease Control; your support is incomparable. Finally, I would like to acknowledge the academic research team at the National Institute for Communicable Diseases - South Africa: you made this research possible!

Innocent Lino ERONE
26 October, 2021

Contents

Declaration of Authorship
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Epidemiology
1.2 Statement of the Problem
1.3 Research Question
1.4 Thesis Structure
2 Theoretical Background
2.1 Overview of NMC Surveillance in South Africa
2.2 Disease Manifestation and Management
2.2.1 Overview
2.2.2 Case Identification
2.2.3 Eradication Efforts
2.3 Classification
2.4 Inferential Classifiers: Rule-sets
2.5 Machine Learning Classifiers
2.5.1 Unsupervised Learning
2.5.2 Semi-supervised Learning
2.5.3 Supervised Learning
2.6 Estimating Classifier Performance
2.6.1 Optimization
2.6.2 Model Selection
3 Materials and Methods
3.1 Approach
3.2 Study Site
3.3 Study Population and Data Sources
3.4 Computational Environment
3.5 Conceptual Framework
3.6 Preprocessing
3.6.1 Curation
3.6.2 Data Definition
3.6.3 Feature Selection and Engineering
3.7 Model Specification
3.7.1 Splitting
3.7.2 Classification Strategies
3.7.3 Hyper-parameter selection
3.7.4 Support Vector Machines
3.7.5 The k-Nearest Neighbor Method
3.7.6 Decision Tree Learning: Random Forests
3.8 Novelty Detection Approaches
3.8.1 Isolation Forest
3.8.2 OneClass SVM
3.9 Learning Criteria
3.9.1 Contingency Table Metrics
3.9.2 Area Under Curve (AUC)
3.10 Ethics
4 Results
4.1 Descriptive Statistics
4.1.1 Malaria Analytical Data
4.1.2 COVID-19 Analytical Data
4.2 Predicting Probable Cases
4.2.1 Predictions using Balanced Datasets
4.2.2 Predictions using Weighted Datasets: Imbalanced Learning
4.3 Novelty Detection results
5 Discussion
5.1 Malaria and COVID-19 Surveillance Data Profiles
5.2 Classification and Prediction of Clinical Outcomes for Malaria and COVID-19
5.3 Qualitative Evaluation of Results
5.4 Limitations
6 Conclusion and Future Directions
Bibliography
7 Supplementary Tables and Graphs
7.1 Missing Value Report - Malaria Data
7.2 Missing Value Report - COVID-19 Data
7.3 Correlation Matrix COVID-19 - Malaria
8 Plagiarism Declaration
9 TurnItIn Report
10 HREC Research Clearance Certificate
11 NHLS Research Clearance
12 Research Ethics Training Certificate
13 Programming and Analysis Codes

List of Figures

2.1 NMC Reporting Cascade
2.2 SA Malaria Risk Map December 2018. Image credit: DoH SA
3.1 A conceptual framework for Supervised Machine Learning; adapted from various internet sources
3.2 Kernel Density Estimate plot for age at testing (years)
3.3 Preprocessing flow - Malaria dataset
3.4 Preprocessing flow - COVID-19 dataset
3.5 SVM classifier, a case of linearly separable data
3.6 KNN classifier
3.7 Decision Tree classifier branch
3.8 Error Matrix
4.1 Distribution of Malaria clinical outcome
4.2 Age distribution at Test Date
4.3 Average monthly tests by age (years)
4.4 Malaria tests done per Season Calendar
4.5 Malaria tests done per province
4.6 COVID-19 clinical outcome (raw dataset)
4.7 COVID-19 age distribution of the population
4.8 Frequency distribution of recorded symptoms on a log scale
4.9 Classifier performance in ROC space
4.10 Classifier performance in Precision-Recall space
7.1 Correlation Matrix for COVID-19 analytical dataset

List of Tables

3.1 Computational Environment
3.2 Malaria Dataset definition
3.3 COVID-19 Dataset definition
3.4 Evaluation measures for the Confusion Matrix
4.1 Descriptive Summary of Malaria Dataset
4.2 Descriptive Summary of COVID-19 Dataset
4.3 Performance Metrics on Balanced data (percentage scores on out-of-sample data)
4.4 Confusion Matrices for Malaria and COVID-19: Balanced Data
4.5 Performance Metrics on Weighted data (percentage scores on out-of-sample data)
4.6 Confusion Matrices for Malaria and COVID-19: Weighted Data
4.7 Performance metrics using Unary classification on Malaria data (percentage scores on out-of-sample data)
4.8 Confusion Matrix from Unary classification: Malaria data
7.1 Proportion of missing data - Malaria raw dataset
7.2 Proportion of missing data - COVID-19 raw dataset

List of Abbreviations

AUC Area Under Curve
CV Cross Validation
DoH Department of Health
ICD International Classification of Disease
iForest Isolation Forest
KNN K-Nearest Neighbor
MCC Matthews Correlation Coefficient
ML Machine Learning
NHLS National Health Laboratory Services
NICD National Institute for Communicable Diseases
NMC Notifiable Medical Conditions
OCSVM One-Class Support Vector Machines
PCA Principal Component Analysis
PPV Positive Predictive Value
PR Precision-Recall
RDT Rapid Diagnostic Test
RF Random Forests
RIM Rule Interestingness Measures
ROC Receiver Operator Characteristic
SVC Support Vector Classifier
SVM Support Vector Machine
TPR True Positive Rate
WHO World Health Organization

1 Introduction

1.1 Epidemiology

Although there is a global decline in incident cases (from 71 to 57 cases per 1,000 population at risk between 2010 and 2018), Malaria remains among the most common diseases in Africa and globally [1], with the World Health Organization (WHO) estimating 405,000 deaths from 228 million clinical episodes in 2018 alone [2]. There have been global efforts to accelerate the elimination of Malaria through improved diagnostic testing and treatment, especially in the WHO low- and medium-income countries, which have reduced Malaria incidence rates [3]. However, the rates of decrease in Malaria incidence and mortality are still low in countries with low-resourced health systems and limited ability for system improvements [3]. The Global Technical Strategy for Malaria (2016 to 2030), adopted by the World Health Assembly in 2015, aims to reduce Malaria-attributable cases and deaths by ninety percent by 2030 through integrating active surveillance with interventions [1]. Surveillance systems are effective in the elimination of parasites: they track Malaria transmission and pathways, which focuses diagnosis, treatment and prevention resources [4].

COVID-19 is a highly transmissible disease that was first reported in Wuhan, China, in December 2019. The disease is caused by a zoonotic virus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [5]. Coronaviruses belong to the Coronaviridae family under the Coronavirinae subfamily, and they have been known to cause several other infections in humans since the 1960s [6]. Globally, the disease has imposed a great public health burden in many countries due to its high transmission rate, with the WHO reporting over 113,000,000 cumulative confirmed cases and over 2,500,000 cumulative deaths as of March 2021 [7, 8].
Recent statistics indicate Africa is one of the least affected continents, with over 2.8 million cumulative confirmed cases and over 70,000 deaths, contributing 3% of the global cumulative COVID-19 related deaths [7], with South Africa the COVID-19 epicenter of Africa [9].

As part of the road map to eliminate diseases such as Malaria and COVID-19, countries need to ensure improved testing and follow-up on infection rates [4]. As Basu and Sahi [10] argue, early diagnosis and treatment reduce mortality and morbidity rates. Over the past decades, there has been an evolution in diagnostic testing techniques for Malaria. Though widely adopted, diagnostic systems such as light microscopy and Rapid Diagnostic Tests (RDTs) are dependent on biomarkers, and RDTs and microscopy are reported to have low sensitivity (the ability of a test to identify people with a disease, usually expressed as a proportion) and specificity (the ability of a test to correctly identify people without a disease; the proportion of negatives correctly identified) for Malaria [11]. Research also highlights further significant challenges in the diagnosis of Malaria: acting as parasite reservoirs [11], asymptomatic individuals fuel resurgence of the disease years after reported treatment.

The current diagnostic methods for COVID-19 are laboratory-based and rely on biomarkers. Nevertheless, challenges associated with these diagnostic techniques exist [12], including shortages of test kits and long waiting times for results, among others. Much as research continues to support the association of epidemiological disease profiles with environmental variability, the current COVID-19 diagnostic techniques do not account for factors such as demographic information in diagnostic procedures. On the other hand, diagnostic techniques that assess the patient's signs (any objective evidence of disease) and symptoms (subjective evidence of disease) are reported to have poor diagnostic properties, especially among asymptomatic patients [13].

Subjective diagnosis of disease from symptoms is a vital component of disease surveillance. Therefore, to aid clinical diagnosis, there is research and innovation in the use of self-learning approaches to cope with these changing patterns. Commonly known as Machine Learning (ML), self-learning has been used to predict previously complex conditions like cardiovascular diseases [14] and obstructive pulmonary disease [15], among others. These stochastic methodologies use a wide array of features to identify hidden patterns in the data to predict disease outcomes. For example, by incorporating low-level features such as texture into digitized human blood smears from slides, Khan et al. [16] used K-means clustering to identify Malaria parasites with 95% accuracy. In the same way, using computer vision, Molina et al. [17] used unsupervised ML to identify Malaria parasites; the approach yielded a sensitivity score of 100% with specificity at 90%. Similarly, in a 2018 systematic review, Poostchi et al. [18] suggested that a well-defined predictor should incorporate several factors, from the characteristics of the microscope, the type of staining and slide preparation to image analysis, in Malaria predictive approaches.
Disease profiles are constantly changing, and environmental variability is altering the accepted morphological appearances of species [19]. Therefore, for successful control and eventual elimination of Malaria and COVID-19, more sensitive detection methodologies that incorporate symptomatic information with laboratory markers are needed. As of the year 2020, Malaria and COVID-19 accounted for the highest volumes of surveillance data at the South Africa National Institute for Communicable Diseases (NICD), a national public health institute located in Johannesburg, South Africa. Nevertheless, the institution uses deterministic approaches to predict classes/labels for Notifiable Medical Conditions (NMC). This strategy implements pre-determined rule-sets over selected laboratory scores, an approach whose logic is brittle and which rapidly becomes complex as features are added. Therefore, in this research, we explore stochastic discriminative approaches as an alternative to deterministic methods for predicting disease labels for Malaria and COVID-19. We constructed classifiers to accurately discriminate Positive and Negative Malaria and COVID-19 cases from demographic, symptom and laboratory data. These classifiers could then be used for discriminative analysis to segregate probable cases of Malaria and COVID-19 from new data.

1.2 Statement of the Problem

Growing data dimensionality is a real threat to deterministic (rule-set) classifiers, where a rule-set is a set of human-crafted conditions that trigger a decision or choice; in computer science, such knowledge is presented and handled as logical rules implemented by an inference engine. Yet to enhance clinical diagnosis, it is necessary to look at a broad spectrum of data points/features, not only laboratory markers but also non-laboratory markers such as symptoms. As with most legacy systems, self-learning (the ability to recognize patterns, learn from data, and become more intelligent over time) is absent in rule-set classifiers. In light of changing rule-sets, continuous learning and domain expertise become mandatory to keep such classifiers relevant. To cope with changes in data structures, self-learning approaches in ML become necessary. We chose COVID-19 and Malaria for this research because of the high volume of data readily available to experiment with these ML models.

1.3 Research Question

The questions of interest are:

1. Are supervised ML techniques better than rule-set approaches in the classification of Malaria and COVID-19 at the NICD?
2. What can be said of the current deterministic approaches that categorize Malaria and COVID-19 in respect to current surveillance data profiles?

To answer these questions, in this research, we explore stochastic self-learning classifiers, using supervised ML prediction techniques to predict disease status from Malaria and COVID-19 surveillance data from the NICD. We also perform a comparative analysis of the current deterministic approach and how it performs against novel ML classification approaches. Therefore, the work of this project aims to:

1. Describe the current surveillance data structures and profiles for Malaria and COVID-19 in South Africa.
2. Identify optimal ML algorithms that can be utilized to classify and predict Malaria and COVID-19 clinical outcomes from the available data structures.
3. Evaluate the performance of selected ML algorithms against the conventional rule-set methods used to categorize disease status for Malaria and COVID-19.
1.4 Thesis Structure

The rest of this report is organized as follows: In Chapter 2, we explore general concepts of the Malaria and COVID-19 NMCs, from case identification and clinical manifestations to treatment. We also explore the current NMC surveillance approach at the NICD. In Chapter 3, we describe the various materials and methods used in this research, with empirical data (results) presented in Chapter 4. Lastly, a discussion of the research findings is presented in Chapter 5, along with conclusions from the research in Chapter 6.

2 Theoretical Background

In this section, we first look briefly at Malaria in the context of disease manifestation and management while highlighting ideal parameters to aid clinical diagnosis. Secondly, we give an overview of the NMC surveillance process as conducted by the NICD. Herein, we also cover specific key concepts and constructs that inform the direction of this research. Some of the questions addressed include: what approaches are available to address the objectives stated in Chapter 1, what classifiers are available for this task, and how optimal classifiers can be determined. To answer these questions, we begin by exploring classification approaches relating to this research and the alternatives available. The last section explains the performance metrics that are available to aid model selection.

2.1 Overview of NMC Surveillance in South Africa

Globally, Monitoring and Evaluation (M&E) programs are used to collect health-related data [20]. These data are then used to track progress towards targets and to assess the impact of current health interventions and the WHO goals of morbidity control and elimination [21, 22]. NMC surveillance is a vital process in providing the information necessary to timely and accurately detect public health threats. The National Department of Health (DoH) of South Africa defines NMC as diseases that are of public health importance [23] because of the risks they pose. As illustrated in Figure 2.1, this reporting follows an upward cascade starting at the Health-establishment level, then the Sub-District or District level, and then the national system. It is a legal obligation for all health practitioners to report diseases classified as NMC to the DoH.

At the NICD, NMC reporting timelines vary depending on severity. The National Guidelines for the Treatment of Malaria in South Africa 2019 require that all Category-1 conditions be reported within 24 hours of first diagnosis, irrespective of laboratory confirmation [24]; Category-2 conditions within 7 days of receipt of laboratory confirmation; Category-3 conditions within 7 days of diagnosis [23]; and Category-4 conditions up to one month after diagnosis.

Figure 2.1: NMC Reporting Cascade

2.2 Disease Manifestation and Management

2.2.1 Overview

In South Africa, Malaria is regarded as a Category-1 NMC and therefore must be reported within 24 hours of first diagnosis, irrespective of laboratory confirmation [23]. With Light Microscopy using Giemsa-stained thick/thin blood smears as the yardstick [25, 26], several diagnostic methods have been adopted to support these global efforts to reduce and eventually eliminate Malaria [27, 28].
However, not all standards are ideal, especially in Malaria-endemic areas, and affordable Point-of-Care diagnostics (Rapid Diagnostic Tests) have been reported to have differing sensitivity and specificity [29, 30, 31]. It is argued that in South Africa, malaria is mainly transmitted along the border areas, with parts of three of South Africa's nine provinces (Limpopo, Mpumalanga and KwaZulu-Natal) endemic for malaria [32]. Figure 2.2 illustrates the disease severity across South Africa.

Figure 2.2: SA Malaria Risk Map December 2018. Image credit: DoH SA

Diagnosis of COVID-19, a Category-1 NMC, is based on biomarkers related to the organisms that cause disease. The United States Centers for Disease Control and Prevention recommends two types of tests: a viral test that detects current infection and an antibody test that detects previous infection. The approved assays used for testing detect either COVID-19 nucleic acid or antigen in upper or lower respiratory specimens (oral or nasal swabs) to determine whether an individual has COVID-19 or not [33].

2.2.2 Case Identification

Almost all Malaria deaths are caused by Plasmodium falciparum [34], with pregnant women, older persons, children under 5 years and those with co-morbidities at greater risk. Symptomatically, uncomplicated Malaria is known to cause fevers and chills, headache and general body weakness in those infected by the parasite. Left unattended, the disease may rapidly progress into severe Malaria, with patients exhibiting one or more conditions including very low blood glucose levels, low haemoglobin (less than 50 g/L, i.e. 5 g/dL), pulmonary oedema, renal failure, breathing distress, relaxed blood pressure (less than 70 mmHg in adults and 50 mmHg in children), convulsions and sometimes multisystem failure, among others.

The most common symptoms at the onset of COVID-19 are fever, cough, and myalgia or fatigue, while the less common symptoms are sputum production, headache, hemoptysis and diarrhoea [35]. However, studies continue to indicate that many patients with COVID-19 either do not manifest any symptoms or register only mild symptoms of the disease, and these cases spread the virus to other non-infected persons [36]. These asymptomatic COVID-19 cases increase complexities in active surveillance, screening and classification, a factor impeding efficient prevention and control of the disease. Studies have indicated a relationship between the risk of infection and comorbidities: there is an increased risk of COVID-19 infection especially in persons with pre-existing conditions [37, 38], and it is reported that diseases such as hypertension, diabetes and respiratory disease are more prevalent among fatal cases [39].

2.2.3 Eradication Efforts

Eradication of Malaria requires a multi-disciplinary effort, from the active treatment of asymptomatic cases [40] to socio-economic improvement [41]. With the Malaria vaccine (RTS,S/AS01) in the trial phase [42], artesunate-based medications are still the WHO-recommended standard treatment for both uncomplicated and severe Malaria in humans. Without proper management, falciparum Malaria is known to persist in some individuals several years after they leave Malaria-endemic areas [43]. Therefore, any individual who has a fever and has been to a Malaria-endemic area is at risk. Galatas, Bassat, and Mayor [40] argue that persistent symptomless cases fuel transmission.
Therefore, to minimize missed diagnosis of sub-clinical Malaria, a high index of suspicion is required [44]. Incorporating patient demographic information with reported symptoms and laboratory markers is essential for more accurate results. For example, Luo et al. [45] showed that incorporating patient demographics and laboratory results provided a powerful discriminant for ferritin. In a 2017 study using 8 features from the Kenyan Malaria Indicator Survey data, Rajpurkar, Polamreddi, and Balakrishnan [46] proposed a deep learning agent to predict the likelihood of one testing positive for Malaria using individual demographic characteristics. In the same way, in a 2020 study by Lee, Choi, and Shin [47], six ML models were compared using patient clinical information to predict Malaria. In that study, by incorporating one's nationality as a demographic characteristic alongside recorded symptoms, the Random Forest yielded the best scores, with an accuracy of 90.3% (AUC = 73.2%).

In South Africa, COVID-19 eradication efforts so far are geared towards preventive approaches to stop the further spread of the virus. As of April 2021, these measures have been boosted with a vaccination roll-out starting with the most at-risk populations. As Huang et al. [35] assert, global efforts to control and eventually eliminate COVID-19 still lack early detection methods. These approaches should include improved methods for prediction and classification of the disease to reduce transmission and improve patient survival rates.

Different ML algorithms have been applied in the prediction and classification of COVID-19. For example, in a 2020 study, Hamed, Sobhy, and Nassar [48] employed a KNN-variant algorithm to determine COVID-19 disease classification using incomplete heterogeneous data. The experiments showed the KNN-variant algorithm outperformed both the modified KNN and the standard KNN on the accuracy, precision, recall and F1-score performance metrics. Moreover, in a similar study, Iwendi et al. [49] reported the Boosted RF algorithm as an optimal predictor for COVID-19 where data was imbalanced (i.e. the class distribution was unequal).

2.3 Classification

In this study, we theoretically define classification as relating to the possible outcome of events occurring in a finite space, i.e. belonging to a specific category. However, this concept may not be interpreted the same way as the widely adopted WHO International Classification of Disease (ICD), which is essentially a list of Causes-of-Death to inform mortality and morbidity statistics [50].

James et al. [51] define classification as predicting a qualitative (categorical) response for an observation. Given an instance, classification algorithms induce predictive rules based on features and patterns in the data to predict classes, with predicted labels assuming a minimum of two levels [52]. These classifiers employ statistical and computational models to segregate datasets into categories. As an example, an algorithm that distinguishes kidney functionality in patients as "Severe/Moderate/Abnormal/Normal" based on estimated glomerular filtration rate can be regarded as a quaternary classifier; the outcome must belong to exactly one level, i.e. severe, moderate, abnormal or normal. There exist far more complex classifiers, for example document-categorization algorithms that sift through thousands of topics and group them into themes.
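To make the quaternary example above concrete, the following is a minimal sketch of such a classifier. The eGFR cut-off values used here are purely illustrative assumptions, not clinical guidance and not values taken from this study.

```python
def classify_kidney_function(egfr: float) -> str:
    """Map an estimated glomerular filtration rate (mL/min/1.73 m^2)
    to exactly one of four levels; the cut-offs are illustrative only."""
    if egfr >= 90:
        return "Normal"
    elif egfr >= 60:
        return "Abnormal"
    elif egfr >= 30:
        return "Moderate"
    else:
        return "Severe"

print(classify_kidney_function(75.0))  # -> Abnormal
```

Whatever the thresholds, the defining property is that every input maps to exactly one of the finite set of levels.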
Although more than one class may be considered, for simplicity this research focuses on Binary Classification, i.e. presence or absence of disease. In the sections that follow, we explore both Rule-based and ML classifiers.

2.4 Inferential Classifiers: Rule-sets

Rule-set classifiers rely on a set of predetermined inferential rules to determine classes. These inference rules are set into the system to churn data into discrete prediction outcomes [53], i.e. absence or presence of disease. Theoretically, in rule-based methodologies there is no limit on the number of rules applied. However, as with any other strategy, these classifiers do not come without challenges. The approaches are characterized by inconsistencies, difficulty in maintaining business rules and long load times, among other drawbacks [54]. Because of this subjective nature, there is always a trade-off between complexity in decision logic and accuracy in prediction outcomes.

Rule-set classifiers adopt inductive logic programming where each rule consists of a prior condition, sometimes called an antecedent, and a consequent/resultant. These classifiers take the form

if LEFT then RIGHT (2.1)

The rule dictates that if the "LEFT" hand side of the rule is satisfied, it should imply the "RIGHT" hand side, which in this case is the class label we are predicting. In practice, rule-based classifiers take into account all the rules to determine their performance. To estimate rule quality, we use Rule Interestingness Measures (RIM) to distinguish between rules. This is an area still under research, with no standard notations available yet [55]. Moreover, Piatetsky-Shapiro [56] proposes three criteria every RIM should satisfy:

1. The measure should be zero if $N_{Both} = (N_{Left} \times N_{Right}) / N_{Total}$
2. The measure should increase monotonically with $N_{Both}$
3. The measure should decrease monotonically with each of $N_{Left}$ and $N_{Right}$

where
$N_{Left}$: count of instances matching LEFT
$N_{Right}$: count of instances matching RIGHT
$N_{Both}$: count of instances matching both LEFT and RIGHT
$N_{Total}$: total number of instances
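A minimal sketch of one classical measure that satisfies all three criteria is Piatetsky-Shapiro's own rule-interest (leverage) statistic. The function below is an illustration of the criteria, not a measure used in this study.

```python
def piatetsky_shapiro(n_both: int, n_left: int, n_right: int,
                      n_total: int) -> float:
    """Rule-interest (leverage): zero under independence
    (N_Both = N_Left * N_Right / N_Total), grows with N_Both, and
    shrinks as N_Left or N_Right grow -- the three RIM criteria."""
    return n_both - (n_left * n_right) / n_total

# A rule matched by 40 of 100 records, where LEFT matches 50 and RIGHT 60:
print(piatetsky_shapiro(40, 50, 60, 100))  # 10.0 -> positively interesting
```

A value above zero indicates the rule fires more often than chance alone would predict; a value at or below zero marks the rule as uninteresting.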
2.5 Machine Learning Classifiers

Because of their ability to adapt, learn and continuously improve, ML algorithms are increasingly being used to make predictions in critical contexts [57] where the main goal is to maximize generalization, i.e. the ability to classify new data [53] previously unexposed to the classifier. These algorithms can pass over data, learn from it and apply the newfound knowledge to make intelligent decisions. This is achieved by creating mathematical functions of differing complexities that relate input to desired output. On a broad scale, the algorithms are organized into a taxonomy based on the desired outcome [58]. In this section, we briefly describe the three general categories.

2.5.1 Unsupervised Learning

Sometimes we are not fully aware of what features (X) should inform modeling solutions to classify a target/outcome (y). Our goal is then to explore the data and discover interesting patterns and properties [59] in the data, as opposed to prediction. These learning methods are termed Unsupervised Learning. Techniques such as Principal Components Analysis (a tool used for data visualization or data pre-processing) and Clustering (a broad class of methods for discovering unknown subgroups in data) are typically used to provide labels (clusters) or values (rankings) [60] before supervised techniques are applied. Unsupervised learning is a largely subjective process and, for this reason, assessing performance from these approaches can be hard. As James et al. [51] argue, there is no universally adopted mechanism to validate results against an independent dataset.

2.5.2 Semi-supervised Learning

A variation of Unsupervised Learning, Semi-supervised Learning is sometimes the appropriate choice, especially when a dataset contains only a small portion of labeled data. Using ensemble methods (sets of classifiers whose individual decisions are combined in some way to classify new examples), the algorithms generate annotations for the unlabelled data in quantities large enough to appropriately train the models. In principle, the bootstrapping process employs a supervised learning approach to classify these unseen data. To evaluate these classifiers, it is worthwhile having genuinely annotated data for evaluation [60].

2.5.3 Supervised Learning

These learning algorithms are ideal for discrete outcomes, i.e. the underlying output variable can only assume one of two states, such as diseased or not-diseased (binary classification). Algorithms such as decision tree induction, SVM, KNN and RF [61], among others, provide mechanisms to learn from annotated data and make predictions on new data. All these algorithms exhibit unique strengths and depend largely on the data quality and the task at hand.

In supervised learning, we assume a functional relationship exists between input and output. Let $\{x, y\}$ be a set of attributes where $y$ is the class label of instance $x$; then the attributes for disease ($D$) classification will be a set of predictor variables together with the clinical diagnosis consisting of Positive cases ($D^+$) and disease Negative cases ($D^-$). In other words, the algorithm assumes the form $D^+ \cap D^- = \emptyset$, where the output has been labeled a priori [62], i.e. there is some knowledge of the data.

2.6 Estimating Classifier Performance

To measure how well a model performs, it has to be evaluated on specified metrics. This is done by subjecting the algorithm to data previously unused in the training process or by employing other proven schemes. A common approach is to split the data into chunks, with 80% for training and 20% for testing. This is done to determine how accurately our predicted classes match the known labels in the evaluation set [60].
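As a minimal sketch of the hold-out evaluation just described, the snippet below splits synthetic data 80/20, fits a classifier on the training portion only, and scores the held-out portion. The data and the choice of classifier are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rs = np.random.RandomState(0)
X = rs.normal(size=(1000, 8))               # synthetic feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary labels

# Hold out 20% of the rows; the classifier never sees them in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

The accuracy printed at the end measures how often the predicted classes match the known labels in the evaluation set.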
2.6.1 Optimization

However, as with most ML tasks, splitting data is hampered by the number of observations available to apportion for training, testing and evaluation. As a result, some of the data are likely to be used during both model training and testing. This situation is sometimes called contamination and is likely to result in invalid estimates.

On the other hand, not all features are important for training models, and sometimes less-important attributes may be used to fit classifiers [51]. This allows models to learn spurious patterns (noise) in the data, leading to high variance. This behavior is called model over-fitting and usually happens where a model performs well on training data but suboptimally on out-of-sample data. Conversely, poor performance during training may yield better results during out-of-sample testing; in this case, the model is said to underfit the data. Model misfit (under-fitting and over-fitting) is often a problem in predictive analytics and requires attention.

One way to address model misfit is to account for the unequal distribution of classes. In this approach, a classifier is fit with the distribution weights of the target specified as a hyper-parameter. An alternative is to resample the data to obtain equal representation in the target [63].

Another approach involves robust schemes like Cross-Validation (CV) with K-Fold and Grid Search strategies. CV provides a method for evaluating how well a fitted model generalizes to new data. With K-fold CV, the data (X, y) is randomly split into K disjoint subsets, where K is a positive integer greater than 2 (10 is usually appropriate). The classifier is then iteratively trained using every single bin as testing data, with the rest (K - 1 subsets) as training data, after which the average performance is determined. In the Grid Search technique, the set of all possible combinations of settings specified in a parameter grid is iteratively passed to a model using the CV strategy. After this iterative process, the settings that yielded the highest scores from the validation are returned for model specification and generalizability. Given that the technique can be computationally intensive, it is highly dependent on the performance metrics from the K-fold CV optimizer.

Notation: Assume a labeled dataset (X, y) with an input matrix X of dimension n × m and an output vector y of dimension n × 1. We fit a statistical model p which, given the i-th sample from X, can predict the i-th element in y. The goal is to fit p such that for a new input X_i we are still able to predict y_i.
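A minimal sketch of the K-fold CV and Grid Search combination follows. The grid values, data and model are illustrative assumptions; the study's actual grids are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

rs = np.random.RandomState(0)
X = rs.normal(size=(500, 6))                 # synthetic features
y = (X[:, 0] - X[:, 2] > 0).astype(int)      # synthetic labels

# Every combination in the grid is scored with stratified 5-fold CV;
# the best-scoring combination is then refit on all of the data.
param_grid = {"n_estimators": [100, 300], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=StratifiedKFold(n_splits=5), scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cross-validated score, not the training score, drives the selection, which is what guards the returned settings against over-fitting.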
2.6.2 Model Selection

For this research, we used two common techniques in predictive analytics: the Confusion Matrix, also called the Error Table (a special kind of contingency table with two rows and two columns that reports the numbers of false positives, false negatives, true positives and true negatives; see https://en.wikipedia.org/wiki/Confusion_matrix), and the AUC. In addition to the confusion matrices and the AUC, we also visually represented classifier performance using the Receiver Operator Characteristic and Precision-Recall (PR) curves. These are further explained in Section 3.9.

3 Materials and Methods

In this chapter, we describe the study population from which data were drawn, along with the ethical considerations guiding the research. We also systematically address the procedures taken in adopting specific supervised ML algorithms to address the objectives stated in Chapter 1, from model specification to evaluation criteria.

3.1 Approach

This is a non-population-based retrospective study that analyses secondary Malaria data collected over a 5-year period (January 2015 to December 2019) and COVID-19 data collected over one year (March 2020 to March 2021) by the National Health Laboratory Services (NHLS), South Africa. The study utilizes pre-processed data generated or accumulated through the suspected-case notification systems informing NMC surveillance at point-of-care/health facilities, as well as laboratory results originating from sample testing.

3.2 Study Site

This research was conducted at the NICD, a division of the National Health Laboratory Services (the largest diagnostic pathology service in South Africa) located at Sandringham, Johannesburg, South Africa. Located on the southern tip of the African continent around 29°00'S and 24°00'E, South Africa experiences a varied climate throughout the year, with the colder months between June and August and the warmer months from December to February (Geography and climate | South African Government; retrieved October 19, 2021, from https://www.gov.za/about-sa/geography-and-climate). Covering a total area of 1,219,602 km², the country's landscape ranges from the lowvelds and bushvelds of Limpopo and Mpumalanga, the highvelds of Gauteng and the Free State, the Eastern Highlands of KwaZulu-Natal and parts of the Eastern Cape, and the great Karoo of the Western Cape to the bushland, Namaqualand and Griqualand of the Western, North West and Northern Cape [64].

3.3 Study Population and Data Sources

The data comprised clinical diagnoses, recorded vital signs and symptoms, reported risk factors, laboratory results and demographic information from suspected cases in both migrant and static populations as they presented at point-of-care facilities. Owing to topographical and socio-economic differences, we selected data from all districts and sub-districts, accumulated from all nine provinces and stored in the Surveillance Data Warehouse at the NICD.

Malaria:
The Malaria data used for this research was made up of three Comma Separated Values (CSV) files. Described below, these files were linked together using episode_no as the key field.

• MalariaDemographics - consisting of clinical notification data (cases identified through the NMC app) with 222,805 unique observations from 10 variables
• MalariaResults - a repeated-measures file with 766,074 observations from laboratory tests
• MalariaExtra1 - with 40,094 observations and 20 attributes (excluding the key field), containing observed symptoms along with treatment information, records of travel (including dates) and contact history

COVID-19:
This was a unit dataset (COVID-19.csv) with 35,202 observations from 25 variables, excluding the episode number (key field). This data file contained patient demographic information, recorded signs and symptoms, and reported comorbidities.

3.4 Computational Environment

While there exist alternative Integrated Development Environments that yield the same results, the pros and cons associated with them are a subjective topic. We chose to set up our computational environment using both licensed commercial and open-sourced BSD-licensed tools, hosted on the Microsoft Windows Operating System. Table 3.1 below is a full listing of the resources used.

Table 3.1: Computational Environment
PANDAS 0.24.2 - High-performance data structures and analysis tools
SKLEARN 0.24.1 - Tools for predictive data analysis; includes class libraries for ML models (https://scikit-learn.org/stable)
SEABORN 0.9.0 - Python data visualization library based on the MATPLOTLIB graphics library
NUMPY 1.16.2 - Numerical library to facilitate the data management process
PYTHON 3.7.4 - Interpreted, object-oriented programming language, based on the Anaconda Integrated Development Environment (IDE) (www.python.org)
STATA 15.1 - Statistical package for analysis (IC Edition), with annual updates (www.stata.com)
COMPUTER - Intel® Core™ i5 2.3GHz Processor, 16Gb memory, 64-bit Microsoft Windows 10 Pro Operating System

3.5 Conceptual Framework

We conformed to the agile software development methodology and adapted the generally accepted framework for supervised ML illustrated in Figure 3.1 below. In focus, we tackled aspects of the iterative process that drives the development of ML models.
Key considerations included the volume and nature of the data used, distributions in attributes, industry standards and approaches, and the assumptions behind decisions taken, among others.

Figure 3.1: A conceptual framework for Supervised Machine Learning; adapted from various internet sources

3.6 Preprocessing

During the data-extraction phase, the key vectors that will define the dataset and tune the algorithm are deduced, cleaned and standardized. Because the Malaria data were received as three separate files uniquely identified by episode numbers, we concatenated them using the Inner Join strategy (a join operation in relational algebra, combining entities in a relational environment) to obtain a single entity. An insight into the pre-processing is summarised below.

3.6.1 Curation

In an ideal world, data is clean and ready for analysis. However, this is not always the case: real-world data are messy. Adopting normalization approaches as employed in relational databases, Wickham [65] proposes the tidy-data model, where every variable forms a column, each observation forms a row and each type of observational unit forms an entity. There are interesting proposals in the literature regarding data tidying; however, the proposed methodologies may not be applicable in all situations, as datasets usually differ.

In this research, data were received as flat tables, with features clearly defined by columns and rows denoting observations. Checks for missing values and transcription errors were done and, where possible, corrected from referenced features. String values were encoded by mapping feature schemas (categorical classifications of a variable) and, where possible, data were re-coded with appropriate data types enforced. Correlation matrices were used to identify probable patterns and relationships between attributes.

To detect peculiarities (out-of-range values) in the data, exploratory data analysis was done using distribution plots, applying the notation below. For a feature $K$ that follows a skewed distribution, the thresholds $T_K$ were set at $1.5 \times IQR$ beyond the quartiles, where IQR (the Interquartile Range, also called the Midspread) is a statistical measure of dispersion: a value is said to be peculiar if it falls below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$. Consequently, these instances were dropped from the dataset. An example from the Malaria data is shown in Figure 3.2 below, where we noted 1,544 probable erroneously recorded ages, i.e. above 140 years. These implausible data did not fit within normal limits and were consequently not considered for analysis.

Figure 3.2: Kernel Density Estimate plot for age at testing (years)

3.6.2 Data Definition

We filtered and subjected only records with information across the two datasets to the data cleansing and feature engineering process described in subsection 3.6.3. An observation was considered a candidate for use if and only if a laboratory test record was successfully linked in the demographic-information dataset via an episode number. From this iterative process, we deduced feature vectors to inform approaches at particular steps in ML. As Jutte, Roos, and Brownell [66] assert, the process requires extensive resources to assemble key indicators. In our research, two data files were used: Malaria and COVID-19. A sketch of the joining and outlier-filtering steps is given below.
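The following is a minimal sketch of the inner join on episode numbers and the IQR-based age filter, assuming the three Malaria extracts named in Section 3.3 are available on disk as CSV files under those names; the file paths are an assumption, and the column names follow Table 3.2.

```python
import pandas as pd

# Hypothetical file paths; the three extracts are described in Section 3.3.
demo = pd.read_csv("MalariaDemographics.csv")
labs = pd.read_csv("MalariaResults.csv")
extra = pd.read_csv("MalariaExtra1.csv")

# Inner join on the episode number: keep only episodes present in all files.
df = (demo.merge(labs, on="episode_no", how="inner")
          .merge(extra, on="episode_no", how="inner"))

# Flag peculiar values with fences 1.5 * IQR beyond the quartiles,
# e.g. implausible ages such as the >140-year records noted above.
q1, q3 = df["age_tested_years"].quantile([0.25, 0.75])
iqr = q3 - q1
plausible = df["age_tested_years"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[plausible]
```

The inner join enforces the candidacy rule stated above: an observation survives only if its episode number links a laboratory record to a demographic record.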
Malaria:
This dataset consisted of laboratory markers from laboratory measures as well as demographic attributes informed by case-notification data. These data included test methods, clinical symptoms, specimen measures (parasite and cell counts), test dates, and triage information such as episode number and admission status, among others. The aggregated dataset contained 37 features. Of the 216,408 observations, more than half the predictor variables had over 80 percent missing information. These predictors could neither be imputed nor used for model specification and were consequently dropped from the final analytical dataset, as shown in Figure 3.3 below. A missing-values report is shown in Table 7.1, annexed in Appendix 7. Table 3.2 below is a high-level description of the Malaria analytical data used, with a descriptive summary presented in Section 4.1.

Figure 3.3: Preprocessing flow - Malaria dataset.

Table 3.2: Malaria Dataset definition
1 Target (String) - A laboratory-confirmed malaria test result
2 Gender (String) - Participant recorded gender
3 in_patient (String) - Participant hospitalization status
4 age_tested_years (Integer) - Participant recorded age (in years) at time of malaria test
5 red_cell_count (Float) - Red Blood Cell (RBC) count from laboratory
6 weather (String) - Calendar season deduced from the South Africa meteorological calendar
7 district_name (String) - District where test was done
8 province (String) - Province reporting malaria test result

COVID-19:
On the contrary, in deducing the COVID-19 analytical dataset, we did not drop missing data. Using regular-expression text-processing techniques, we inferred 14 features (13 binary and 1 continuous) from 24 candidate attributes in the raw data. A missing-value report is annexed in Table 7.2 of Appendix 7. To inform probable symptoms, these features were categorized to fall into one of the following groups: fever/chills, cough, sore throat, shortness of breath, diarrhoea, muscle/joint pains, malaise, fatigue/lethargy, influenza, and vomiting/nausea.

Figure 3.4: Preprocessing flow - COVID-19 dataset.

Because of inconsistencies in how this information was captured, a pooled indicator ComorbidityYN was created to indicate 'Yes' (1) if any comorbidity was registered or 'No' (0) if not.
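The symptom-inference step can be sketched as follows. The free-text column name and the regular-expression patterns are illustrative guesses; only the symptom groups themselves come from the description above.

```python
import re
import pandas as pd

# Symptom groups from Section 3.6.2; the patterns are illustrative guesses.
SYMPTOM_PATTERNS = {
    "Fever/Chills": r"fever|chills|pyrexia",
    "Cough": r"cough",
    "Sore Throat": r"sore\s*throat",
    "Shortness of Breath": r"short(?:ness)?\s*of\s*breath|dyspn",
}

def infer_symptoms(notes: pd.Series) -> pd.DataFrame:
    """Map free-text clinical statements onto binary symptom indicators."""
    out = pd.DataFrame(index=notes.index)
    for feature, pattern in SYMPTOM_PATTERNS.items():
        out[feature] = (notes.fillna("")
                             .str.contains(pattern, flags=re.IGNORECASE,
                                           regex=True)
                             .astype(int))
    return out

notes = pd.Series(["Fever and dry cough", "SORE THROAT", None])
print(infer_symptoms(notes))
```

Case-insensitive matching and the explicit handling of missing text mirror the inconsistencies in how this information was captured.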
Below is a description of the COVID-19 dataset (Table 3.3), with descriptive summary statistics provided in Section 4.1.

Table 3.3: COVID-19 Dataset definition.
1 Target (String) - A confirmed PCR COVID-19 test result
2 Age (Integer) - Participant recorded age (in years) at time of COVID-19 PCR test
3 Gender (String) - Participant recorded gender
4 Fever/Chills (Boolean) - Deduced symptom from statements inferring absence/presence of fever
5 Cough (Boolean) - Deduced symptom from statements inferring absence/presence of cough
6 Sore Throat (Boolean) - Deduced symptom from statements inferring absence/presence of sore throat
7 Shortness of Breath (Boolean) - Deduced symptom from statements inferring absence/presence of shortness of breath or difficulty in breathing
8 Diarrhoea (Boolean) - Deduced symptom from statements inferring absence/presence of diarrhoea
9 Joint/Muscle Pains (Boolean) - Deduced symptom from statements inferring absence/presence of joint and muscle pains
10 Malaise (Boolean) - Deduced symptom from statements inferring absence/presence of malaise
11 Fatigue/Lethargy (Boolean) - Deduced symptom from statements inferring absence/presence of fatigue or lethargy
12 Influenza (Boolean) - Deduced symptom from statements inferring absence/presence of influenza, common cold and sneezes
13 Vomiting/Nausea (Boolean) - Deduced symptom from statements inferring absence/presence of vomiting or nausea
14 ComorbidityYN (Boolean) - Deduced from statements indicating absence/presence of any underlying comorbidity

3.6.3 Feature Selection and Engineering

In predictive analysis, not all features in a dataset are important for classification and prediction, yet there is no one-size-fits-all method for this task. One approach is to use unsupervised statistical techniques like Principal Component Analysis (PCA). In this research, we employed domain knowledge, a manual dimension-reduction technique, to carefully select principal features for the task. This search problem aimed at minimizing collinearity and model misfit by removing correlated features and noise.

Ensemble Selection, proposed by Niculescu-Mizil et al. [67], was performed on the categorical attributes Sub-District and Province, and we limited OneHotEncoding to the top-10 levels of these categorical attributes. By mapping categorical data onto a binary scale, the OneHotEncoding process converts each categorical value into distinct attributes consisting of '1' or '0', denoting the presence or absence of a level. The demographic variables Hospitalization Status ('Y', 'N') and Gender ('M', 'F') were label-encoded to 1 and 0, denoting a positive and negative response respectively. For both datasets, the target (dependent/identifier) variable was coded as binary, with '1' and '0' denoting an observed positive and negative clinical outcome respectively.

Computational and time complexities were minimized by standardizing all features on a continuum to fit between zero and one using the MinMaxScaler implementation in the SKLEARN library. This is denoted by:

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$ (3.1)

where
$X_{scaled}$: the new transformed vector
$X$: the vector instance to transform
$X_{min}$: the minimum value of X in the vector domain
$X_{max}$: the maximum value of X in the vector domain

To minimize data leakage, the training and out-of-sample datasets were engineered independently through SKLEARN fit-transform methods. Because of skewness in the distributions, missing values in categorical features were imputed with the most frequent observations. Because of the adequate trade-off between precision of imputation and preserving the structure of the data [68], features on a continuum were imputed using the Nearest Neighbor strategy. In the COVID-19 analytical dataset, we probed for possible patterns in the missing data but did not find any; we summarily concluded that the data were missing completely at random. We therefore imputed missing values for age using the Nearest Neighbor imputation strategy and for gender using the most-frequent strategy [69].
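The following is a minimal sketch of the leakage-safe fit-transform pattern described above: imputers and scalers are fit on the training portion only and then applied to the out-of-sample portion. The tiny DataFrames and their values are made up for illustration; the column names follow Table 3.2.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# "train" and "test" stand in for the 75/25 split of Section 3.7.1.
num_cols = ["age_tested_years", "red_cell_count"]
train = pd.DataFrame({"age_tested_years": [34.0, 7.0, None, 61.0],
                      "red_cell_count": [4.8, 4.1, 4.7, 5.0],
                      "province": ["Limpopo", "Gauteng", "Limpopo", None]})
test = pd.DataFrame({"age_tested_years": [25.0, None],
                     "red_cell_count": [4.6, 4.9],
                     "province": ["Mpumalanga", "Limpopo"]})

# Continuous features: Nearest Neighbor imputation, fit on training data
# only so that nothing leaks from the out-of-sample split.
imputer = KNNImputer(n_neighbors=2)
train_num = imputer.fit_transform(train[num_cols])
test_num = imputer.transform(test[num_cols])

# Continuous features are then scaled onto [0, 1] per Equation 3.1.
scaler = MinMaxScaler()
train_num = scaler.fit_transform(train_num)
test_num = scaler.transform(test_num)

# Categorical feature: most-frequent imputation, then one-hot encoding.
mode = train["province"].mode()[0]
train_prov = pd.get_dummies(train["province"].fillna(mode),
                            prefix="province")
```

Note that transform (never fit_transform) is called on the test split, which is exactly what keeps the two datasets engineered independently.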
After the data preparation phase, the classification models and their unary derivatives were subjected to an iterative split-train-predict process to obtain generalized scores. The process is briefly explained below.

3.7.1 Splitting

For this research, generally accepted ML data-partitioning schemes were applied. The datasets were divided into two portions, stratified on the Target variable (clinical outcome), with one quarter held out as out-of-sample data and the remaining three quarters used for model specification.

3.7.2 Classification Strategies

In the classification and prediction of COVID-19 and Malaria clinical outcomes, we used three strategies. Firstly, we under-sampled the majority class to obtain equal representation of positive and negative clinical outcomes. Secondly, we kept the data as-is but accounted for the distribution imbalance when classifying and predicting clinical outcomes. The third, a novelty-detection technique, involved treating the minority category as outliers and then predicting labels presumed to belong to this class using out-of-sample data.

3.7.3 Hyper-parameter Selection

To define a generalized model that could be deployed on out-of-sample data, we first defined a set of candidate parameters for each classifier. Using a stratified parameter grid-search approach, the models were repeatedly fit and refactored over all possible parameter combinations under a 5-fold cross-validation scheme. The parameter combination yielding the best F1-score was identified and subsequently re-fit to define a generalized model. This approach was repeated for model specification on under-sampled (balanced) data and for models accounting for distribution weights.
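The splitting, under-sampling, and grid-search steps can be sketched as follows; a minimal illustration assuming scikit-learn, where the feature matrix X, target y, and parameter grid are hypothetical placeholders rather than the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Hypothetical feature matrix and imbalanced binary clinical outcome.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = (rng.random(1000) < 0.1).astype(int)

# 3.7.1: stratified 75/25 split on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 3.7.2 (first strategy): under-sample the majority class to equal balance.
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_bal, y_bal = X_train[idx], y_train[idx]

# 3.7.3: grid search over candidate parameters, 5-fold CV, F1 as criterion;
# GridSearchCV re-fits the best combination on the full balanced set.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_bal, y_bal)
best_model = grid.best_estimator_
```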
3.7.4 Support Vector Machines

SVM models are trained to distinguish and segregate all instances of one class from the rest. Using the concept of margins (the distance between a hyperplane and the closest data points), SVM aims to find the optimal hyperplane that best separates the data groups [53]. In an ideal scenario, the data are separable, i.e. positive cases can be distinguished from negative cases; a separating hyperplane can then be drawn by identifying the support vectors (the data points lying on the lines that define the margins). SVM predicts labels by finding the parameter function that maximizes the margin. The classifier naturally avoids over-fitting and bias by choosing the least complex function that yields minimal training error, a technique called regularization. Figure 3.5 gives a high-level illustration of the SVM classification technique.

Notation: Assume a set of N training examples, each belonging to one of two classes, say T and F, taking on labels +1 and -1 respectively, with each data point x_i having K attributes. The training data then take the form (x_i, y_i), where i = 1, ..., N, y_i ∈ {+1, -1} and x_i ∈ R^K, as illustrated in Figure 3.5. This implies y_i = +1 if x_i ∈ T and y_i = -1 if x_i ∈ F. For linearly separable data, the hyperplane assumes the function

\[ w^T x + b = 0 \tag{3.2} \]

where
w = the normal vector perpendicular to the plane
b = the bias, determining the plane's location relative to the origin.

Figure 3.5: SVM classifier - a case of linearly separable data

SVM then searches for the separating hyperplane that maximizes 1/||w||. New data points x can subsequently be classified using the decision rule

\[ f(x) = \mathrm{sign}(w^T x + b) \tag{3.3} \]

When data points fall on the wrong side of the hyperplane, SVM introduces a slack variable ε_i into the constraints and assigns a penalty to such points, relaxing the constraint to

\[ \forall i: \; y_i(w^T x_i + b) - 1 + \varepsilon_i \geq 0 \tag{3.4} \]

However, there are instances when the data can only be separated by a curved decision boundary, i.e. not linearly. In such cases, SVM assumes a soft margin that tolerates wrongly classified points while introducing a penalty C for this misclassification, retaining the linear separation technique. The SVM then solves

\[ \min_{w,\,b,\,\varepsilon} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \varepsilon_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \varepsilon_i, \;\; \varepsilon_i \geq 0 \tag{3.5} \]

Being a classification task, we adopted the Support Vector Classifier (SVC), a variation of the SVM, to classify and predict Malaria and COVID-19 clinical outcomes. Furthermore, because of the time complexity of the algorithm, the kernel hyper-parameter was set to 'linear'.

3.7.5 The k-Nearest Neighbor Method

This is a non-parametric classification technique (i.e. one that makes statistical inferences without regard to any underlying distribution) that relies on readily available data to predict classes for new data. Novel labels are predicted from properties shared with nearby data points; that is, an object is classified by the popular vote of its K neighbors. KNN depends on the distance function used to measure similarity [57] between data instances, i.e. the proximity of the K data points. With K = 1, each new data point is classified according to the properties of its single nearest neighbor, whereas overly large values of K introduce misclassification errors (false positives and false negatives); for example, a Malaria-positive case may be labeled negative if its neighborhood contains a majority of negative cases. The optimal K is therefore the one that minimizes the classification error. As illustrated in Figure 3.6, instance 'N' would be classified by the defined characteristics of either its 3 or its 7 nearest neighbors.

Figure 3.6: KNN classifier

With KNN, proximity is estimated by the Euclidean distance function

\[ D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{3.6} \]

where x = (x_1, ..., x_m), y = (y_1, ..., y_m), and m is the number of attributes of the two points x and y.
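As a minimal sketch of the two classifiers described above, assuming scikit-learn (the data arrays here are hypothetical stand-ins for the scaled analytical features):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical scaled training data and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 5)), rng.integers(0, 2, 200)
X_new = rng.random((3, 5))

# Linear-kernel SVC, as used in this study to limit time complexity.
svc = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# KNN: each new point is labeled by the majority vote of its k neighbors
# under Euclidean distance (Equation 3.6, scikit-learn's default metric).
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print(svc.predict(X_new), knn.predict(X_new))
```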
3.7.6 Decision Tree Learning: Random Forests

Decision trees classify data into discrete classes using a set of rules. These rules iteratively split the data on key attributes, i.e. the characteristics that best separate the data, until further splits are no longer informative. A decision tree cascades its rules, each of which may either invoke another rule or lead to a decision. Recursive algorithms such as Iterative Dichotomizer 3 (ID3) are used to construct decision trees. In the analogy of a forest, numerous decision trees collectively form a Random Forest. During learning, at each node the algorithm selects the attribute whose information gain best discriminates the labels; this information is then passed to the child nodes in cascade until a decision is reached. As a form of dimensionality reduction, information gain can also be used for feature selection, with each candidate feature evaluated in the context of the target. Figure 3.7 illustrates the RF classifier.

Notation: Assume a dataset S with attribute A. For a value v of A, let S_v ⊂ S denote the subset of S for which A = v, and let Values(A) be the set of all possible values of A. Information gain (a measure of the change in entropy achieved by a split) can then be expressed as

\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{3.7} \]

where Entropy is the measure of uncertainty of a random variable.

Figure 3.7: Decision Tree classifier branch

3.8 Novelty Detection Approaches

The ML classification models discussed so far share one assumption: that the outcome classes are, to some degree, equally represented, i.e. the data are balanced. This is not always the case. Epidemiological studies have shown that certain morbidities occur less frequently than others, and the surveillance and/or clinical datasets tracking such conditions are far more likely to be biased with respect to these less frequent occurrences. In ML, scarcity of a particular class label in a prediction dataset is what defines anomalies [71]. Anomalies have two distinct quantitative properties: they consist of fewer instances in the dataset, and they have peculiar characteristics (data values) compared with the majority (normal) instances. In principle, novelty detection focuses on identifying abnormal patterns within large amounts of normal data [72]. As Sun, Wong, and Kamel [73] argue, classification rules that predict small classes tend to be rare or undiscovered; as a result, out-of-sample data belonging to small classes are more likely to be misclassified than data belonging to the majority class. Under those circumstances, modeling deviations from the 'normal' provides an alternative whereby the minority class is treated as anomalous, a technique sometimes called novelty detection or unary/one-class classification. Whereas kernel optimization methods, statistical approaches, and Neural Networks [74], among other strategies, are available for anomaly detection, in this research we focused on two: iForest and OCSVM.

3.8.1 Isolation Forest

Proposed by Liu, Ting, and Zhou [75], the iForest algorithm recursively and randomly partitions instances until all instances are completely isolated. The approach uses a binary search tree algorithm to construct isolation trees (iTrees) from randomly selected attributes, which collectively form an iForest. Assume X = {x_1, ..., x_n} is a sample of the data with n training examples; an iTree is constructed by recursively splitting X on a randomly selected attribute q with split value p until (a) the tree reaches a height limit, (b) |X| = 1, or (c) all data in X have the same values. This approach creates shorter paths for anomalous points and is independent of distance or density measures.

3.8.2 One-Class SVM

In OCSVM, data presumed to originate from the normal class are used to train the support vector model, after which the model is tested on contaminated data (data in which the wholesome containment of the majority class is altered by the presence of a minority class) to ascertain performance metrics for the segregation. Although the SVM was originally developed for two-class classification tasks, extensions and enhancements such as Support Vector Data Description [76] and Local-Density OCSVM [77] have been proposed, with empirical results suggesting better performance than the original OCSVM.
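A minimal sketch of the two novelty-detection approaches, assuming scikit-learn, training only on the presumed-normal (majority) class; the arrays below are hypothetical rather than the study's surveillance data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 4))  # presumed-normal majority class
X_contaminated = np.vstack([rng.normal(0, 1, size=(90, 4)),   # mostly normal
                            rng.normal(4, 1, size=(10, 4))])  # plus outliers

# iForest: random recursive partitioning; anomalies yield shorter paths.
iforest = IsolationForest(random_state=0).fit(X_normal)

# OCSVM: learns a boundary around the normal class; nu bounds the fraction
# of training points that may be treated as outliers.
ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(X_normal)

# Both predict +1 for inliers ("normal") and -1 for outliers (the minority
# class modeled as anomalies in this study).
print(iforest.predict(X_contaminated))
print(ocsvm.predict(X_contaminated))
```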
3.9 Learning Criteria

3.9.1 Contingency Table Metrics

To fully understand model evaluation with supervised binary classifiers, we contextualize all possible results in the 2x2 table illustrated in Figure 3.8 below. Sometimes referred to as an error matrix or confusion matrix, it yields statistics that describe classifier performance across two possible cases: true classifications and misclassifications. We enumerated these as TP: True Positives - correctly predicted positive labels; TN: True Negatives - correctly predicted negative labels; FP: False Positives - negative observations incorrectly predicted as positive (Type-I error); and FN: False Negatives - positive observations incorrectly predicted as negative (Type-II error).

Figure 3.8: Error Matrix

Firstly we examine accuracy, an estimate of how well an algorithm discriminates unseen instances. Often expressed as a percentage, it is computed by dividing the count of correctly classified instances by the total number of predictions. Although accuracy was used to estimate model performance on out-of-sample data, the metric is prone to distribution bias since it depends on the relative class balance of the outcome. A more robust alternative is the Matthews Correlation Coefficient (MCC), which takes into account all four quantities of the confusion matrix [78, 79].

To quantify the predictive capacity of a classifier, we estimate precision: the fraction of predicted positive cases that are truly positive, computed by dividing the number of correctly predicted positive outcomes by the total number of positive predictions. This is sometimes referred to as the Positive Predictive Value (PPV).

Moreover, in clinical diagnosis it is far more tolerable to commit Type-I errors than Type-II errors: False-Negative results are far more dangerous than False Positives. We therefore adopt a measure that rewards identifying all positive instances [60]: sensitivity/recall, sometimes referred to as the True Positive Rate (a measure of a classifier's completeness), the proportion of relevant/positive results correctly classified by the algorithm. High recall values indicate low Type-II error, i.e. low FN counts.

Better classifiers should therefore have both high precision and high recall. To ease interpretation, the F1-score, the weighted harmonic mean of the algorithm's precision and recall, was used to estimate classification performance. An alternative performance measure used is Specificity/True Negative Rate (TNR): the proportion of negative observations correctly predicted as negative out of all negative observations. However, as with any predictive model, wrong predictions (misclassifications) are a reality, and we need a measure to quantify them. The misclassification rate, the proportion of falsely classified observations out of all classifications, quantifies this error. Let ŷ_i be the prediction for data point i with label y_i; then the error rate is defined as

\[ \mathrm{misc}_n = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y}_i) \tag{3.8} \]

Table 3.4 below summarizes the performance metrics used in this research.

Table 3.4: Evaluation measures for the Confusion Matrix

Metric              Expression
Accuracy            (TP + TN) / (TP + FP + TN + FN)
MCC                 ((TP * TN) - (FP * FN)) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Precision/PPV       TP / (TP + FP)
Sensitivity/Recall  TP / (TP + FN)
F1-Score            2 * (Precision * Recall) / (Precision + Recall)
Specificity         TN / (FP + TN)
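These metrics are available directly in scikit-learn; a minimal sketch with hypothetical label arrays in place of the study's predictions:

```python
from sklearn.metrics import (confusion_matrix, matthews_corrcoef,
                             precision_score, recall_score, f1_score)

# Hypothetical true labels and classifier predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (fp + tn)           # TNR, per Table 3.4
error_rate = (fp + fn) / len(y_true)   # Equation 3.8

print(f"MCC={matthews_corrcoef(y_true, y_pred):.3f}",
      f"precision={precision_score(y_true, y_pred):.3f}",
      f"recall={recall_score(y_true, y_pred):.3f}",
      f"F1={f1_score(y_true, y_pred):.3f}",
      f"specificity={specificity:.3f}",
      f"error={error_rate:.3f}")
```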
3.9.2 Area Under Curve (AUC)

In addition to the confusion matrix, we need a way to visualize, organize, and select classifiers by comparing their performance against classifiers with no skill, which we define as models that predict by chance, i.e. whose predictions are not informed by any prior patterns in the data. With a balanced binary outcome, we obtain post-estimation statistics using Receiver Operating Characteristic (ROC) curves: graphical representations of the trade-off between false-negative and false-positive rates at every possible cut-off, useful for visualizing recall/sensitivity against specificity. Better classifiers present curves closer to the top-left corner of the ROC space. This visual interpretation is, however, subjective, and quantifying it provides a more meaningful rationale: the AUC, ranging from 0 to 1, quantifies the discrimination capacity of the model, with an AUC of 0.5 suggesting no discrimination.

Under class imbalance (a difference in the numbers of positive and negative instances), as with novelty detection, ROC curves no longer give reliable estimates of model performance. Precision-Recall (PR) curves, analogous to ROC curves, are then used to estimate model performance; as illustrated in Figure 4.10, the goal is to be in the upper-right-hand corner [80, 81]. PR curves summarize the trade-off between the TPR and the PPV for a predictive model over different probability thresholds [82].

3.10 Ethics

The ethical and methodological aspects of this research were approved by the University of Witwatersrand Human Research Ethics Committee (M200509) and the NHLS Academic Affairs and Research Office (28 September, 2020). No human subjects were involved. Surveillance data from the NMC data warehouse were extracted, de-identified, and made available in a compressed and encrypted WinRAR (a shareware file archiver and data-compression program, https://www.win-rar.com) format. All computational experiments were conducted on a BitLocker-encrypted personal laptop accessible only to the researcher.

4 Results

In this chapter, we present the results of the COVID-19 and Malaria classification models built from the NMC surveillance data. We start by describing the population, then report findings from preprocessing, and lastly present prediction results according to the classification strategies defined in Section 3.7. We also briefly report on patterns we found interesting in the procedures taken.

4.1 Descriptive Statistics

4.1.1 Malaria Analytical Data

This section presents descriptive statistics for the Malaria out-of-sample analytical dataset. Firstly, we examine the distribution of clinical outcomes before and after data preprocessing.

Figure 4.1: Distribution of Malaria clinical outcome - (a) before preprocessing; (b) after preprocessing

In the raw (unprocessed) Malaria dataset, the positive-to-negative clinical diagnosis ratio was 100:268 across 216,408 observations. As illustrated in Figure 4.1 above, preprocessing distorted this distribution: of the 40,557 observations considered for analysis, clinically diagnosed Malaria-positive cases accounted for 94.1% (n=38,162) and negatives for 5.9% (n=2,395), a positive-to-negative ratio of 159:10. In the Malaria analytical dataset, we considered only complete cases across 7 features, of which 58.35% were male and the rest female. It was also observed that population age at test followed a bimodal distribution: up to about 20 years of age, there were more Malaria-positive than negative cases.
However, as illustrated in Figure 4.2, between the ages of 20 and 45 years there were more Malaria-negative than positive cases.

Figure 4.2: Age distribution at Test Date

We therefore adopted an approach similar to WHO reporting standards by creating age categories according to the WHO [2] Malaria risk-population age groups. The 15-49-year-old population contributed the largest number of observations (approximately 58%, n=23,512), with the older population, i.e. those over seventy years of age, accounting for the least (1%, n=591). Looking at the testing periods, 17.6% (n=7,145) of tests were done in January, with the fewest done in July. We then investigated the relationship between age at sample test and the calendar month in which the test was done. From Figure 4.3, we observe that those up to the age of 60 were most likely to have a Malaria test between April and August, while there was no clear period in which the older population (above 60 years of age) was likely to have a Malaria test.

Figure 4.3: Average monthly tests by age (years)

Months were extracted from the sample collection dates and categorized into four bands according to the South African weather seasons. From Figure 4.4, we observed a fairly even number of samples tested during the autumn months (March, April, May; n=14,443) and the summer months (December, January, February; n=14,440). Descriptive summaries are presented in Table 4.1 below.

Figure 4.4: Malaria tests done per Season Calendar

Table 4.1: Descriptive Summary of Malaria Dataset (n = 40,557)

Characteristic                 Sub-group         Distribution: n (%)
Clinical Test Result (Target)  Positive          38162 (94.09%)
                               Negative           2395 (5.91%)
Gender                         Male              23666 (58.35%)
                               Female            16891 (41.65%)
Age group                      Under 5            6703 (16.53%)
                               5-14               5736 (14.14%)
                               15-49             23512 (57.97%)
                               50-69              4015 (9.9%)
                               70+                 591 (1.46%)
Hospitalization Status         In-patient        21378 (52.71%)
                               Out-patient       19179 (47.29%)
Red Blood Cell count           Median (IQR)      4.37 (3.7, 4.91)
Calendar Season at test date   Autumn            14443 (35.61%)
                               Summer            14440 (35.6%)
                               Spring             8036 (19.81%)
                               Winter             3638 (8.97%)
Province reporting result      Limpopo           19683 (48.53%)
                               Mpumalanga         8425 (20.77%)
                               Gauteng            6649 (16.39%)
                               Kwazulu-Natal      2339 (5.77%)
                               North West         1176 (2.9%)
                               Western Cape       1096 (2.7%)
                               Eastern Cape        519 (1.28%)
                               Free State          515 (1.27%)
                               Northern Cape       155 (0.38%)
District where test was done   Mopani             9624 (23.73%)
                               Ehlanzeni          6847 (16.88%)
                               Vhembe             6529 (16.1%)
                               Ekurhuleni Metro   3554 (8.76%)
                               Capricorn          1426 (3.52%)
                               West Rand          1357 (3.35%)
                               Waterberg          1161 (2.86%)
                               Nkangala           1035 (2.55%)
                               Ethekwini Metro    1006 (2.48%)
                               Sekhukhune          943 (2.33%)
*IQR: Interquartile Range

The Malaria analytical dataset comprised four locality attributes: province, district, sub-district, and health facility. We observed little variability in sub-district and health facility, which were therefore unlikely to add predictive power to the classifiers; only province and district were considered as features.

The feature recording the district where the test was done had 52 levels. The ten districts with the fewest reported tests were Joe Gqabi and John Taolo Gaetsewe with 13 observations each; Harry Gwala with 12; Uthukela, Amajuba, Zf Mgcawu, Umzinyathi, and Namakwa with 11, 9, 8, 7, and 3 observations respectively; and Xhariep and Central Karoo with 2 observations each.
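The age-band and season derivations described above can be sketched as follows, assuming pandas; the DataFrame, bin edges, and season mapping below are illustrative placeholders consistent with Table 4.1, not the study's exact preprocessing code.

```python
import pandas as pd

# Hypothetical frame with the age and test-date columns from Table 3.2.
df = pd.DataFrame({"age_tested_years": [3, 27, 64, 81],
                   "test_date": pd.to_datetime(["2016-01-10", "2017-07-02",
                                                "2018-04-21", "2019-12-05"])})

# WHO-style malaria risk-population age bands used in Table 4.1.
df["age_group"] = pd.cut(df["age_tested_years"],
                         bins=[0, 5, 15, 50, 70, 120],
                         labels=["under5", "5-14", "15-49", "50-69", "70+"],
                         right=False)

# South African meteorological seasons deduced from the test month.
seasons = {12: "Summer", 1: "Summer", 2: "Summer",
           3: "Autumn", 4: "Autumn", 5: "Autumn",
           6: "Winter", 7: "Winter", 8: "Winter",
           9: "Spring", 10: "Spring", 11: "Spring"}
df["weather"] = df["test_date"].dt.month.map(seasons)
```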
Illustrated in Figure 4.5, the Limpopo and Mpumalanga provinces accounted for approximately 70% of the analytical dataset (n=28,108), with fewer than 1,000 tests done in the Northern Cape province. Because of the extended number of levels in the district attribute, only the ten most frequently reported districts were considered and used for model specification.

Figure 4.5: Malaria tests done per province

4.1.2 COVID-19 Analytical Data

The distribution of clinically diagnosed COVID-19 outcomes gave a positive-to-negative ratio of 162:100. Because no observations were dropped from the raw COVID-19 dataset, the before- and after-preprocessing distributions of the clinical outcome are identical.

Figure 4.6: COVID-19 clinical outcome (raw dataset)

Of the 14 features considered for the analytical dataset, two contained missing values: 2.1% (n=739) in age and 0.1% (n=35) in gender. As illustrated in Figure 4.7, between the ages of 10 and 90 years the COVID-19-negative population was slightly older than the positive population: 58 years (IQR 32.0, 53.0) vs 43 years (IQR 31.0, 54.0). Suspected COVID-19 cases at the time of sample collection were on average 31.5 years old (IQR: 31.0, 53.0).

Figure 4.7: COVID-19 age distribution of the population

It was also noted that not all suspected cases registered symptoms, as would be expected for such a highly infectious condition. Out of approximately 35,000 tests done, sore throat was reported in about 25% (n=8,807) of the population. Although cough, fever, and malaise have been reported in COVID-19 cases, these symptoms were recorded far less frequently in the COVID-19 surveillance data, as shown in Table 4.2. Figure 4.8 illustrates the logarithmic distribution of recorded symptoms in the COVID-19 analytical dataset, with descriptive summary statistics of the unstratified COVID-19 analytical dataset in Table 4.2 below.

4.2 Predicting Probable Cases

To determine probable cases of Malaria and COVID-19, we first investigated the relationships between all numerical features and the target using both correlation analysis and chi-square tests. From the correlation analysis of the Malaria analytical dataset, we observed largely weak relationships between the features and the target. However, the inter-feature correlation between gender and red-cell counts was positive (r = 0.18).

Table 4.2: Descriptive Summary of COVID-19 Dataset (n = 35,202)

Characteristic                 Sub-group           Distribution: n (%)
Clinical Test Result (Target)  Positive            21795 (61.91%)
                               Negative            13407 (38.09%)
Gender                         Male                11559 (32.87%)
                               Female              23608 (67.13%)
Age groups                     Below 60 years      27676 (86.1%)
                               60 years and above   4788 (13.9%)
Fever/Chills/Pyrexia           Absent              35175 (99.92%)
                               Present                27 (0.08%)
Cough                          Absent              35165 (99.89%)
                               Present                37 (0.11%)
Sore Throat                    Absent              26395 (74.98%)
                               Present              8807 (25.02%)
Shortness of Breath            Absent              35192 (99.97%)
                               Present                10 (0.03%)
Diarrhoea                      Absent              35199 (99.99%)
                               Present                 3 (0.01%)
Muscle or Joint aches          Absent              35195 (99.98%)
                               Present                 7 (0.02%)
Malaise                        Absent              35197 (99.99%)
                               Present                 2 (0.01%)
Fatigue or Lethargy            Absent              35201 (99.99%)
                               Present                 1 (0.01%)
Flu                            Absent              35200 (99.99%)
                               Present                 2 (0.01%)
Vomiting or Nausea             Absent              35201 (99.99%)
                               Present                 2 (0.01%)
Any Comorbidity                Absent              25687 (72.97%)
                               Present              9515 (27.03%)
Recorded comorbidities included HIV/AIDS, Tuberculosis, Hypertension, Diabetes, Asthma, Obesity, Cancer, and Chronic Obstructive Pulmonary Disease (COPD).
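The feature-target screening described above can be sketched as follows, assuming pandas and SciPy; the small frame df below is a hypothetical stand-in for the analytical dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical analytical frame with a binary target and mixed features.
df = pd.DataFrame({"target": [1, 0, 1, 1, 0, 1],
                   "red_cell_count": [4.1, 4.9, 3.8, 4.4, 5.0, 4.2],
                   "gender": ["M", "F", "M", "M", "F", "F"]})

# Pearson correlation between a numerical feature and the target.
print(df[["target", "red_cell_count"]].corr())

# Chi-square test of independence for a categorical feature vs the target.
table = pd.crosstab(df["gender"], df["target"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")
```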
Figure 4.8: Frequency distribution of recorded symptoms on a log scale

4.2.1 Predictions using Balanced Datasets

From the results presented in Table 4.3, all three classifiers (SVC, RF, and KNN) scored equally on accuracy (94%) with Malaria out-of-sample data. However, accuracies were lower when predicting clinical outcomes from COVID-19 out-of-sample data: the KNN scored lowest on accuracy (59%) whereas the SVC attained the highest predictive accuracy, though these differences were marginal. On both Malaria and COVID-19 out-of-sample data, the SVC yielded the highest sensitivity (100%) compared with KNN and RF.

Table 4.3: Performance Metrics on Balanced data (scores on out-of-sample data)

                         Malaria                  COVID-19
Performance Metric    SVC     RF      KNN      SVC     RF      KNN
Accuracy              0.941   0.94    0.942    0.62    0.611   0.59
MCC                   0       0.242   0.231    0.026   0.025   0.039
Sensitivity/Recall    1       0.99    0.994    1       0.944   0.826
Precision (PPV)       0.941   0.949   0.947    0.62    0.622   0.628
F1-Measure            0.97    0.969   0.97     0.765   0.75    0.714

Classifier-predicted values are presented in Table 4.4 below. At 97%, the SVC, RF, and KNN classifiers scored a higher F1-measure on Malaria out-of-sample data than on COVID-19 (76%, 75%, and 71% respectively). The classifiers generally scored high in predicting positive outcomes with the Malaria out-of-sample data, averaging 94%. By contrast, PPV scores on COVID-19 out-of-sample data were generally lower than on Malaria data: 62.0% for SVC, 62.4% for RF, and 62.8% for KNN.

Table 4.4: Confusion Matrices for Malaria and COVID-19: Balanced Data

                   Malaria                           COVID-19
Model/Classifier   TN    FN    TP     FP    ER       TN    FN    TP     FP    ER
SVC                0     0     9541   599   0.059    9     3     6536   4013  0.38
RF                 89    98    9443   510   0.06     275   366   6173   3747  0.389
KNN                70    61    9480   529   0.058    827   1139  5400   3195  0.41
ER: Error Rate. The number of correct and incorrect predictions by SVC, RF, and KNN, stratified by category.

Using AUC as the performance metric, we observed generally better performance in predicting clinical outcomes from Malaria out-of-sample data than from COVID-19, with the SVC, KNN, and RF classifiers predicting about 20% more accurately on Malaria data. These results are presented in Figure 4.9 below.

Figure 4.9: Classifier performance in ROC space - (a) Malaria (n=1389 per class); (b) COVID-19 (n=8938 per class)

4.2.2 Predictions using Weighted Datasets: Imbalanced Learning

In this approach, the models were refitted on the same sample data, this time accounting for the distribution weights of clinical outcomes during model specification. From the results presented in Table 4.5, the SVM yielded a recall score of 100% on both datasets; in other words, the model did not predict a single TN outcome from the 599 negative Malaria observations. The same poor prediction was observed on the COVID-19 data, where the SVC classifier accurately predicted only 9 of the 4,021 clinically negative observations. We also observe that the SVC classifier predicted the highest number of clinically positive observations (n=9,541).

Table 4.5: Performance Metrics on Weighted data (scores on out-of-sample data)

                         Malaria                  COVID-19
Performance Metric    SVC     RF      KNN      SVC     RF      KNN
MCC                   0       0.227   0.239    0.026   0.084   0.032
Sensitivity/Recall    1       0.986   0.992    1       0.486   0.786
Precision (PPV)       0.941   0.949   0.948    0.62    0.663   0.628
F1-Measure            0.97    0.967   0.969    0.765   0.568   0.698

The RF model for COVID-19 had the lowest sensitivity, at 48.6%. Accounting for target distribution weights, all three classifiers (SVC, RF, and KNN) scored higher PPV on Malaria than on COVID-19 (see Table 4.5).
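The weighting strategy can be sketched as follows; a minimal illustration assuming scikit-learn's class_weight option, not the study's exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 'balanced' re-weights each class inversely to its frequency, so the
# minority clinical outcome contributes proportionally more to the fit.
svc = SVC(kernel="linear", class_weight="balanced")
rf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Fit on the imbalanced training data as-is (no under-sampling), e.g.:
# svc.fit(X_train, y_train); rf.fit(X_train, y_train)
# Note: KNeighborsClassifier has no class_weight option; distance-weighted
# voting (weights="distance") is one alternative for KNN.
```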
Results from the error matrices are presented in Table 4.6 below.

Table 4.6: Confusion Matrices for Malaria and COVID-19: Weighted Data

                   Malaria                           COVID-19
Model/Classifier   TN    FN    TP     FP    ER       TN     FN     TP     FP    ER
SVC                0     0     9541   599   0.059    9      3      6536   4013  0.38
RF                 94    134   9407   505   0.063    2371   3291   3248   1651  0.468
KNN                81    80    9461   518   0.059    927    1399   5140   3050  0.42
ER: Error Rate. The number of correct and incorrect predictions by SVC, RF, and KNN, stratified by category.

With the F1-measure as the primary metric, the SVM outperformed the RF and KNN classifiers when modeled on COVID-19 data (76%, 57%, and 70% respectively). The scores remained similar (97%) when the models were subjected to the Malaria data. Taking into account the imbalance in the class distribution, we present the AUC results over the Precision-Recall space in Figure 4.10.

Figure 4.10: Classifier performance in Precision-Recall space - (a) Malaria Positive:Negative (1590:100); (b) COVID-19 Positive:Negative (162:100)

We observed higher predictive performance on Malaria out-of-sample data compared with COVID-19. The RF AUC was higher with Malaria out-of-sample testing (98%) versus 67% with COVID-19. Though marginally lower than the RF, the AUC for the SVC was still higher with the Malaria data. The same was observed with the KNN classifier, which yielded 98% on Malaria data while correctly predicting 65.5% of COVID-19 outcomes on out-of-sample data.

4.3 Novelty Detection Results

For novelty detection, we first examined the distribution of the target in both the training and out-of-sample data and observed severe skewness in the Malaria data: the positive-to-negative ratio was 16:1 in both the training and out-of-sample Malaria sets. Negative observations (the minority) were categorized as outliers in order to predict this category in the out-of-sample data. Table 4.7 below presents results from the unary classification approaches.

Table 4.7: Performance metrics using Unary classification on Malaria data (scores on out-of-sample data)

Performance Metric   OCSVM   iForest
MCC                  -0.05   0.037
Specificity/TNR      0.232   0.09
F1-Measure           0.071   0.092

We evaluated classifier performance in predicting the 599 out-of-sample Malaria-negative observations; the results are presented in the confusion matrix in Table 4.8 below. Worth noting, the OCSVM predicted more negative observations than the iForest (139 versus 54), hence its higher specificity score.

Table 4.8: Confusion Matrix from Unary classification: Malaria data

Model/Classifier   TN    FN     TP     FP
OneClassSVM        139   3162   6376   460
iForest            54    517    9024   545

5 Discussion

In this chapter, we discuss the empirical findings against the research objectives proposed in Chapter 1, presenting both a quantitative and a qualitative discussion of the empirical data reported in Chapter 4. The last section briefly discusses limitations encountered during the research.

5.1 Malaria and COVID-19 Surveillance Data Profiles

Between January 2015 and December 2019, Malaria prevalence in the surveillance data stood at 27% (n=58,692). Among those clinically diagnosed with Malaria, 57.2% (n=33,198) were from Limpopo province, followed by Mpumalanga (17%, n=9,623); Northern Cape had the fewest Malaria cases. Two possible explanations support this result. First, the climatic conditions in these areas favor Malaria transmission. Secondly, being border provinces, Mpumalanga and Limpopo have larger immigrant populations from the neighboring, Malaria-endemic countries of Mozambique and Zimbabwe.
These findings are in agreement with statistics reported in the Guidelines for the Treatment of Malaria in South Africa [24].

From the COVID-19 summary statistics reported in Table 4.2, fever was prevalent in less than 1% (n=27) of the population: 8 cases in the COVID-19-positive population and 19 in the negative population. The same was observed for cough (n=37) and fatigue/lethargy. The most prevalent symptom among those with clinical suspicion of COVID-19 was sore throat (25%, n=8,807); among the COVID-19-positive cases, sore throat was prevalent in 25% (n=5,450) of the population. On investigating the relationship between COVID-19 clinical outcomes and the presence/absence of sore throat, we found no statistical evidence in the data to suggest an association (p-value = 0.94).

In disease surveillance, attention is generally accorded to patients with a clinically positive rather than a negative outcome. This variation in capturing data creates gaps in the data over time. Because of this, we considered only complete cases for the Malaria analytical dataset. Several observations were dropped, the majority of which had a negative clinical outcome, reducing the dataset to 40,557 observations. Reporting stratified statistics from this biased sample would therefore over-estimate Malaria prevalence.

We investigated the association of risk factors with clinical outcomes. Although people of all ages are at risk, the older population, i.e. those above sixty years of age, is more susceptible to COVID-19 infection [83]. In the analytical dataset, however, the older population accounted for only 16% (n=3,422) of the 21,795 COVID-19-positive cases, with the majority under sixty years of age. A 2020 systematic review by Yang et al. [84] suggests an association between age and comorbidities among COVID-19 patients, identifying both as risk factors; the authors reported comorbidities to be more prevalent in high-risk populations, i.e. older patients reporting the presence of at least one comorbidity.

In this study, among those who tested positive for COVID-19, comorbidities were registered for 27.2% (n=5,935) of the population. This prevalence was similar among those who tested negative (26.7%, n=3,580). Using Mantel-Haenszel estimates, we observed that those who registered at least one comorbidity had the same risk of testing positive for COVID-19 as those who did not (Odds Ratio, OR = 1.02; 95% Confidence Interval, CI: 0.98, 1.08). However, adjusting for the effect of age, the older population had a 20% higher risk of testing positive for COVID-19 than the younger population (OR = 1.20; 95% CI: 1.06, 1.36). Noteworthy, of the 35,202 clinically suspected COVID-19 cases in this research, 6.4% (n=2,256) were high-risk, i.e. sixty years or older with at least one registered comorbidity. Although we found strong statistical evidence of an association between gender and COVID-19 clinical outcomes (p-value < .01), we did not find epidemiological evidence to support this finding.
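The Mantel-Haenszel adjustment can be sketched with statsmodels; the 2x2 counts below are hypothetical placeholders, not the study's actual stratified tables.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Hypothetical 2x2 tables (rows: comorbidity yes/no; columns: test
# positive/negative), stratified by age group (<60 vs 60+).
tables = [
    np.array([[4100, 2900], [14000, 9800]]),  # under 60 years
    np.array([[1835, 680], [1860, 1027]]),    # 60 years and above
]

st = StratifiedTable(tables)
or_pooled = st.oddsratio_pooled                  # Mantel-Haenszel pooled OR
ci_low, ci_high = st.oddsratio_pooled_confint()  # 95% CI by default
print(f"MH OR = {or_pooled:.2f} (95% CI: {ci_low:.2f}, {ci_high:.2f})")
```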
5.2 Classification and Prediction of Clinical Outcomes for Malaria and COVID-19

We applied three different approaches to predict probable cases of Malaria and COVID-19. The first two involved, respectively, resampling the data and modeling the data as-is while accounting for the distribution weights of the target; in both, our primary measure of performance was the AUC. In the third, a novelty approach, specificity was the primary evaluation metric.

Three models were fit to data in which the target had equal representation. With 1,389 observations in each class, the SVM, RF, and KNN models performed better on Malaria than on COVID-19 out-of-sample data. Although the per-class sample sizes were not comparable (1,389 for Malaria and 8,939 for COVID-19), this difference did not seem to alter the predictive power of the models. As illustrated in Figure 4.9, the RF classifier recorded the highest predictive power, at 80%, compared with the SVM and KNN (75.4% and 78.8% respectively). We also observed that, with COVID-19 out-of-sample data, the models performed no better than guessing, i.e. models with no skill.

As illustrated by the Precision-Recall and ROC curves, we observed an overall improvement in classification and prediction when classifiers accounted for the distribution weights of the target, though scores with COVID-19 data remained lower than with Malaria data. On the COVID-19 data, significant improvements were noticed for the RF, KNN, and SVC classifiers: the RF AUC improved from 56.1% to 67.3%, the KNN from 53.7% to 65.5%, and the SVC from 50.7% to 62.9%. This roughly twelve-percentage-point gain may be attributed to the increased number of observations the models were trained on, allowing them to learn more from the data and improve predictions. We also observed that the prediction error rate was generally higher with the RF classifier, though the difference from the KNN and SVM was marginal.

Given the target distributions presented in Section 4.1, novelty approaches were employed only on the Malaria data. Comparing the classification and prediction of negative observations, the OCSVM performed better than the iForest: of the 599 out-of-sample negative observations, the OCSVM predicted 139 correctly, attaining a specificity score of 0.23 versus 0.09 for the iForest (TN: 54). These results are unsatisfactory, as foreshadowed by the weak correlations reported in Section 4.2.

5.3 Qualitative Evaluation of Results

While this study employed disease predictors singly, a viable option proposed in the literature is to consider features in combination. For example, in the predictive diagnosis of Malaria from symptoms, individuals who report a fever and have had a previous Malaria episode in their household are more likely to yield a positive Malaria result. This approach would likely increase the predictive power of models compared with models that consider symptom information in isolation.

Whereas a clinical outcome was available for all observations, symptomatic information was less informative in the COVID-19 analytical dataset and completely absent in the Malaria dataset. Conversely, while the Malaria dataset contained laboratory markers (red blood cell counts), this information was completely lacking in the COVID-19 data. This dissimilarity in data structures was pronounced in the results, where models run on out-of-sample data performed overwhelmingly better on Malaria data than on COVID-19.
Research has shown that red blood cell counts are an indicator of infection in humans, which may explain the variation in prediction results. When implementing supervised ML predictive models, it is necessary to identify beforehand the relationships between features and targets as well as inter-feature correlations. In both the Malaria and COVID-19 analytical datasets, we identified weak correlations between features and targets, and the scant prevalence of recorded symptoms made it difficult to deduce informative correlations between clinical outcomes and the selected features. For this reason, we could not determine distinct predictors in the COVID-19 dataset, and only to a limited extent in the Malaria dataset. One approach to mitigating this pitfall is to employ comprehensive data-quality assessments at data-collection points: embedding automated integrity and validation checks in data-collection instruments enforces the tidy-data model proposed by Wickham [65] and described in Chapter 3.

In this research, we described the current surveillance data structures and profiles for Malaria and COVID-19 at the NHLS, South Africa. While clinical outcomes were observed for all observations, other attributes in the surveillance datasets were inconsistent and contained gaps. For example, between 2015 and 2019, disease symptoms, treatment information, travel and contact history, and case notes were unrecorded in up to 100% of observations in the Malaria surveillance data. This greatly limited how much information we were able to use for classification and prediction. Similarly, among the demographic attributes observed in the COVID-19 surveillance data, location information was lacking. Therefore, stratified analysis to determine cases per province to inform the DoH on resource allocation for C