Computational approaches to characterizing morbidity and mortality patterns in rural South Africa
| dc.contributor.author | Mapundu, Michael Tondera | |
| dc.contributor.supervisor | Celik, Turgay | |
| dc.date.accessioned | 2025-11-04T08:18:47Z | |
| dc.date.issued | 2024 | |
| dc.description | A research report submitted in fulfillment of the requirements for the Doctor of Philosophy, in the Faculty of Health Sciences, School of Public Health, University of the Witwatersrand, Johannesburg, 2024 | |
| dc.description.abstract | Background: Verbal autopsies (VAs) are commonly used in low- and middle-income countries (LMICs) to determine the cause of death where deaths occur outside health facilities and there is no medically certified cause of death. The VA process is usually carried out by interviewing relatives of the deceased to elicit information about the circumstances and events surrounding the death. The compiled VA narratives, supplemented by the full set of responses to the structured questions, are then given to two physicians for assessment in order to reach a consensus on the cause of death. Where they disagree, a third physician is consulted; this process is known as Physician Coded Verbal Autopsy (PCVA). PCVA is the most widely used process for determining cause of death. However, it is widely criticized for its lack of robustness, its cost and time demands, and its inconsistencies and inaccuracies, as it is subjective and prone to error. These challenges affect the accuracy of verbal autopsy results. Consequently, PCVA results are mostly employed for the training and validation of computational approaches. Despite these challenges, VAs have been employed successfully to estimate mortality rates and causes of death in settings where vital registration systems are weak or non-existent. Efforts are therefore ongoing to improve the validity and reliability of verbal autopsies, including the use of computational approaches for analysing the data. There has been growing interest within the VA community in applying automated, artificial intelligence-based algorithms to improve cause of death determination from VA data, thus closing the civil registration gap. 
It has been shown that machine learning (ML) and natural language processing (NLP) help identify patterns and trends in the data that might be missed by manual analysis but are key to transforming the data into actionable insights that can help improve health outcomes. Study Objectives: The overall aim of this study was to utilize advanced computational methodologies to gain a comprehensive understanding of the complex health dynamics within rural South African communities. The study aimed to bridge the gap between traditional statistical and epidemiological approaches and the unique challenges faced by these communities, thereby contributing to more informed public health strategies and interventions. As such, we sought to understand the determinants and circumstances of events leading to cause of death in rural north-east South Africa, using predictive and descriptive analysis. The main focus of this study was to answer the following questions, which are crucial in VA monitoring and decision making: 1) what are the most prevalent topical diseases that led to death at the Agincourt Health and Demographic Surveillance System (HDSS) in rural north-east South Africa between 1993 and 2015?; 2) how is mortality clustered by cause of death within households, and what characteristics of households are associated with high mortality at Agincourt HDSS? and 3) to what extent can machine learning and deep learning techniques accurately classify cause of death as compared to physician classification? 
In the process of our investigation we will also address the following sub-questions: 1) how can machine learning and statistical modelling be effectively applied to analyse and predict mortality patterns within rural South African communities?; 2) what are the key determinants and socio-economic factors that contribute to variations in mortality rates?; 3) can computational techniques uncover hidden correlations between specific diseases and socio-economic indicators, providing insights into potential causal relationships?; 4) how can spatial analysis techniques be used to identify clusters of high mortality rates in rural South Africa, and what underlying factors might be driving these patterns?; 5) what role do access to healthcare resources, healthcare infrastructure, and healthcare-seeking behaviour play in shaping mortality outcomes in rural South African communities?; 6) to what extent do traditional health beliefs, cultural practices, and community dynamics influence morbidity and mortality patterns, and how can computational methods account for these factors?; 7) how effective are data-driven models in predicting future morbidity and mortality trends in rural South Africa, and how might these predictions inform healthcare planning and resource allocation? and 8) what are the potential barriers and opportunities for scaling up successful computational approaches to other rural regions within South Africa or similar global contexts? The specific objectives of this study were: 1. To determine the most prevalent diseases that led to deaths using VA narrative datasets and text mining techniques at Agincourt HDSS in rural north-east South Africa between 1993 and 2015. 2. To identify mortality clusters by cause of death, establish determinants of mortality clusters and investigate the mortality characteristics associated with households at Agincourt HDSS between 1993 and 2015. 3. 
To establish ML accuracy in automating VA classification and achieve at least the same level of accuracy as physician classification of cause of death in rural north-east South Africa between 1993 and 2015. One of the aims of this study was to apply text mining techniques to derive implicit knowledge that is hidden in the unstructured VA narratives and present it in an explicit form. This allowed us to discover the mortality causes and most prevalent diseases to which the population succumbed. Secondly, we sought to establish ML accuracy in automating VA classification and achieve at least the same level of accuracy as physician classification of cause of death. This was done through a comparative performance evaluation of ML methods and Computer Coded Verbal Autopsy (CCVA) algorithms on South African VA narratives data. Additionally, we explored novel deep learning architectures in order to generate cause of death predictions in a timely, cost-effective and error-free way. These computational techniques will enable us to achieve our aim of determining events and circumstances leading to cause of death by identifying morbidity and mortality occurrences in rural north-east South Africa from 1993 to 2015. As such, the study will ease the design, development, implementation and sustainment of tailored health intervention programmes. Consequently, this will improve life expectancy and turnaround time for diagnosis, and enforce a standardised VA reporting approach. This will therefore close the civil registration gap. Method: This study was a secondary data analysis of routinely collected VA data at Agincourt HDSS for the period 1993 to 2015. Agincourt HDSS is a surveillance site that provides evidence-based health monitoring that seeks to strengthen health priorities and practice and to inform policy. In this study, we used three types of datasets. 
The first dataset is the structured responses from the standard questionnaire, the second dataset is the VA narratives, and the third dataset is a combination of the responses and the narratives. The three datasets had 287 columns/features and 16338 records/observations. For the responses-only dataset, we took all features that had responses from the standard questionnaire as our predictors and the cause of death assigned by physicians, using the International Classification of Diseases-10 (ICD-10) code for each record in the dataset, as our target variable. Ultimately, we had 231 predictors (all symptoms, age at death and gender) and 1 target variable, and all our features were in English. Predictions from the narratives used age at death, gender and the narrative feature as predictors, with 1 target variable. For the combined VA dataset we used 232 predictors and 1 target variable; we only added the VA narrative feature to the responses dataset to obtain the combined dataset. We further created twelve cause-of-death categories with corresponding labels and recorded the class distribution (number of samples per class) before and after data balancing of our training dataset. The cause-of-death categories were derived from the InterVA user guide. The text mining and deep learning studies used the narratives-only dataset, and the ML study employed all datasets. Results: ML models could accurately determine the cause of death from VA narratives, producing results comparable to expert diagnosis, with our optimal models attaining accuracies around 96% and statistically significant differences in algorithmic performance (p < 0.0001). Similarly, our novel stacked ensemble deep learning (SEDL) methods outperformed conventional DL approaches, attaining an accuracy of 82%; we also employed Local Interpretable Model-agnostic Explanations (LIME) to enhance the interpretability of DL models, thus fostering trust in their use in healthcare. 
Our empirical results suggest that our automated approaches can be integrated into the cause of death (CoD) pipeline for identifying mortality causes, alongside human annotation and interpretation. Additionally, through mortality trend and pattern analysis, we discovered that in the first decade of the civil registration system in South Africa, the average life expectancy was approximately 50 years. However, in the second decade, life expectancy dropped significantly, and the population was dying at a much younger average age of 40 years. This suggests that the HDSS population succumbed to mortality causes such as vomiting/diarrhoea, chest/stomach pain, fever, coughing and high blood pressure. We found that the most prevalent diseases included human immunodeficiency virus (HIV), tuberculosis (TB), neurological disorders, malaria, diabetes, high blood pressure, chronic ailments (kidney, heart, lung, liver), and maternal and accident-related deaths. Notably, in the third decade, we see a gradual improvement in life expectancy, possibly attributable to effective health intervention programmes. Our sequential modelling of patient care-seeking patterns suggests that most people in the HDSS seek traditional rather than western healthcare when faced with terminal illnesses. Additionally, we noticed that the narratives contain additional variables which can ease cause of death diagnosis using sequential modelling and semantic and structure analysis, with a retrieval rate of approximately n > 2 per case, where n is the number of terms. Through a structure and semantic analysis of narratives where experts disagree, we also demonstrate the most frequent terms relating to traditional healer consultations and visits. This can possibly assist in determining cause of death, specifically in the unknown category. 
Conclusion: This research study explores the utilization of computational techniques to analyse and comprehend morbidity and mortality patterns in rural South Africa, by leveraging large-scale VA data and complex computational approaches to discern prevalent diseases, health disparities, and mortality trends within this specific context. These computational approaches provide nuanced insights into disease prevalence, risk factors, and potential correlations between socio-economic factors and health outcomes. The findings of this study flag the potential of computational approaches in uncovering intricate health dynamics in rural settings, shedding light on areas for targeted interventions and policy enhancements. The study underscores the significance of harnessing data-driven strategies to inform public health strategies tailored to the unique challenges faced by communities in rural South Africa. The study represents a significant advancement in the field of CoD determination from VA narratives by introducing innovative ML and DL techniques that offer accurate and interpretable results. The findings suggest the potential for these models to streamline the VA reporting process, ultimately benefiting healthcare systems in LMICs by reducing diagnosis turnaround time and costs and improving the accuracy of CoD determination. Therefore, this research bridges the gap between the amount of data available and research that can lead to practical actions, thus supporting multidisciplinary research in civil registration systems using VA data. Consequently, it provides a baseline for future studies that generalise these findings to other domains of interest, highlights the importance of improving health intervention programmes in LMICs to increase life expectancy, and contributes to the understanding of mortality patterns and prevalent diseases in LMICs by harnessing the power of computational techniques. | |
| dc.description.submitter | MM2025 | |
| dc.faculty | Faculty of Health Sciences | |
| dc.identifier | 0000-0002-2830-0692 | |
| dc.identifier.citation | Mapundu, Michael T. (2024). Computational approaches to characterizing morbidity and mortality patterns in rural South Africa [PHD Thesis, University of the Witwatersrand, Johannesburg]. WIReDSpace. https://hdl.handle.net/10539/47337 | |
| dc.identifier.uri | https://hdl.handle.net/10539/47337 | |
| dc.language.iso | en | |
| dc.publisher | University of the Witwatersrand, Johannesburg | |
| dc.rights | © 2024 University of the Witwatersrand, Johannesburg. All rights reserved. The copyright in this work vests in the University of the Witwatersrand, Johannesburg. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of University of the Witwatersrand, Johannesburg. | |
| dc.rights.holder | University of the Witwatersrand, Johannesburg | |
| dc.school | School of Public Health | |
| dc.subject | UCTD | |
| dc.subject | Cause of Death | |
| dc.subject | deep learning | |
| dc.subject | LIME | |
| dc.subject.primarysdg | SDG-3: Good health and well-being | |
| dc.title | Computational approaches to characterizing morbidity and mortality patterns in rural South Africa | |
| dc.type | Thesis |