4. Electronic Theses and Dissertations (ETDs) - Faculties submissions

Permanent URI for this community: https://hdl.handle.net/10539/37773


Search Results

Now showing 1 - 9 of 9
  • Machine learning in marketing strategy: A socio-technical approach in South Africa
    (University of the Witwatersrand, Johannesburg, 2024) Govender, Aleasha; Quaye, Emmanuel
    The purpose of this research study was to determine whether existing market segmentation, targeting and positioning (STP) approaches are optimal for marketing strategy in South Africa, and to what extent AI and machine learning are being used to improve marketing strategy in South Africa. The methods drew on qualitative research and document analysis. There were 10 participants in the study, drawn from the banking, telecommunications and medical insurance industries. The key results indicate that machine learning is at its inception phase in terms of being used in marketing strategy in corporate South Africa. The research further finds that the factors slowing development in this field are aligned with both hard and soft capabilities: for example, along with infrastructural capabilities such as software integration, strategic capabilities such as interdepartmental alignment are required for effective deployment of these technologies. Further, the research finds that the current segmentation, targeting and positioning methods, used in isolation, do not contribute optimally to marketing strategy; rather, a blended approach that includes insights from customer data will provide a more accurate STP strategy. This research supports marketers, technologists, business structures, researchers in South Africa, and strategists who deal with mass consumer bases, because market segmentation, targeting and positioning underpin how marketing strategy is rolled out throughout corporate South Africa, and AI and machine learning are emerging, highly topical technologies that are only at the inception phase of optimal utilisation.
  • Generating Rich Image Descriptions from Localized Attention
    (University of the Witwatersrand, Johannesburg, 2023-08) Poulton, David; Klein, Richard
    The field of image captioning is constantly growing, with swathes of new methodologies, performance leaps, datasets, and challenges. One new challenge is the task of long-text image description. While the vast majority of research has focused on short captions consisting of only brief phrases or sentences, new research and the recently released Localized Narratives dataset have pushed this to rich, paragraph-length descriptions. In this work we perform additional research to grow the sub-field of long-text image descriptions and determine the viability of our new methods. We experiment with a variety of progressively more complex LSTM- and Transformer-based approaches, utilising human-generated localised attention traces and image data to generate suitable captions, and evaluate these methods on a suite of common language evaluation metrics. We find that LSTM-based approaches are not well suited to the task and under-perform Transformer-based implementations on our metric suite, while also proving substantially more demanding to train. On the other hand, we find that our Transformer-based methods are capable of generating captions with rich focus over all regions of the image and in a grammatically sound manner, with our most complex model outperforming existing approaches on our metric suite.
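    As a rough, hypothetical illustration of the general idea described above (not the thesis implementation), the sketch below shows a Transformer decoder that attends over image region features concatenated with normalised (x, y) trace coordinates, in the spirit of Localized Narratives, to generate long captions. All dimensions, names and the toy data are assumptions.

    ```python
    # Hypothetical sketch: condition caption generation on image features + trace points.
    import torch
    import torch.nn as nn

    class TraceConditionedCaptioner(nn.Module):
        def __init__(self, vocab_size=10000, feat_dim=2048, d_model=512,
                     nhead=8, num_layers=4, max_len=256):
            super().__init__()
            # Project region features plus 2-D trace coordinates into the model width.
            self.memory_proj = nn.Linear(feat_dim + 2, d_model)
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, region_feats, trace_xy, tokens):
            # region_feats: (B, R, feat_dim); trace_xy: (B, R, 2); tokens: (B, T)
            memory = self.memory_proj(torch.cat([region_feats, trace_xy], dim=-1))
            positions = torch.arange(tokens.size(1), device=tokens.device)
            tgt = self.token_emb(tokens) + self.pos_emb(positions)
            t = tokens.size(1)
            causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            return self.out(hidden)  # (B, T, vocab_size) next-token logits

    # Toy forward pass with random tensors.
    model = TraceConditionedCaptioner()
    logits = model(torch.randn(2, 16, 2048), torch.rand(2, 16, 2),
                   torch.randint(0, 10000, (2, 20)))
    print(logits.shape)  # torch.Size([2, 20, 10000])
    ```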
  • Analyzing the performance and generalisability of incorporating SimCLR into Proximal Policy Optimization in procedurally generated environments
    (University of the Witwatersrand, Johannesburg, 2024) Gilbert, Nikhil; Rosman, Benjamin
    Multiple approaches to state representation learning have been shown to improve the performance of reinforcement learning agents substantially. When used in reinforcement learning, a known challenge in state representation learning is enabling an agent to represent environment states with similar characteristics in a manner that allows the agent to recognise them as such. We propose a novel algorithm that combines contrastive learning with reinforcement learning so that agents learn to group states by common physical characteristics and action preferences during training, and then generalise what is learned to environment obstacles not encountered during training. To enable a reinforcement learning agent to use contrastive learning within its environment interaction loop, we propose a state representation learning model that employs contrastive learning to group states using observations coupled with the action the agent chose within its current state. Our approach uses a combination of two algorithms that we augment to demonstrate the effectiveness of combining contrastive learning with reinforcement learning. The state representation model is the Simple Framework for Contrastive Learning of Visual Representations (SimCLR) of Chen et al. [2020], which we amend to include action values from the chosen reinforcement learning environment. Proximal Policy Optimization (PPO), a policy gradient algorithm, is our chosen reinforcement learning approach for policy learning, which we combine with SimCLR to form our novel algorithm, Action Contrastive Policy Optimization (ACPO). When combining these augmented algorithms for contrastive reinforcement learning, our results show significant improvement in training performance and generalisation to unseen environment obstacles of similar structure (physical layout of interactive objects) and mechanics (the rules of physics and transition probabilities).
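    As a rough illustration of coupling SimCLR-style contrastive learning with the chosen action (not the thesis's actual ACPO code), the sketch below embeds (observation, action) pairs and computes an NT-Xent loss between two augmented views; in an ACPO-like setup this term would be added as an auxiliary loss alongside PPO's clipped objective. The network sizes, augmentation and weighting are assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ObsActionEncoder(nn.Module):
        """Embed an observation together with the chosen action (one-hot)."""
        def __init__(self, obs_dim, n_actions, emb_dim=128):
            super().__init__()
            self.n_actions = n_actions
            self.net = nn.Sequential(
                nn.Linear(obs_dim + n_actions, 256), nn.ReLU(),
                nn.Linear(256, emb_dim),
            )

        def forward(self, obs, action):
            a = F.one_hot(action, self.n_actions).float()
            return self.net(torch.cat([obs, a], dim=-1))

    def nt_xent(z1, z2, temperature=0.5):
        """SimCLR's NT-Xent loss: two views of the same (obs, action) pair are positives."""
        n = z1.size(0)
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D)
        sim = z @ z.t() / temperature                        # (2B, 2B) similarities
        sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)

    # Toy usage: two lightly perturbed "views" of the same batch of observations.
    enc = ObsActionEncoder(obs_dim=64, n_actions=15)
    obs = torch.randn(32, 64)
    act = torch.randint(0, 15, (32,))
    z1 = enc(obs + 0.05 * torch.randn_like(obs), act)
    z2 = enc(obs + 0.05 * torch.randn_like(obs), act)
    aux_loss = nt_xent(z1, z2)
    # total_loss = ppo_loss + beta * aux_loss   # added to PPO's objective (beta assumed)
    print(aux_loss.item())
    ```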
  • Learning to adapt: domain adaptation with cycle-consistent generative adversarial networks
    (University of the Witwatersrand, Johannesburg, 2023) Burke, Pierce William; Klein, Richard
    Domain adaptation is a critical part of modern-day machine learning, as many practitioners do not have the means to collect and label all the data they require reliably. Instead, they often turn to large online datasets to meet their data needs. However, this can lead to a mismatch between the online dataset and the data they will encounter in their own problem. This is known as domain shift and plagues many different avenues of machine learning: differences in data sources, changes in the underlying processes generating the data, or new, unseen environments the models have yet to encounter can all lead to performance degradation. Building on the success of Cycle-consistent Generative Adversarial Networks (CycleGAN) in learning unpaired image-to-image mappings, we propose a new method to help alleviate the issues caused by domain shifts in images. The proposed model incorporates an adversarial loss to encourage realistic-looking images in the target domain, a cycle-consistency loss to learn an unpaired image-to-image mapping, and a semantic loss from a task network to improve the generator's performance. The task network is concurrently trained with the generators on the generated images to improve downstream task performance on adapted images. By utilizing the power of CycleGAN, we can learn to classify images in the target domain without any target domain labels. In this research, we show that our model is successful on various unsupervised domain adaptation (UDA) datasets and can alleviate domain shifts for different adaptation tasks, like classification or semantic segmentation. In our experiments on standard classification, we were able to bring the model's performance to near-oracle-level accuracy on a variety of different classification datasets. The semantic segmentation experiments showed that our model could improve performance on the target domain, but there is still room for further improvements. We also further analyze where our model performs well and where improvements can be made.
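    A minimal sketch of how the three loss terms described above could be combined for the generators, assuming a least-squares adversarial loss, an L1 cycle loss and a cross-entropy task loss; the stand-in networks and loss weights below are illustrative, not the thesis's architecture.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def generator_loss(G_st, G_ts, D_t, task_net, x_src, y_src,
                       lambda_cyc=10.0, lambda_sem=1.0):
        """Adversarial + cycle-consistency + semantic (task) loss for the source->target generator."""
        fake_t = G_st(x_src)                               # translate source image to target style
        d_out = D_t(fake_t)
        adv = F.mse_loss(d_out, torch.ones_like(d_out))    # fool the target-domain discriminator
        cyc = F.l1_loss(G_ts(fake_t), x_src)               # translating back should recover the source
        sem = F.cross_entropy(task_net(fake_t), y_src)     # task net should keep source labels correct
        return adv + lambda_cyc * cyc + lambda_sem * sem

    # Toy usage with stand-in networks on 32x32 RGB images.
    G_st = nn.Conv2d(3, 3, 3, padding=1)
    G_ts = nn.Conv2d(3, 3, 3, padding=1)
    D_t = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
    task_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    x = torch.randn(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    print(generator_loss(G_st, G_ts, D_t, task_net, x, y).item())
    ```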
  • Leveraging Machine Learning in the Search for New Bosons at the LHC and Other Resulting Applications
    (University of the Witwatersrand, Johannesburg, 2023-09) Stevenson, Finn David; Mellado, Bruce
    This dissertation focuses on the use of semi-supervised machine learning for data generation in high-energy physics, specifically to aid in the search for new bosons at the Large Hadron Collider. The overarching physics analysis for this work involves the development of a generative machine learning model to assist in the search for resonances in the Zγ final-state background data. A number of Variational Auto-encoder (VAE) derivatives are developed and trained to generate a chosen fast-simulated Monte Carlo dataset. These VAE derivatives are then evaluated using chosen metrics and plots to assess their performance in data generation. Overall, this work aims to demonstrate the utility of semi-supervised machine learning techniques in the search for new resonances in high-energy physics. A resulting application of machine learning to COVID-19 crisis management is also documented.
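    For readers unfamiliar with VAEs, the following is a generic minimal sketch of a VAE trained on tabular event features and then sampled from the prior to generate synthetic events; the dimensions and random data are placeholders, not the dissertation's models or the Zγ dataset.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        """Minimal VAE sketch for tabular event features (dimensions are illustrative)."""
        def __init__(self, x_dim=12, z_dim=4, hidden=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, z_dim)
            self.logvar = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation trick
            return self.dec(z), mu, logvar

    def elbo_loss(x, x_hat, mu, logvar):
        recon = F.mse_loss(x_hat, x, reduction="sum")                 # reconstruction term
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) # KL to the unit Gaussian prior
        return recon + kld

    # One toy training step on random "events"; new events are generated by
    # decoding draws from the standard normal prior.
    model = VAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(256, 12)
    x_hat, mu, logvar = model(x)
    loss = elbo_loss(x, x_hat, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
    new_events = model.dec(torch.randn(1000, 4))
    print(new_events.shape)
    ```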
  • Machine Learning on biochemical data for the prediction of mutation presence in suspected Familial Hypercholesterolaemia
    (2024) Hesse, Reinhardt
    Background: Familial hypercholesterolemia (FH) is a common monogenic disorder and, if not diagnosed and treated early, results in premature atherosclerotic cardiovascular disease. Most individuals with FH are undiagnosed due to limitations in current screening and diagnostic approaches, but the advent of machine learning (ML) offers a new prospect for identifying these individuals. Our objective was to create an ML model from basic lipid profile data with better screening performance than low-density lipoprotein cholesterol (LDL-C) cut-off levels and diagnostic performance comparable to the Dutch Lipid Clinic Network (DLCN) criteria. Methods: The ML model was developed using a combination of logistic regression, deep learning and random forest classification and was trained on a 70% split of an internal dataset consisting of 555 individuals clinically suspected of having FH. The performance of the model, as well as that of the LDL-C cut-off and DLCN criteria, was assessed on both the internal 30% testing dataset and a high-prevalence external dataset by comparing the areas under the receiver operating characteristic (AUROC) curves. All three methodologies were measured against the gold standard of FH diagnosis by mutation identification. Furthermore, the ML model was also tested on two lower-prevalence datasets derived from the same external dataset. Results: The ML model achieved an AUROC of 0.711 on the high-prevalence external dataset (n=1376; FH prevalence=64%), which was superior to that of the LDL-C cut-off alone (AUROC=0.642) and comparable to that of the DLCN criteria (AUROC=0.705). The model performed even better when tested on the medium-prevalence (n=2655; FH prevalence=20%) and low-prevalence (n=1616; FH prevalence=1%) datasets, with AUROC values of 0.801 and 0.856, respectively. Conclusions: Despite the absence of clinical information, the ML model was better at correctly identifying genetically confirmed FH in a cohort of individuals suspected of having FH than the LDL-C cut-off tool and comparable to the DLCN criteria. The same ML model performed even better when tested on two cohorts with lower FH prevalence. The application of ML is therefore a promising tool in both the screening for, and diagnosis of, individuals with FH.
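    A schematic illustration (on synthetic data, not the study's cohort) of the kind of comparison reported above: the AUROC of a classifier trained on lipid-profile features versus an LDL-C value used alone as a single-threshold risk score. The feature layout, sample sizes and model choice are assumptions.

    ```python
    # Illustrative AUROC comparison on synthetic data (n=555, 70/30 split as in the abstract).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=555, n_features=6, n_informative=4,
                               weights=[0.5, 0.5], random_state=0)
    ldl_c = X[:, 0]  # stand-in: pretend the first feature is LDL-C

    X_tr, X_te, y_tr, y_te, ldl_tr, ldl_te = train_test_split(
        X, y, ldl_c, test_size=0.30, random_state=0, stratify=y)

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    auc_model = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    auc_ldl = roc_auc_score(y_te, ldl_te)  # the raw LDL-C value used as a risk score
    print(f"model AUROC={auc_model:.3f}  LDL-C-only AUROC={auc_ldl:.3f}")
    ```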
  • The use of machine learning techniques in identifying gender differentials in COVID-19 hospitalizations, probabilities of hospitalization outcomes and hidden correlations with demographic and clinical factors
    (2024) Malaatjie, Meghan Abigail
    Background: Sex-differentiated data on hospitalisation frequency, case severity, pre-existing medical conditions, and mortality outcomes amongst hospitalised COVID-19 patients are needed but limited in Gauteng province, the epicentre of the COVID-19 pandemic in South Africa. This study aims to investigate whether machine learning techniques can provide insight into gender differentials in COVID-19 hospitalisations throughout the four waves of the pandemic in the Gauteng province of South Africa. Method: A weak supervision learning algorithm was used to perform binary classification. A deep neural network (DNN) was trained on 14 features of patient characteristics (demographic variables, presence of comorbidity, care received upon admission, and setting of care) to separate two classes: a) a severe disease class (a proxy measure of higher severity, comprising those who died during admission or were admitted to an intensive care unit (ICU) or high care unit (HCU)), and b) a less severe disease class. Results: The number of COVID-19 hospitalisations was highest in wave 3 for both males and females, and higher in females than males across all four waves. The observed difference in COVID-19 hospitalisation frequency between men and women was highest in the 20–40-year age group, with a ratio of 1:3. COVID-19 hospitalisation frequencies were higher among patients with hypertension, diabetes, and HIV across all age groups. Conclusion: This study demonstrated the utility of machine learning for analysing multidimensional sex-disaggregated data to provide accurate, real-time information for public health monitoring of sex differences in the Gauteng province.
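    A minimal, hypothetical sketch of a binary severity classifier over 14 encoded patient features, broadly in the spirit of the DNN described above; the architecture, random stand-in data and training loop are illustrative assumptions only.

    ```python
    # Schematic sketch only (random stand-in data): a small feed-forward network
    # separating "severe" from "less severe" hospitalisations from 14 features.
    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Linear(14, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 1),            # single logit: severe vs. less severe
    )
    loss_fn = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    X = torch.randn(1024, 14)                     # 14 encoded patient features
    y = torch.randint(0, 2, (1024, 1)).float()    # proxy severity label

    for epoch in range(5):
        opt.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        opt.step()
    print(f"final training loss: {loss.item():.3f}")
    ```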
  • Estimating and predicting HIV risk using statistical and machine learning methods: a case study using the 2005 to 2015 Zimbabwe demographic health survey data
    (2024) Makota, Rutendo Beauty Birri
    Background: The 90–90–90 targets were launched by the Joint United Nations Programme on HIV/AIDS (UNAIDS) and partners with the aim of diagnosing 90% of all HIV-positive persons, providing antiretroviral therapy (ART) for 90% of those diagnosed, and achieving viral suppression for 90% of those treated by 2020. In Zimbabwe, a population-based survey in 2016 reported that 74.2% of people living with HIV (PLHIV) aged 15–64 years knew their HIV status. Among the PLHIV who knew their status, 86.8% self-reported current use of ART, and 86.5% of those self-reported being virally suppressed. For these 90–90–90 targets to be met, prevalence and incidence rate estimates are crucial in understanding the current status of the HIV epidemic and determining whether the trends are improving towards the 2030 target. Ultimately, this will contribute to the achievement of Sustainable Development Goal 3 (SDG 3) and the broader goal of promoting sustainable development and eradicating poverty worldwide by 2030. Using data from household surveys, this thesis provides a unique statistical approach for estimating the incidence and prevalence of the Human Immunodeficiency Virus (HIV). To properly assess the efficacy of focused public health interventions and to appropriately forecast the HIV-related burden placed on healthcare systems, a comprehensive assessment of HIV incidence is essential. Targeting certain age groups with a high risk of infection is necessary to increase the effectiveness of public health interventions. To jointly estimate age- and time-dependent HIV incidence and diagnosis rates, the methodological focus of this thesis was on developing a comprehensive statistical framework for age-dependent HIV incidence estimates. Additionally, the risk of HIV infection was evaluated using interval-censoring methods and machine learning. Finally, geospatial modelling techniques were utilised to determine the spatial patterns of HIV incidence at district level and to identify hot spots for HIV risk to guide policy. The main aim of this thesis was to estimate and predict HIV risk using statistical and machine learning methods. Study objectives: The study objectives of this thesis were: 1. To determine the effect of several drivers/factors of HIV infection on survival time over a decade in Zimbabwe, using current status data. 2. To determine common risk factors of HIV positivity in Zimbabwe and the prediction capability of machine learning models. 3. To estimate HIV incidence using the catalytic and Farrington models and to test the validity of these estimates at the national and sub-national levels. 4. To estimate the age- and time-dependent prevalence and HIV force-of-infection (FOI) using current status data by comparing parametric, semi-parametric and non-parametric models, and determining which models best fit the data. 5. To investigate HIV incidence hotspots in Zimbabwe using geographically weighted regression. Methods: We performed secondary data analysis on cross-sectional data collected from the Zimbabwe Demographic Health Survey (ZDHS) from 2005 to 2015. HIV test results and adult interview datasets from three Zimbabwe Demographic Health Surveys were merged, and records without an HIV test result were excluded from the analysis. The outcome variable was HIV status. Survey- and cluster-adjusted logistic regression was used to determine variables for use in survival analysis with HIV status as the outcome variable.
    Covariates found significant in the logistic regression were used in survival analysis to determine the factors associated with HIV infection over the ten years. The data for the survival analysis were modelled assuming age-at-survey imputation (Model 1) and interval-censoring (Model 2). To determine the risk of HIV infection using machine learning methods, the prediction model was fit by adopting 80% of the data for learning/training and 20% for testing/prediction. Resampling was done using repeated stratified 5-fold cross-validation. The best algorithm was the one with the highest F1 score, which was then used to identify individuals with a higher likelihood of HIV infection. Considering that the proportions of HIV-negative and HIV-positive individuals were imbalanced, with a ratio of 4.2:1, we applied resampling methods to handle the class imbalance, performing the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes. We evaluated two alternative methods for predicting HIV incidence in Zimbabwe between 2005 and 2015, estimating HIV incidence from seroprevalence data using the catalytic and Farrington two-parameter models. These models were validated at the micro and macro levels using community-based cohort incidence and empirical estimates from UNAIDS EPP/SPECTRUM, respectively. To ascertain the age-time effects of HIV risk, we estimated the age- and time-dependent HIV FOI using current status data. Five generalised additive models were explored, spanning linear, semi-parametric, non-parametric and non-proportional hazards additive models. The Akaike Information Criterion (AIC) was used to select the best model, which was then used to estimate the age- and time-dependent HIV prevalence and force-of-infection. An ordinary least squares (OLS) model was fitted for each survey year to determine the global relationship between HIV incidence and the significant covariates. Moran's I spatial autocorrelation was used to assess the spatial independence of residuals. The Getis-Ord Gi* statistic was used for hotspot analysis, which identifies statistically significant hot and cold spots using a set of weighted features. Interpolation maps of HIV incidence were created using Empirical Bayesian Kriging to produce smooth surfaces of HIV incidence for visualisation and data generation at the district level. The Multiscale Geographically Weighted Regression method was used to assess whether the relationship between HIV incidence and covariates varied by district. The software used in the thesis analysis included R, Stata, Python, ArcGIS and WinBUGS. Results: The model goodness-of-fit test based on Cox-Snell residuals against the cumulative hazard indicated that the model with interval-censoring was the best. On the contrary, the AIC indicated that the normal survival model was the best. Factors associated with a high risk of HIV infection were being female, the number of sexual partners, and having had an STI in the year prior to the survey. The machine learning comparison indicated that the XGBoost model performed better than the other five models for both the original data and the SMOTE-processed data. Identical variables for both sexes across the three survey years for predicting HIV status were: total lifetime number of sex partners, cohabitation duration (grouped), number of household members, age of household head, times away from home in the last 12 months, beating justified, and religion.
    The two most influential variables for both males and females were total lifetime number of sex partners and cohabitation duration (grouped). According to these findings, the catalytic model estimated a higher HIV incidence rate than the Farrington model. Compared to cohort estimates, the estimates were within the observed 95% confidence interval, with 88% and 75% agreement for the catalytic and Farrington models, respectively. The limits of agreement observed in the Bland-Altman plot were narrow for all plots, indicating that our model estimates were comparable to cohort estimates. Compared to UNAIDS estimates, the catalytic model predicted a progressive increase in HIV incidence for males throughout all survey years. HIV incidence clearly declined with each subsequent survey year for all models. Based on birth-year cohort-specific prevalence, female HIV prevalence peaks at approximately 29 years of age and then declines. Between 15 and 30 years, males have a lower cohort-specific prevalence than females. Male cohort-specific prevalence decreases marginally between ages 33 and 39, then peaks at age 40. In all age categories, the cohort-specific FOI is greater in females than males. Moreover, the cohort-specific HIV FOI peaked at age 22 for females and age 40 for males, an 18-year age gap between the male and female HIV FOI peaks. Throughout the decade covered by this study, the Tsholotsho district remained a 99% confidence hotspot. The impact of STIs, condom use and being married on HIV incidence has been strong in the eastern parts of Zimbabwe, namely Mashonaland Central, Mashonaland East and Manicaland provinces. From the Multiscale Geographically Weighted Regression (MGWR) findings, we observed that Matabeleland North's HIV incidence rates are driven by wealth index, multiple sex partners, STIs and females with older partners. Conclusions: The difference between the results from the Cox-Snell residuals graphical method and those based on the AIC value may be due to inadequate methods for testing the goodness-of-fit of interval-censored data. We concluded that Model 2, with interval-censoring, gave better estimates due to its consistency with published results from the literature. Even though we consider the interval-censoring model superior for our specific data, the method had its own set of limitations. Programmes targeted at HIV testing could use the machine learning approach to identify high-risk individuals. In addition to other risk reduction techniques, machine learning may aid in identifying those who might require pre-exposure prophylaxis. Based on our results, older men and younger women exhibited patterns of higher HIV prevalence and force-of-infection than younger men and older women. This could be an indication of age-disparate sexual relationships. Therefore, HIV prevention programmes should be targeted more at younger females and older males. Lastly, to improve programmatic and policy decisions in the national HIV response, we recommend the triangulation of multiple methods for incidence estimation and interpretation of results. Multiple estimating approaches should be considered to reduce uncertainty in the estimates from the various models. The study shows that various factors differ from district to district and over time. The study's findings could be useful to policymakers in terms of resource allocation in the context of public health programmes.
    The findings of this study also highlight the importance of focusing on districts like Tsholotsho, which have consistently had a high HIV burden over time. The main strength of this study lies in the quality of the data obtained from the surveys: these data were derived from population-based surveys, which provide more reliable and robust data. Another strength of this study was that we did not restrict our analysis to one method; rather, we had the opportunity to determine the risk and incidence of HIV by exploring different methodologies. However, the limited number of variables accessible to us for this study constituted one of its drawbacks. We could not determine the impact of variables including viral load, healthcare spending, HIV-risk groups, and other HIV-related interventions. Additionally, there were missing values in the data, which required making assumptions about their randomness and utilising imputation methods that are inherently flawed. Last but not least, a number of the variables were self-reported and, as a result, were vulnerable to recall bias and social desirability bias.
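    A schematic sketch (synthetic data) of the class-imbalance handling and model selection described in the abstract above: an 80/20 split, SMOTE oversampling of the training set, stratified 5-fold cross-validation, and choosing the algorithm with the highest F1 score. The package choices (scikit-learn, imbalanced-learn, xgboost) and parameters are assumptions about a typical Python setup, not the thesis's exact code.

    ```python
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
    from xgboost import XGBClassifier

    # Roughly 4.2:1 negative:positive, mirroring the imbalance in the abstract.
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.81, 0.19],
                               random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                              stratify=y, random_state=1)
    X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_tr, y_tr)  # balance the classes

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    candidates = {
        "logistic": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=1),
        "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
    }
    # Select the algorithm with the highest mean F1 score under stratified 5-fold CV.
    scores = {name: cross_val_score(m, X_bal, y_bal, cv=cv, scoring="f1").mean()
              for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores, "-> best by F1:", best)
    ```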
  • Predicting in-hospital mortality in heart failure patients using machine learning
    (2024) Mpanya, Dineo
    The age of onset and causes of heart failure differ between high-income and low- and middle-income countries (LMICs). Heart failure patients in LMICs also experience a higher mortality rate. Innovative ways of risk-stratifying heart failure patients in this region are needed. The aim of this study was to demonstrate the utility of machine learning in predicting all-cause mortality in heart failure patients hospitalised in a tertiary academic centre. Six supervised machine learning algorithms were trained to predict in-hospital all-cause mortality using data from 500 consecutive heart failure patients with a left ventricular ejection fraction (LVEF) of less than 50%. The mean age was 55.2 ± 16.8 years. There were 271 (54.2%) males, and the mean LVEF was 29 ± 9.2%. The median duration of hospitalisation was 7 days (interquartile range: 4–11), and it did not differ between patients discharged alive and those who died. After a prediction window of 4 years (interquartile range: 2–6), 84 (16.8%) patients died before discharge from the hospital. The area under the receiver operating characteristic curve was 0.82, 0.78, 0.77, 0.76, 0.75, and 0.62 for random forest, logistic regression, support vector machines (SVM), extreme gradient boosting, multilayer perceptron (MLP), and decision trees, respectively, and the accuracy during the test phase was 88, 87, 86, 82, 78, and 76% for random forest, MLP, SVM, extreme gradient boosting, decision trees, and logistic regression, respectively. The support vector machine was the best-performing algorithm; furosemide, beta-blockers, spironolactone, an early diastolic murmur, and a parasternal heave had positive coefficients with respect to the target feature, whereas coronary artery disease, potassium, oedema grade, ischaemic cardiomyopathy, and right bundle branch block on electrocardiogram had negative coefficients. Despite a small sample size, supervised machine learning algorithms successfully predicted all-cause mortality with modest accuracy. The SVM model will be externally validated using data from multiple cardiology centres in South Africa before developing a uniquely African risk prediction tool that can potentially transform heart failure management.
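    A schematic sketch with synthetic data (not the study's patient records) of training several supervised classifiers and comparing test-set AUROC and accuracy, in the spirit of the comparison reported above; the specific estimators, settings and class balance are assumptions.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic cohort: 500 "patients", roughly 17% positive (in-hospital death).
    X, y = make_classification(n_samples=500, n_features=25, weights=[0.83, 0.17],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=0)
    models = {
        "random forest": RandomForestClassifier(random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
        "SVM": SVC(probability=True, random_state=0),
        "MLP": MLPClassifier(max_iter=1000, random_state=0),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        prob = model.predict_proba(X_te)[:, 1]
        print(f"{name:20s} AUROC={roc_auc_score(y_te, prob):.2f} "
              f"accuracy={accuracy_score(y_te, model.predict(X_te)):.2f}")
    ```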