Predicting HIV status among women in South Africa using machine learning: comparing decision tree model and logistic regression
Date
2020
Authors
Oladokun, Oluwabukola Oluwapelumi
Abstract
The HIV epidemic has grown into a serious global public health problem. In 2017, 940,000
people died from HIV and approximately 1.8 million new infections were reported worldwide.
Almost half of all new HIV infections are in women aged 15-24 years in sub-Saharan Africa.
In addition, South Africa has the largest number of people living with HIV worldwide, an
estimated 7.2 million. To manage this epidemic effectively, a better understanding of the
sociodemographic factors that influence the risk of seroconversion is needed. Such
understanding can be gained by modelling the HIV epidemic, especially among at-risk
populations. More specifically, the aim of this study is to predict the HIV status of an
individual from readily available demographic data using a decision tree model, and to
compare the results with those of traditional logistic regression.
Individual recode data for women in South Africa were obtained from the 2016 Demographic and
Health Survey (DHS). The study sample comprised 7808 women aged 15-49 years living in South
Africa. The data were split into training (75%) and testing (25%) datasets. The logistic
regression model had the higher accuracy on both the training (62.90%) and testing (68.039%)
datasets. Accuracy for the decision tree model was 63.93%. The
AUCs from the ROC curves were 0.652 and 0.682 for the decision tree (DT) and logistic
regression (LG) models respectively. This means that a randomly chosen HIV-positive woman
would be assigned a higher predicted risk than a randomly chosen HIV-negative woman about
65.2% of the time by the DT model and about 68.2% of the time by the LG model.
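As a hedged illustration of the evaluation steps described above, the sketch below shows a 75%/25% split, an accuracy calculation, and the AUC computed as the share of (positive, negative) pairs that a model ranks correctly. It is not the study's code: the function names, the random seed, the 0.5 threshold, and the toy labels and scores are all assumptions made for illustration, and only the sample size of 7808 comes from the abstract.

```python
import random

def split_75_25(rows, seed=0):
    """Shuffle rows and split them into 75% training and 25% testing."""
    rng = random.Random(seed)  # seed is an arbitrary assumption
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Row indices standing in for the 7808 women in the study sample
train, test = split_75_25(list(range(7808)))
print(len(train), len(test))  # 5856 1952

# Hypothetical toy labels (1 = HIV positive) and predicted risk scores
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is assumed

print(accuracy(y_true, y_pred))  # 0.75
print(auc(y_true, scores))       # 0.8
```

In this toy example 12 of the 15 positive-negative pairs are ordered correctly, which gives an AUC of 0.8. Accuracy depends on the chosen cut-off, whereas the AUC does not.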
Neither model achieved high accuracy, and logistic regression unexpectedly outperformed the
decision tree, whose accuracy may have been reduced by overfitting. In addition, demographic
data alone might not be enough to predict HIV status accurately, especially at the medical
classification level, or more variables may be needed to build the model. It is also
recommended that different input features be tested and that automatic relevance
determination be used to assess which inputs contribute to the output of the model.
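The recommendation to assess which inputs contribute to the model's output can be illustrated with permutation importance. This is a simpler technique swapped in here for illustration. It is not the automatic relevance approach named above, and it is not the author's method. The toy data, the `predict` stand-in for a fitted model, the number of repeats, and the seed are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Share of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)  # seed is an arbitrary assumption
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other columns untouched
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: column 0 drives the label, column 1 is noise
X = [[0, 1], [0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1], [1, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def predict(row):
    # Hypothetical stand-in for a fitted model; it only looks at column 0
    return row[0]

importances = permutation_importance(predict, X, y)
print(importances)
```

Because the stand-in model ignores column 1, shuffling that column never changes a prediction and its importance is exactly zero. An input the model relies on will show a non-negative accuracy drop when shuffled.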
Description
A project report submitted to the Faculty of Humanities, University of the Witwatersrand,
Johannesburg, in partial fulfilment of the requirements for the degree of Master of Arts in E-science (Data Science)