Predicting HIV status among women in South Africa using machine learning: comparing decision tree model and logistic regression

dc.contributor.author: Oladokun, Oluwabukola Oluwapelumi
dc.date.accessioned: 2021-02-23T13:03:01Z
dc.date.available: 2021-02-23T13:03:01Z
dc.date.issued: 2020
dc.description: A project report submitted to the Faculty of Humanities, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Arts in E-science (Data Science) [en_ZA]
dc.description.abstract: The HIV epidemic has grown into a serious global public health problem. In 2017, 940,000 people died from HIV and approximately 1.8 million new infections were reported worldwide. Almost half of all new HIV infections are in women aged 15-24 years in sub-Saharan Africa. In addition, South Africa has the highest HIV rate worldwide, with an estimated 7.2 million people living with the virus in the country. To manage this epidemic effectively, a better understanding of the sociodemographic factors that influence the risk of seroconversion is needed. This can be obtained by modelling the HIV epidemic, especially among at-risk populations. Specifically, this study aims to predict an individual's HIV status from readily available demographic data using a decision tree (DT) model, and to compare the results with traditional logistic regression (LG). Individual recode data for women in South Africa were obtained from the 2016 Demographic and Health Survey (DHS). The study sample comprised 7808 women aged 15-49 years living in South Africa. The data were split into training (75%) and testing (25%) datasets. The logistic regression model had the higher accuracy on both the training (62.90%) and testing (68.039%) datasets. Accuracy for the decision tree model was 63.93%. The areas under the ROC curve (AUC) were 0.652 for the DT and 0.682 for the LG. That is, the DT model would rank a randomly chosen HIV-positive woman above a randomly chosen HIV-negative woman about 65.2% of the time, and the LG model about 68.2% of the time. Neither model was accurate enough, and logistic regression unexpectedly outperformed the decision tree, whose accuracy may have been reduced by overfitting. In addition, demographic data alone might not be enough to predict HIV status accurately, especially at the medical classification level, or more variables may be needed to build the model.
It is also recommended that different input features be tested, and that automatic relevance determination be used to assess which inputs contribute to the output of the model. [en_ZA]
dc.description.librarian: CK2021 [en_ZA]
dc.faculty: Faculty of Humanities [en_ZA]
dc.identifier.uri: https://hdl.handle.net/10539/30612
dc.language.iso: en [en_ZA]
dc.title: Predicting HIV status among women in South Africa using machine learning: comparing decision tree model and logistic regression [en_ZA]
dc.type: Thesis [en_ZA]
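
The abstract reports two evaluation metrics, accuracy and the area under the ROC curve (AUC). As a rough illustration of how they differ, the sketch below computes both in plain Python. It is not the thesis's code: the function names `accuracy` and `pairwise_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for this example, and no DHS data are used. The AUC is computed from its pairwise definition, the share of (positive, negative) pairs in which the positive case receives the higher score.

```python
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts 1 if the positive case has the higher score,
    0.5 if the scores tie, and 0 otherwise.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]

# Hypothetical 0.5 threshold turns scores into predicted labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", pairwise_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas the AUC depends only on how the scores order the cases, which is why the two figures for a model need not agree.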
Files
Original bundle
Name: Final Thesis Report- Pelumi.pdf
Size: 984.43 KB
Format: Adobe Portable Document Format