An empirical comparison of supervised machine learning techniques for record linkage of data from health facilities and demographic surveillance system: a case study from rural North-East South Africa

dc.contributor.author: Jezile, Vuyokazi Sharon
dc.date.accessioned: 2019-07-17T10:34:46Z
dc.date.available: 2019-07-17T10:34:46Z
dc.date.issued: 2018-10
dc.description: A research report submitted to the Faculty of Health Sciences, the University of the Witwatersrand in partial fulfilment of the requirements for the degree of Master of Science, October 2018
dc.description.abstract: Record linkage of electronic patient records based on conventional personal identifiers is the most widely used method of integrating information from different sources. The record linkage of Health and Demographic Surveillance System data and hospital records offers an opportunity for researchers to improve the quality of clinical research data and to discover new insights from clinical research data repositories. Record linkage algorithms are usually applied to different data sources with the aim of increasing the amount of information available to answer research questions that could not originally be answered from individual databases. Most record linkage systems do not compare results (using sensitivity and specificity criteria) from different algorithms before ascertaining a match, possible match, or non-match. Little research has compared these algorithms on health-related datasets. This study seeks to fill this gap by implementing and comparing K-nearest neighbor classification (also known as the K-nearest neighbor algorithm) and Support Vector Machine with deterministic and probabilistic record-linking techniques. The empirical evaluation was performed on a linkage between the core dataset from the Agincourt Health and Demographic Surveillance System and electronic medical records from Bhubezi Community Health Centre. This work serves as a basis for other Health and Demographic Surveillance System sites in linking their Demographic Surveillance System census datasets with external data repositories. Match rates and error rates for the three strategies are compared, and their similarities and differences, strengths, and weaknesses are discussed. The results showed that the supervised machine learning techniques performed better than the unsupervised techniques.
Support Vector Machine (SVM) and K-nearest neighbor (KNN) techniques had high sensitivity, specificity, and positive predictive value (PPV) in all three blocks. However, in terms of the number of true matched records, KNN performed better than SVM in all three blocks. The true matches for KNN were 1 995, 1 727, and 911 respectively, with sensitivities of 90%, 84.68%, and 96.16% and positive predictive values of 93.33%, 83.62%, and 70.23%. The F-score for SVM and KNN ranged from 99.97% to 99.98%, whereas for EM it ranged from 87% to 91%. The results clearly show that the supervised machine learning techniques perform very well compared to the unsupervised techniques.
dc.description.librarian: XL2019
dc.identifier.uri: https://hdl.handle.net/10539/27695
dc.language.iso: en
dc.title: An empirical comparison of supervised machine learning techniques for record linkage of data from health facilities and demographic surveillance system: a case study from rural North-East South Africa
dc.type: Thesis

Files

Original bundle

Name: Research Report_MSc_0418142M_VJezile_17102018.pdf
Size: 762.47 KB
Format: Adobe Portable Document Format