An empirical comparison of supervised machine learning techniques for record linkage of data from health facilities and demographic surveillance system: a case study from rural North-East South Africa

Date

2018-10

Authors

Jezile, Vuyokazi Sharon

Abstract

Record linkage of electronic patient records based on conventional personal identifiers is the most widely used method of integrating information from different sources. Linking Health and Demographic Surveillance System data with hospital records offers researchers an opportunity to improve the quality of clinical research data and to discover new insights from clinical research data repositories. Record linkage algorithms are usually applied to different data sources with the aim of increasing the amount of information available to answer research questions that could not be answered from the individual databases alone. Most record linkage systems do not compare results (using sensitivity and specificity criteria) from different algorithms before ascertaining a match, possible match, or non-match. Little research has compared these algorithms on health-related datasets. This study seeks to fill this gap by implementing the K-nearest neighbor (KNN) classification algorithm and the Support Vector Machine (SVM) and comparing them with deterministic and probabilistic record-linking techniques. The empirical evaluation was performed on a linkage between the core dataset from the Agincourt Health and Demographic Surveillance System and electronic medical records from Bhubezi Community Health Centre. This work serves as a basis for other Health and Demographic Surveillance System sites in linking their Demographic Surveillance System census datasets with external data repositories. Match rates and error rates for the three strategies are compared, and their similarities and differences, strengths, and weaknesses are discussed. The results showed that the supervised machine learning techniques performed better than the unsupervised techniques. Both SVM and KNN had high sensitivity, specificity, and positive predictive value (PPV) in all three blocks.
However, in terms of the number of true matched records, KNN performed better than SVM in all three blocks. KNN identified 1 995, 1 727, and 911 true matches in the three blocks, with sensitivity values of 90%, 84.68%, and 96.16%. The positive predictive values were 93.33%, 83.62%, and 70.23% respectively. The F-score for SVM and KNN ranged from 99.97% to 99.98%, whereas for EM it ranged from 87% to 91%. The results clearly show that the supervised machine learning techniques perform very well compared to the unsupervised techniques.
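
The evaluation measures named in the abstract (sensitivity, specificity, positive predictive value, and F-score) can all be derived from the counts of true and false matches and non-matches. The sketch below is a minimal illustration using the standard definitions; it is not taken from the research report. The function name `linkage_metrics` and the example counts are hypothetical, and the code assumes that none of the denominators is zero.

```python
def linkage_metrics(tp, fp, tn, fn):
    """Compute standard linkage quality measures from confusion counts.

    tp: true matches correctly linked
    fp: non-matches incorrectly linked
    tn: non-matches correctly left unlinked
    fn: true matches that were missed
    Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)   # share of true matches that were found
    specificity = tn / (tn + fp)   # share of non-matches correctly rejected
    ppv = tp / (tp + fp)           # share of declared links that are correct
    # F-score: harmonic mean of PPV (precision) and sensitivity (recall)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_score": f_score,
    }


# Hypothetical counts for illustration only; not results from the study.
example = linkage_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Reporting all four measures together is useful because record pairs are usually dominated by non-matches, so specificity alone can look high even when many true matches are missed.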

Description

A research report submitted to the Faculty of Health Sciences, the University of the Witwatersrand in partial fulfilment of the requirements for the degree of Master of Science, October 2018
