An empirical comparison of supervised machine learning techniques for record linkage of data from health facilities and demographic surveillance system: a case study from rural North-East South Africa
Date
2018-10
Authors
Jezile, Vuyokazi Sharon
Abstract
Record linkage of electronic patient records based on conventional personal identifiers is the most widely
used method of integrating information from different sources. The record linkage of Health and Demographic
Surveillance System data and hospital records offers an opportunity for researchers to improve the quality
of clinical research data and to discover new insights from clinical research data repositories. Record
linkage algorithms are usually applied to different data sources with the aim of increasing the amount of
information available to answer research questions that could not be originally answered from individual
databases. Most record linkage systems do not compare results (using sensitivity and specificity criteria)
from different algorithms before ascertaining a match, possible match, or non-match. Little research
has compared these different algorithms using health-related datasets. This study seeks to
fill this gap by implementing and comparing the K-nearest neighbor classification (also known as K-nearest
neighbor algorithm) and Support Vector Machine with deterministic and probabilistic record-linking
techniques. The empirical evaluation was performed on a linkage between the core dataset from the Agincourt
Health and Demographic Surveillance System and electronic medical records from Bhubezi Community
Health Centre. This work serves as a basis for other Health and Demographic Surveillance System sites in
linking their Demographic Surveillance System census datasets with external data repositories. Match
rates and error rates for the three strategies are compared, and discussions of their similarities and
differences, strengths, and weaknesses are presented. The results showed that the supervised machine
learning techniques performed better than the unsupervised techniques. The Support Vector Machine
(SVM) and K-nearest neighbor (KNN) techniques had high sensitivity, specificity, and positive predictive
value (PPV) in all three blocks. However, in terms of the number of true matched records, KNN performed
better than SVM in all three blocks. The true matches for KNN were 1 995, 1 727, and 911, with
sensitivities of 90%, 84.68%, and 96.16%, respectively. The corresponding positive predictive values
were 93.33%, 83.62%, and 70.23%. The F-score for SVM and KNN ranged from 99.97% to 99.98%, whereas
for EM it ranged from 87% to 91%. The results clearly show that the supervised machine learning
techniques perform very well compared to the unsupervised techniques.
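
The evaluation measures named in the abstract (sensitivity, specificity, PPV, and F-score) follow from the counts of correctly and incorrectly classified record pairs. The snippet below is a minimal sketch of those standard definitions, not the report's own code: the `LinkageCounts` class, the function names, and the example counts are all hypothetical and are not taken from the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageCounts:
    """Confusion-matrix counts for classified record pairs (hypothetical helper)."""
    tp: int  # pairs classified as matches that are true matches
    fp: int  # pairs classified as matches that are true non-matches
    fn: int  # true matches classified as non-matches
    tn: int  # true non-matches classified as non-matches


def sensitivity(c: LinkageCounts) -> float:
    """Share of true matches that the classifier found."""
    return c.tp / (c.tp + c.fn)


def specificity(c: LinkageCounts) -> float:
    """Share of true non-matches that the classifier rejected."""
    return c.tn / (c.tn + c.fp)


def ppv(c: LinkageCounts) -> float:
    """Positive predictive value: share of declared matches that are correct."""
    return c.tp / (c.tp + c.fp)


def f_score(c: LinkageCounts) -> float:
    """Harmonic mean of PPV and sensitivity."""
    p, r = ppv(c), sensitivity(c)
    return 2 * p * r / (p + r)


# Illustrative counts only; these are NOT figures from the study.
example = LinkageCounts(tp=90, fp=10, fn=10, tn=890)

if __name__ == "__main__":
    for name, fn in [("sensitivity", sensitivity), ("specificity", specificity),
                     ("PPV", ppv), ("F-score", f_score)]:
        print(f"{name}: {fn(example):.4f}")
```

Because non-matching pairs usually far outnumber matching pairs in record linkage, specificity tends to be high by default, which is one reason PPV and the F-score are commonly reported alongside it.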
Description
A research report submitted to the Faculty of Health Sciences, the University of the Witwatersrand in
partial fulfilment of the requirements for the degree of Master of Science, October 2018