Record linkage of national health laboratory service (NHLS) HIV datasets to cancer registry datasets using supervised learning techniques
Date
2019
Authors
Olago, Victor
Abstract
Introduction
The National Health Laboratory Service (NHLS) is a national network of public health
laboratories that serves more than 80% of the South African population. All the laboratories
are connected to a single data repository called the corporate data warehouse. The South
African National Cancer Registry (NCR) is a pathology-based registry housed within the NHLS. The NCR collates and
analyses cancers diagnosed in pathology laboratories nationwide. These two data repositories
present the opportunity to link HIV and cancer data to improve cancer surveillance
among HIV-positive people. Such linkage studies have made major contributions to understanding
the epidemiology of HIV-related cancers in developed countries. Although
probabilistic methods have been used to link HIV and cancer datasets in South Africa before,
they are computationally intensive and do not scale to the national level. Supervised machine
learning has been shown to be scalable and efficient in linking records accurately. In this
work, our aim was to use Support Vector Machine (SVM) algorithms to link national HIV
laboratory data to NCR data for the period 2004 to 2014.
Methods
For the HIV data we used Cluster of Differentiation 4 (CD4) counts, Deoxyribonucleic Acid
Polymerase Chain Reaction (DNA-PCR) and Enzyme-Linked Immunosorbent Assay (ELISA) tests;
for the cancer data we used laboratory-confirmed cancers in the NCR. We linked the two datasets
using names, surname, gender and date of birth, since there was no common unique identifier.
We used Python 3.6 in the Spyder environment for the linkage. The linkage process
involved data pre-processing, deterministic de-duplication, chunking, probabilistic
de-duplication, blocking, pairwise comparison and then record-pair classification using SVM.
After the linkage we performed high-dimensional clustering using a Gaussian Mixture
Model (GMM).
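The blocking and pairwise comparison steps can be illustrated with a minimal sketch. It is not the study's code: the toy records, the field names, the blocking key (gender plus year of birth) and the use of difflib string similarity are all assumptions chosen for illustration. Each candidate pair is reduced to a numeric comparison vector, which is the kind of input a classifier such as an SVM would then label as match or non-match.

```python
from difflib import SequenceMatcher

# Hypothetical toy records; the field names are assumptions, not the study's schema.
hiv_records = [
    {"id": "h1", "first_name": "Thandiwe", "surname": "Nkosi", "gender": "F", "dob": "1980-03-14"},
    {"id": "h2", "first_name": "Pieter", "surname": "Botha", "gender": "M", "dob": "1975-11-02"},
]
cancer_records = [
    {"id": "c1", "first_name": "Thandi", "surname": "Nkosi", "gender": "F", "dob": "1980-03-14"},
    {"id": "c2", "first_name": "Peter", "surname": "Bota", "gender": "M", "dob": "1975-11-02"},
    {"id": "c3", "first_name": "Sipho", "surname": "Dlamini", "gender": "M", "dob": "1990-07-21"},
]


def block_key(record):
    """Illustrative blocking key: gender plus year of birth."""
    return (record["gender"], record["dob"][:4])


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b):
    """Build a comparison vector for one candidate record pair."""
    return [
        similarity(rec_a["first_name"], rec_b["first_name"]),
        similarity(rec_a["surname"], rec_b["surname"]),
        1.0 if rec_a["gender"] == rec_b["gender"] else 0.0,
        1.0 if rec_a["dob"] == rec_b["dob"] else 0.0,
    ]


def candidate_pairs(left, right):
    """Yield only pairs that share a blocking key, avoiding a full cross product."""
    index = {}
    for rec in right:
        index.setdefault(block_key(rec), []).append(rec)
    for rec in left:
        for other in index.get(block_key(rec), []):
            yield rec, other


# Comparison vectors keyed by (HIV record id, cancer record id).
features = {
    (a["id"], b["id"]): compare(a, b)
    for a, b in candidate_pairs(hiv_records, cancer_records)
}
```

Blocking keeps the number of comparisons manageable: in this toy example only two of the six possible pairs are compared, because the third cancer record shares no blocking key with any HIV record.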
Results
The NHLS HIV dataset had 39,249,147 HIV test records, while the NCR cancer dataset had
664,869 laboratory-confirmed cancers for the period 2004 to 2014. De-duplication of
the HIV dataset resulted in 15,157,685 HIV-positive patients, 3,696,121 HIV-negative
patients and 41,147 patients with no valid HIV result. The matched dataset contained
309,741 linked records. A total of 231,945 (74.88%) records had an HIV-positive result,
compared to 69,648 (22.49%) and 8,148 (2.63%) records with an HIV-negative result and no valid
HIV result respectively. The matched dataset had 212,993 (68.76%) and 96,718 (31.23%)
females and males respectively. The distribution by race was 78.45%, 9.42%, 9.19%
and 0.85% for Blacks, Whites, Coloureds and Asians respectively. HIV-positive cancer
patients were 10 years younger at the time of cancer diagnosis than HIV-negative cancer
patients. The proportion with AIDS-defining cancers was 50.67%, compared to 49.33% with
non-AIDS-defining cancers. The precision, recall and F-measure for the linkage were 0.883,
0.997 and 0.937 respectively, based on the records with national identification (ID) numbers
as the ground truth.
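The reported F-measure is consistent with the harmonic mean of the reported precision and recall. A short check, assuming the standard balanced F-measure (F1) is the one meant:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the linkage.
f1 = f_measure(0.883, 0.997)
print(round(f1, 3))  # prints 0.937
```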
Conclusion
Our study demonstrated that SVM algorithms are an effective way of linking large datasets
in the absence of unique identifiers. Such techniques enable accurate linkage of disease
registries in developing countries. This methodology provides opportunities
for enriching HIV cohort data with routinely collected laboratory and treatment data of
other co-morbidities to inform public health actions.
Description
A Research Report submitted to the Faculty of Health Sciences in partial
fulfilment of the requirements for the degree of Master of Science (MSc) in
Epidemiology - Research Data Management
August, 2019