Record linkage of National Health Laboratory Service (NHLS) HIV datasets to cancer registry datasets using supervised learning techniques

Date

2019

Authors

Olago, Victor

Abstract

Introduction: The National Health Laboratory Service (NHLS) is a national network of public health laboratories that serves more than 80% of the South African population. All the laboratories are connected to a single data repository called the Corporate Data Warehouse. The South African National Cancer Registry (NCR) is a pathology-based registry housed within the NHLS. The NCR collates and analyses cancers diagnosed in pathology laboratories nationwide. Together, these two data repositories present an opportunity to link HIV and cancer data to improve cancer surveillance among HIV-positive people. Such linkage studies have made major contributions to understanding the epidemiology of HIV-related cancers in developed countries. Although probabilistic methods have been used to link HIV and cancer datasets in South Africa before, they are computationally intensive and do not scale to the national level. Supervised machine learning has been shown to be scalable and efficient in linking records accurately. In this work, our aim was to use Support Vector Machine (SVM) algorithms to link national HIV laboratory data to NCR data for the period 2004 to 2014.

Methods: For the HIV data we used Cluster of Differentiation 4 (CD4) count, Deoxyribonucleic Acid Polymerase Chain Reaction (DNA-PCR) and Enzyme-Linked Immunosorbent Assay (ELISA) test records; for the cancer data we used laboratory-confirmed cancers in the NCR. Because the two datasets share no common unique identifier, we linked them using names, surname, gender and date of birth. We performed the linkage in Python 3.6, run from the Spyder environment. The linkage process involved data pre-processing, deterministic de-duplication, chunking, probabilistic de-duplication, blocking, pairwise comparison and, finally, record-pair classification using SVM. After the linkage we performed high-dimensional clustering using a Gaussian Mixture Model (GMM).

Results: The NHLS HIV dataset had 39,249,147 HIV test records, while the NCR cancer dataset had 664,869 laboratory-confirmed cancers for the period 2004 to 2014. De-duplication of the HIV dataset resulted in 15,157,685 HIV-positive patients, 3,696,121 HIV-negative patients and 41,147 patients with no valid HIV result. The matched dataset contained 309,741 linked records. Of these, 231,945 (74.88%) records had an HIV-positive result, compared to 69,648 (22.49%) with an HIV-negative result and 8,148 (2.63%) with no valid HIV result. The matched dataset had 212,993 (68.76%) females and 96,718 (31.23%) males. The racial distribution was 78.45% Black, 9.42% White, 9.19% Coloured and 0.85% Asian. HIV-positive cancer patients were 10 years younger at the time of cancer diagnosis than HIV-negative cancer patients. The proportion with AIDS-defining cancers was 50.67%, compared to 49.33% with non-AIDS-defining cancers. The precision, recall and F-measure for the linkage were 0.883, 0.997 and 0.937 respectively, using the records with national identification (ID) numbers as the ground truth.

Conclusion: Our study demonstrated that SVM algorithms are an effective way of linking large datasets in the absence of unique identifiers. Such techniques enable disease registries in developing countries to be linked accurately. This methodology provides opportunities for enriching HIV cohort data with routinely collected laboratory and treatment data on other co-morbidities to inform public health actions.
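As an illustration only, the blocking and pairwise comparison steps named in the abstract, and the relationship between the reported precision, recall and F-measure, might be sketched as follows. This is not the author's code: the field names, the blocking key (surname initial plus birth year), the toy records and the use of `difflib` as the string similarity measure are all assumptions made for the example. The sketch uses only the Python standard library and stops at the comparison vectors, which are the kind of features a supervised classifier such as an SVM would then label as match or non-match.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; difflib's ratio is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(record):
    # Hypothetical blocking key: surname initial plus year of birth.
    return (record["surname"][:1].lower(), record["dob"][:4])


def candidate_pairs(hiv_records, cancer_records):
    """Yield only pairs sharing a blocking key, not the full cross product."""
    blocks = defaultdict(list)
    for rec in cancer_records:
        blocks[block_key(rec)].append(rec)
    for hiv_rec in hiv_records:
        for cancer_rec in blocks.get(block_key(hiv_rec), []):
            yield hiv_rec, cancer_rec


def comparison_vector(a, b):
    """Per-field agreement scores: the features a classifier would see."""
    return [
        similarity(a["name"], b["name"]),
        similarity(a["surname"], b["surname"]),
        1.0 if a["gender"] == b["gender"] else 0.0,
        1.0 if a["dob"] == b["dob"] else 0.0,
    ]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented toy records, for illustration only.
hiv = [
    {"id": "h1", "name": "Thandi", "surname": "Nkosi", "gender": "F", "dob": "1980-03-14"},
    {"id": "h2", "name": "Pieter", "surname": "Botha", "gender": "M", "dob": "1975-11-02"},
]
cancer = [
    {"id": "c1", "name": "Thandie", "surname": "Nkosi", "gender": "F", "dob": "1980-03-14"},
    {"id": "c2", "name": "Sipho", "surname": "Ndlovu", "gender": "M", "dob": "1980-07-21"},
]

pairs = list(candidate_pairs(hiv, cancer))
vectors = {(a["id"], b["id"]): comparison_vector(a, b) for a, b in pairs}

for ids, vec in vectors.items():
    print(ids, [round(v, 2) for v in vec])

# The reported precision and recall reproduce the reported F-measure.
print(round(f_measure(0.883, 0.997), 3))  # 0.937
```

In this toy example blocking reduces four possible record pairs to two candidates. The final line checks that the reported F-measure of 0.937 is consistent with the reported precision of 0.883 and recall of 0.997.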

Description

A Research Report submitted to the Faculty of Health Sciences in partial fulfilment of the requirements for the degree of Master of Science (MSc) in Epidemiology - Research Data Management, August 2019.
