Analytical frameworks for studies prone to misclassification bias to identify correlates of disease protection. Case study: immunity to malaria
Wambui, Kennedy Mwai
Background: Identifying statistical associations between important biological features is challenging, especially for outcomes that are prone to misclassification and subject to significant biological complexity. Additionally, experimental conditions may introduce systematic and non-systematic bias when biological features are measured. If not handled, these misclassification and systematic errors can lead to bias in describing, framing, selecting, and comparing cases and controls. Consequently, researchers across the globe face challenges in making data-driven decisions and recommendations. This is particularly relevant for studies of natural immunity to malaria, given the approaches used to categorize non-cases and clinical malaria cases in studies that aim to identify correlates of immunity against malaria. The primary aim of this study was to develop analytical approaches that account for differential misclassification of the outcome and systematic bias in the biological features, and then to identify the correlates of protection against disease.
Methodology: This project had three major steps to achieve the overall objective. First, we used methodologies to reduce bias in malaria case definition. Second, we developed approaches to handle bias in high-throughput protein microarray data. Finally, we utilised machine learning approaches to select correlates of malaria protection. To reduce bias in clinical data, we utilised logistic regression and Bayesian approaches, adjusting for repeated measures. To do this, we used longitudinal data from Kilifi, Kenya to investigate whether the varying intensity of malaria transmission had a significant impact on parasite density thresholds and case definition.
Further, we used the same dataset to explore an alternative statistical approach that defines cases using the probability of developing fever rather than discrete threshold cut-offs, as the probability-based approach has been reported to increase predictive power. To handle the bias associated with biological measurements, we developed and compared methods of correcting systematic and non-systematic biases in high-throughput laboratory data. Additionally, we created an R package suite for handling the systematic and non-systematic biases in protein microarray data. Lastly, we compared the performance of Random Forest, GPBoosting, and Bayesian spike-and-slab prior machine learning approaches in identifying merozoite targets of protective immunity against Plasmodium falciparum malaria, using data from multi-centre cohort studies. The Random Forest approach has been reported to handle overfitting bias through bagging. GPBoosting combines a tree-based approach with Gaussian process and mixed effects models, thus adjusting for the site random effect. The Bayesian approach with a spike-and-slab prior performs well in data prone to multicollinearity and produces readily available uncertainty estimates. This final part covers two main themes: the first uses a dichotomous outcome, and the second uses the probability of having fever given a certain parasite density as the outcome in feature selection, using Kilifi data only. A network model was used to help visualize complex associations and linkages among different sites and the identified features.
Results: To perform case definition, malaria attributable fractions were estimated using logistic power and Bayesian latent class approaches. Both approaches estimated similar patterns of fevers attributable to malaria with changing transmission intensities. The former performed well in estimating the probabilities of having fever, while the latter was efficient in determining the parasite density threshold.
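As a rough illustration of the logistic power approach mentioned above, the sketch below assumes the commonly used form logit(p) = alpha + beta * d^tau for the probability of fever p at parasite density d, and averages 1 - exp(-beta * d^tau) over fever cases to obtain a malaria attributable fraction. The function names and parameter values are hypothetical, and the parameters are taken as given rather than fitted. The model specification used in the thesis may differ.

```python
import math


def fever_probability(density, alpha, beta, tau):
    """Probability of fever at a given parasite density.

    Assumes logit(p) = alpha + beta * density**tau (logistic power form).
    """
    linear_predictor = alpha + beta * density ** tau
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def attributable_fraction(fever_case_densities, beta, tau):
    """Fraction of fever episodes attributable to malaria.

    For each fever case the attributable share is (R - 1) / R, where
    R = exp(beta * density**tau) is the odds ratio relative to zero
    parasitaemia. It is written here as 1 - exp(-beta * density**tau),
    which is algebraically identical and avoids overflow for large R.
    """
    shares = [1.0 - math.exp(-beta * d ** tau) for d in fever_case_densities]
    return sum(shares) / len(shares)


# Hypothetical parameter values and parasite densities, for illustration only.
alpha, beta, tau = -2.0, 0.05, 0.5
densities_in_fever_cases = [0.0, 400.0, 2500.0, 10000.0]

print(fever_probability(2500.0, alpha, beta, tau))
print(attributable_fraction(densities_in_fever_cases, beta, tau))
```

A fever case with no detectable parasites contributes zero to the average, and the contribution approaches one as parasite density grows, which is what lets the probability of fever stand in for a fixed density cut-off.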
However, the Bayesian algorithm gave lower estimates than the logistic power model for both malaria attributable fractions (MAF) and probabilities of fever. The Bayesian latent class model gave a lower MAF estimate than the logistic model, with a Bland-Altman bias of 0.20 (0.16-0.24). After correcting (or accounting for) the bias associated with the outcome, we developed a generic one-stop-shop pre-processing suite for protein microarrays that is compatible with data from the major protein microarray scanners. The suite incorporates graphical and tabular interfaces to facilitate a detailed inspection of data and is coupled with supporting guidelines that enable users to select the most appropriate algorithms to systematically address bias arising in customized experiments. From the various machine learning approaches, the proteins PF113, PF3D7_0525800 (IMC1g), AMA1, MSP11, and PF3D7_1136200 were identified as the features most highly associated with immunity to malaria. In addition, using a probability outcome gave a better statistical fit than using a dichotomous outcome in the subset analysis of data from a high-transmission area of Kilifi.
Discussion: Logistic regression and Bayesian approaches to handling misclassification bias gave similar estimates of the malaria attributable fractions used for case definition. In addition, we observed that using probabilities as a way of handling misclassification bias in the malaria outcome gave a better fit than using a dichotomous variable. However, a training sample is required to utilize the Bayesian latent class models. Finally, we developed an R package to handle systematic and non-systematic biases associated with high-throughput assays such as protein microarrays. Specifically, we developed functions to correct for background noise, within-sample variation, mean-variance dependence, and batch effects.
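The four correction steps named above can be sketched as follows. This is a simplified illustration in Python, not the R package's implementation: subtraction with a floor stands in for background correction, the median of replicate spots for within-sample variation, a log2 transform for mean-variance dependence, and per-batch median centring for batch correction. All function names and values are hypothetical.

```python
import math
from statistics import median


def background_correct(foreground, background, floor=1.0):
    """Subtract local background and clamp at `floor` so the log is defined."""
    return [max(f - b, floor) for f, b in zip(foreground, background)]


def log2_transform(values):
    """Log-transform intensities to weaken the mean-variance dependence."""
    return [math.log2(v) for v in values]


def collapse_replicates(values_by_antigen):
    """Summarise replicate spots of each antigen by their median."""
    return {antigen: median(vals) for antigen, vals in values_by_antigen.items()}


def center_batches(values, batches):
    """Subtract each batch's median so all batches share a common centre."""
    batch_median = {
        b: median([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_median[b] for v, b in zip(values, batches)]


# Hypothetical spot intensities for one sample, for illustration only.
corrected = background_correct([1200.0, 950.0, 40.0], [200.0, 150.0, 60.0])
logged = log2_transform(corrected)
print(logged)
```

The order shown (background, transform, replicate summary, batch adjustment) is one plausible pipeline. The appropriate algorithm at each step depends on the experiment, which is the choice the suite's guidelines are described as supporting.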
We also included a user-friendly, interactive, web-based Shiny platform, thus eliminating the need for programming expertise. Combining the data from case definition and protein microarrays and utilizing machine learning approaches, we identified previously described and novel antigens (IMC1g and PF3D7_0629500) that are potential targets of naturally acquired immunity against malaria. However, these and other antigens with a protective effect should be evaluated in further functional assays to inform vaccine development. The probability outcome-based model had a better statistical fit than the binary model, helping to reduce misclassification bias. Nevertheless, a limitation of our study was the lack of training samples for most of the multicentre cohorts; we therefore tested the model on data from only one site in Kilifi (Junju, Kilifi cohort). To the best of our knowledge, this is the largest malaria multicentre study to compare different methodologies for identifying important correlates of immunity to malaria and to demonstrate that a probability-based model is better than a binary outcome model in identifying important features. We propose a workflow that ensures disease is defined without bias, handles systematic and non-systematic bias in laboratory data, and performs feature selection.
A thesis submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy to the Faculty of Health Sciences, School of Public Health, University of the Witwatersrand, Johannesburg, 2021