Monyai, Koena
2024-02-06
2024-02-06
2024
https://hdl.handle.net/10539/37517
A research report submitted in partial fulfilment of the requirements for the degree Master of Science to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023

Liquid chromatography and mass spectrometry-based experiments are currently used in proteomics studies to discover novel biomarkers that can be used in clinical settings and assist in combating diseases such as cancer. Although researchers are discovering large numbers of biomarkers, only a few are applied in a clinical setting. The experimental procedure of proteomics studies is complex and prone to faults. The low number of adopted biomarkers is therefore attributed to faults that occur during the experimental procedure. Quality control solutions have been implemented to increase the number of biomarkers adopted in clinical settings. Inconsistency is one of the significant hurdles in proteomics experiments, as it makes the experimental results unreproducible. Reproducibility is necessary for every scientific study. Therefore, the majority of quality control tools monitor system consistency by identifying experiments or peaks that show evidence of high technical variability. Consistency-based quality control solutions require a large number of experiments, which are not always available or which require substantial time to perform. Quality control solutions that monitor consistency can, therefore, result in delayed troubleshooting. In our study, we present an unsupervised quality control solution that classifies isotopic peak pairs based on their quality, i.e. as low-quality or high-quality peak pairs. Our solution comprises a peak detection technique, feature engineering, and unsupervised classification of the peak pairs using clustering and feature selection.
Our study focuses on identifying peak pairs that deviate from the expected peak shapes and show evidence of interference. We compare several clustering techniques to determine whether they can classify the peak pairs based on quality. The performance of clustering techniques can be negatively affected by the presence of irrelevant and redundant features. Therefore, we also evaluate whether genetic-algorithm-based feature selection improves the clustering results. Our results show that using clustering techniques alone is likely to result in the misclassification of isotopic peak pairs, as all the clusters contain both low-quality and high-quality peak pairs. The clustering techniques misclassified the majority of the low-quality data points. Incorporating a feature selection technique into the clustering improved the overall performance of the techniques and, most notably, significantly reduced the number of misclassified low-quality peak pairs. Before our solution can be employed as a quality control solution and adopted in laboratories, we need to evaluate and optimise the peak detection and feature engineering steps.

en
Unsupervised learning approach to quality control of proteomics studies
Dissertation