Exploring the efficacy of popular clustering techniques on gene expression data
Date
2020
Authors
Batista, S. TKS
Abstract
High-throughput data has presented a wealth of genomic information, but as yet no gold standard has been presented and tested as a means of analysing this data. This poses the question of whether biological function can be inferred solely from gene expression data of a host at different states. In light of the lack of information that exists on the procedure to be employed in a true exploratory analysis of gene expression data, a robust methodology was implemented. This included the use of a wide array of clustering algorithms, along with numerous validation indices, to attempt to discover the natural biological classes that exist within largely unannotated data. While not the most novel of the machine-learning techniques proposed for such data analysis, the k-means algorithm outperformed the other methods when validated using known model validation techniques. Testing the functional biological validity of these results showed that they present a sufficiently accurate image of the underlying biological functions. These results, while promising, would require further validation via experimental methods to ensure the accuracy of the biological inferences.
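The general idea of clustering expression profiles and scoring each candidate partition with an internal validation index can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's actual pipeline: the toy expression matrix is invented, the silhouette coefficient stands in for the unspecified validation indices, and the deterministic farthest-first seeding is a choice made here so the example is reproducible. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first profile, then repeatedly add the profile farthest from all seeds."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style k-means; returns one cluster label per profile."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Invented toy data: 9 genes (rows) measured under 3 host states (columns).
expression = [
    [1.0, 1.0, 1.0], [1.2, 0.9, 1.1], [0.9, 1.1, 1.0],
    [5.0, 5.0, 5.0], [5.1, 4.9, 5.2], [4.9, 5.1, 5.0],
    [9.0, 1.0, 5.0], [9.1, 1.2, 4.9], [8.9, 0.9, 5.1],
]

# Score several candidate cluster counts and keep the best-scoring one.
scores = {k: silhouette(expression, kmeans(expression, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

On real expression data the same loop would typically be repeated across several algorithms and several indices, since a single index on a single algorithm says little about biological validity.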
Description
A dissertation submitted in fulfilment of the requirements for the degree of Master of Science, in the School of Computer Science and Applied Mathematics, Faculty of Science, University of the Witwatersrand, Johannesburg, 2020