Classifying cancerous tumours using machine learning techniques
Cancer has become a leading cause of death in the modern world. The literature suggests that a third of the population will develop cancer during their lifetime. The focus of this dissertation was to classify tumours as malignant or benign. The data was obtained from the Surveillance, Epidemiology and End Results (SEER) program, which collected data from 1973 to 2018. The SEER program provides data on cancer incidence in the United States and represents 28% of the United States population. The data set contained variables such as age, race, sex, year of diagnosis and tumour classification, along with 14 other variables. The methods used for modelling were K-Nearest Neighbours (KNN), Weighted K-Nearest Neighbours, Artificial Neural Networks, the Naive Bayes classifier and Bayesian Neural Networks. All of these models were fitted with the Synthetic Minority Oversampling Technique (SMOTE), as the data set was imbalanced, with a ratio of 40 to 1 for the malignant tumours. The best model for the data set was the KNN model with five neighbours and SMOTE applied, with an area under the curve (AUC) of 0.781.
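As a rough illustration of how SMOTE oversampling and a five-neighbour KNN classifier fit together, the following is a minimal sketch in plain Python. It uses randomly generated two-dimensional toy data rather than SEER data. The function names (`smote`, `knn_score`, `auc`, `cloud`), the cluster centres, the 400-to-10 class split and the choice of which class carries label 1 are all assumptions made for illustration. It is not the dissertation's implementation and is not expected to reproduce the reported AUC.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority points, excluding the base itself.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: euclidean(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(others)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def knn_score(point, train_x, train_y, k=5):
    """Share of the k nearest training points that carry label 1."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(point, train_x[i]))[:k]
    return sum(train_y[i] for i in nearest) / k


def auc(scores, labels):
    """Probability that a random label-1 case outscores a random label-0
    case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (an assumption, not SEER): 400 majority vs 10 minority points, 40 to 1.
rng = random.Random(42)


def cloud(centre, n):
    """n Gaussian points around (centre, centre) with unit spread."""
    return [(rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)) for _ in range(n)]


majority, minority = cloud(0.0, 400), cloud(2.0, 10)

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority), k=5, rng=rng)

train_x = majority + minority + synthetic
train_y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

# Held-out toy test set, scored with a five-neighbour KNN.
test_x = cloud(0.0, 100) + cloud(2.0, 100)
test_y = [0] * 100 + [1] * 100
test_scores = [knn_score(p, train_x, train_y, k=5) for p in test_x]
print("toy AUC:", round(auc(test_scores, test_y), 3))
```

In this sketch SMOTE is applied to the training points only, so the held-out points stay untouched by synthetic samples. Whether the dissertation oversampled before or after splitting is not stated in the abstract.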
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science