3. Electronic Theses and Dissertations (ETDs) - All submissions
Permanent URI for this community: https://wiredspace.wits.ac.za/handle/10539/45
3 results
Search Results
Item An empirical analysis and application of the expectation-maximization and matrix completion algorithms for varying degrees of missing data (2020) Thulare, Evans Molahlegi
Incomplete data sets are a problem in most studies; however, few studies have recognised that imputation is a solution to this problem. Incomplete data can have a significant effect on the conclusions drawn and the decisions made. To solve the problem of incomplete data, one should use techniques to recover the missing values, depending on how much of the data is missing, how large the data set is, how the data went missing, etc. In this report, we aimed to compare the performance of the Expectation-Maximization (EM) algorithm and matrix completion when imputing missing values for varying degrees of missing data. Kullback-Leibler (KL) divergence was used as an evaluation metric to observe the performance of the EM algorithm and matrix completion when estimating missing values relative to the ground-truth distribution. The findings of this research show that the EM algorithm outperformed matrix completion in both the theoretical models (the simulated scenarios of learning from varying degrees of missing data) and the application models (the application of the theoretical model to real-world data on credit card fraud). A few similarities between the algorithms were observed when recovering missing values, such as the increasing trend in error as the proportion of missing values increases and the impact of an increasing number of variables in a data set. Matrix completion performed better only when more than approximately 77% of the values were missing. Therefore, from our findings, we conclude that when less than 50% of the data is missing, the EM algorithm produces accurate predictions.
The EM algorithm performed better than matrix completion because it first learned the data itself and used maximum likelihood procedures to estimate the parameters of the model, whereas matrix completion analysed the existing pattern in the rows and columns and imputed the missing values using the pattern learned from the data.
Item Evaluation of cluster analysis and latent class analysis in clustering (2019) Murisa, Tatenda
The study compares the performance of latent class, K-means and hierarchical clustering on data with different degrees of cluster overlap. It also assesses how various standardisation methods affect the results of hierarchical and K-means clustering. Several distance and agglomeration methods are evaluated to observe how they perform depending on cluster overlap. Three artificial datasets were simulated whose clusters were poorly, moderately and well separated. These, along with the seeds data, were run through the three clustering methods. Several external validity indices were calculated for each cluster solution. The adjusted Rand index was used for comparison in the discussion because it is not affected by the number of clusters. Results showed that Ward's method performed better than all other agglomeration methods and that the Manhattan distance performed better across the different cluster types in hierarchical clustering. Latent class clustering performed better for poorly and well separated clusters. When the variances of the variables were comparable, K-means clustering with no standardisation performed well. Standardisation by the maximum value and by z-score had the best cluster recovery when the variances of the variables were large.
Item Knowledge extraction in population health datasets: an exploratory data mining approach (2018) Khangamwa, Gift
There is a growing trend in the utilization of machine learning and data mining techniques for knowledge extraction in health datasets.
In this study, we used machine learning methods for data exploration and model building, and we built classifier models for anemia. Anemia is recognized as a crucial public health challenge that leads to poor health for mothers and infants, and one of its main causes is malaria. We used a dataset from Malawi, where the prevalence of these two health challenges, malaria and anemia, remains high. We employed machine learning algorithms for the task of knowledge extraction on the demographic and health data sets for Malawi for the survey years 2004 and 2010. We followed the cross-industry standard process for data mining methodology to guide our study. The dataset was obtained, cleaned and prepared for experimentation. Unsupervised machine learning methods were used to understand the nature of the data set and the natural groupings in it. Supervised machine learning methods, on the other hand, were used to build predictive models for anemia. Specifically, we used principal component analysis (PCA) and clustering algorithms in our unsupervised machine learning experiments. Support vector machines (SVMs) and decision trees (DTs) were used in the supervised machine learning experiments. Unsupervised ML methods revealed that there was no significant separation of clusters according to the malaria and anemia attributes. However, attributes such as age, economic status, health practices and the number of children a woman has were clustered in significantly different ways, i.e., young and old women went to different clusters. Moreover, PCA results confirmed these findings. Supervised methods, on the other hand, revealed that anemia classifiers could be developed using SVMs and DTs for the dataset. The best-performing model attained an accuracy of 86%, ROC area score of 86%, mean absolute error of 0.27 and kappa of 0.78; it was built using an SVM with C = 100 and γ = 10^(−18).
DTs, on the other hand, produced a best model with an accuracy of 73%, ROC area score of 74%, mean absolute error of 0.36 and kappa statistic of 0.449. In conclusion, we successfully built a good anemia classifier using SVM and also showed the relationship between important attributes in the classification of anemia.
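
The abstract above reports both accuracy and a kappa statistic for each classifier. As a rough illustration of how the two differ, the sketch below computes accuracy and Cohen's kappa from a pair of label lists. It assumes the kappa in question is Cohen's kappa, which the abstract does not state explicitly; the function names `accuracy` and `cohen_kappa` and the toy labels are hypothetical and are not taken from the thesis or its data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)

    # Chance agreement: for each label, the product of the share of true
    # labels and the share of predicted labels, summed over all labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_expected = sum(
        (true_counts[label] / n) * (pred_counts[label] / n) for label in labels
    )

    if p_expected == 1.0:
        # Kappa is undefined when only one label ever occurs; returning 1.0
        # here is a convention chosen for this sketch (an assumption).
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels invented for illustration only (1 = anemic, 0 = not anemic).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

print(accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Because kappa subtracts the agreement expected by chance, it is typically lower than raw accuracy, which is consistent with the gap between the accuracy and kappa figures quoted in the abstract.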