The impact of missing data imputation on HIV classification
Date
2009-11-04T13:08:31Z
Authors
Hlalele, Nthabiseng Unathi
Abstract
Missing data are a part of research and data analysis that often cannot be ignored. Although a
number of methods have been developed for handling and imputing missing data, the problem
remains largely unsolved, and many researchers continue to struggle with it.
Owing to the availability of software and advances in computational power, maximum
likelihood and multiple imputation, as well as auto-associative neural networks combined with
genetic algorithms (AANN-GA), have been introduced as solutions to the missing data problem.
Although these methods have performed well in this domain, the impact that missing data and
missing data imputation have on decision making has, until recently, not been assessed. This
dissertation contributes to this knowledge by first introducing a new computational
intelligence model that integrates Neuro-Fuzzy (N-F) modeling, Principal Component Analysis
and genetic algorithms to impute missing data. The performance of this model is then compared
to that of the AANN-GA and to that of the N-F architecture used on its own. To determine
whether the data are predictable, and to assist in processing the data for training, an
analysis of the HIV sero-prevalence data is performed.
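
The general idea behind imputation with an auto-associative model and a genetic algorithm can
be illustrated with a minimal sketch. It is not the dissertation's implementation: a linear,
PCA-based reconstruction model stands in for a trained auto-associative neural network, the
genetic algorithm is a deliberately simple one, the data are synthetic rather than HIV
sero-prevalence data, and all function and parameter names are hypothetical. The missing entry
of a record is chosen so that the completed record is reconstructed by the model with the
smallest error.

```python
import numpy as np


def fit_linear_autoassociator(X, n_components):
    """Fit a PCA-based reconstruction model (assumed stand-in for a trained AANN)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(x, mean, components):
    """Squared error between a record and its reconstruction by the model."""
    reconstructed = mean + (x - mean) @ components.T @ components
    return float(np.sum((x - reconstructed) ** 2))


def ga_impute(record, missing_idx, mean, components, bounds,
              pop_size=40, generations=60, mutation_scale=0.1, seed=0):
    """Search for missing values that minimise the reconstruction error."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    pop = rng.uniform(low, high, size=(pop_size, len(missing_idx)))

    def fitness(candidate):
        x = record.copy()
        x[missing_idx] = candidate  # fill the gaps with the candidate values
        return reconstruction_error(x, mean, components)

    for _ in range(generations):
        scores = np.array([fitness(c) for c in pop])
        # Truncation selection: the better half survives unchanged (elitism).
        parents = pop[np.argsort(scores)[: pop_size // 2]]
        n_children = pop_size - len(parents)
        # Blend crossover between randomly paired parents.
        a = parents[rng.integers(0, len(parents), size=n_children)]
        b = parents[rng.integers(0, len(parents), size=n_children)]
        alpha = rng.uniform(size=(n_children, 1))
        children = alpha * a + (1 - alpha) * b
        # Gaussian mutation, clipped to the allowed range.
        children += rng.normal(0.0, mutation_scale, size=children.shape)
        pop = np.vstack([parents, np.clip(children, low, high)])

    scores = np.array([fitness(c) for c in pop])
    return pop[np.argmin(scores)]


# Synthetic data in which the third variable is roughly the sum of the first two.
data_rng = np.random.default_rng(42)
features = data_rng.uniform(0.0, 1.0, size=(200, 2))
X = np.column_stack([features, features.sum(axis=1) + data_rng.normal(0.0, 0.01, 200)])

mean, components = fit_linear_autoassociator(X, n_components=2)

# A record whose third entry is missing; the generating rule suggests a value near 1.0.
record = np.array([0.3, 0.7, np.nan])
imputed = ga_impute(record, [2], mean, components, bounds=(-1.0, 3.0))
print("imputed value:", imputed[0])
```

In this sketch the fitness function is the reconstruction error of the completed record, so
the quality of the imputation depends entirely on how well the reconstruction model captures
the relationships between variables. A nonlinear model in place of the PCA stand-in would
change the fitness function but not the search procedure.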
Two classification decision-making frameworks are then presented in order to assess the
effect of missing data. These frameworks are trained to classify between two conditions when
presented with a set of data variables. The first uses a Bayesian neural network, which is
statistical in nature, and the second is based on the fuzzy ARTMAP (FAM) classifier, which has
incremental learning abilities. The two methods are used and compared in order to assess the
degree to which missing data, and the imputation thereof, affect decision making. The effect
of missing data differs for the two frameworks: while the Bayesian neural network fails in the
presence of missing data, the FAM classifier still attempts a classification, but with
decreased accuracy. This work has shown that although missing data and the imputation thereof
have an effect on decision making, the degree of that effect depends on the decision-making
framework and on the model used for data imputation.