A comparative study of the Wasserstein Distance Generative Adversarial Network and SMOTE density-based over-sampling approaches in addressing class imbalance
Date
2020
Authors
Ngwenduna, Kwanda Sydwell
Abstract
In binary classification problems, class imbalance occurs if one of the classes has
overwhelmingly more instances than the other. This causes a significant bias in the
accuracy of Machine Learning (ML) classifiers. A pioneering and popular approach
to alleviate class imbalance is the Synthetic Minority Over-sampling TEchnique
(SMOTE). However, SMOTE relies little on the true underlying probability distribution
of the minority class data. Probability density estimation approaches have
recently been adopted, but most of these postulate the unknown probability distribution
of the minority class, which can be subjective and inappropriate. Generative
Adversarial Networks (GANs) can sample from the true underlying probability
distribution without explicitly specifying its form. GANs have been used to create
realistic samples and outperform other deep generative models. However, there
have been few theoretical and empirical reviews comparing generative models
such as GANs with SMOTE density-based approaches for alleviating class imbalance,
especially for tabular data sets akin to those of most financial institutions.
This report compares the Wasserstein Conditional GAN with gradient penalty
(WCGAN-GP) to density-based SMOTE approaches for synthetic minority sample
generation on a number of imbalanced data sets. A Logistic Regression (LR) model
is trained to detect minority cases on the imbalanced and over-sampled data sets,
and the results are compared using Precision, Recall, F1-Score and the Receiver
Operating Characteristic (ROC) curve on a testing data set. On average, WCGAN-GP
yields the best results, followed by SMOTE, while RWO and PDFOS perform worse than
the Baseline. WCGAN-GP shows a statistically superior predictive performance
over SMOTE density estimation techniques on 4 of the 5 data sets used. These results
show a significant potential for GANs as an alternative to SMOTE density
techniques, useful for new sample creation, data augmentation and boosting classification
models.
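
The over-sampling and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: it shows only the basic SMOTE interpolation idea (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours) and the Precision, Recall and F1-Score calculations. The function names `smote` and `precision_recall_f1`, the toy data, and the parameter values are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Illustrative SMOTE: interpolate between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, Recall and F1-Score for the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy minority class with two features
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)

# Hypothetical labels and predictions from some classifier
scores = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

Because each synthetic point is a convex combination of two existing minority samples, SMOTE never leaves the region spanned by the observed minority data, which is one reason the abstract contrasts it with methods that model the underlying distribution.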
Description
A research report submitted in partial fulfillment of the requirements for the
degree of Master of Science in the field of e-Science in the School of Computer
Science and Applied Mathematics, University of the Witwatersrand