A comparative study of the Wasserstein Distance Generative Adversarial Network and SMOTE density-based over-sampling approaches in addressing class imbalance

Ngwenduna, Kwanda Sydwell

Files

MSc_ResearchReport_Kwanda_377356_Final_Submission_2020.pdf (1.25 MB)

Date

2020

Authors

Ngwenduna, Kwanda Sydwell

Abstract

In binary classification problems, class imbalance occurs when one class has overwhelmingly more instances than the other. This causes a significant bias in the accuracy of Machine Learning (ML) classifiers. A pioneering and popular approach to alleviate class imbalance is the Synthetic Minority Over-sampling TEchnique (SMOTE). However, SMOTE places little reliance on the true underlying probability distribution of the minority class data. Probability density estimation approaches have recently been adopted, but most of these postulate a form for the unknown probability distribution of the minority class, which can be subjective and inappropriate. Generative Adversarial Networks (GANs) can sample from the true underlying probability distribution without explicitly specifying its form. GANs have been used to create realistic samples and outperform other deep generative models. However, there have been limited theoretical and empirical reviews comparing generative models such as GANs with SMOTE density-based approaches for alleviating class imbalance, especially for tabular data sets akin to those of most financial institutions. This report compares the Wasserstein Conditional GAN with gradient penalty (WCGAN-GP) to density-based SMOTE approaches for synthetic minority sample generation on a number of imbalanced data sets. A Logistic Regression (LR) model is trained to detect minority cases on the imbalanced and over-sampled data sets, and the resulting models are compared using Precision, Recall, F1-Score and the Receiver Operating Characteristic (ROC) curve on a testing data set. On average, WCGAN-GP yields the best results, followed by SMOTE, with RWO and PDFOS performing worse than the Baseline. WCGAN-GP shows statistically superior predictive performance over SMOTE density estimation techniques on 4 of the 5 data sets used.
These results show significant potential for GANs as an alternative to SMOTE density techniques, useful for new sample creation, data augmentation and boosting classification models.
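
As an illustration of the over-sampling and evaluation steps named in the abstract, the following is a minimal sketch and not the report's implementation. It shows SMOTE-style interpolation between a minority point and one of its nearest minority neighbours, and the computation of Precision, Recall and F1-Score from predicted labels. The function names `smote_sketch` and `precision_recall_f1`, the neighbour count `k` and the toy data are assumptions made for this example; classifier training, WCGAN-GP, RWO and PDFOS are not covered.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority points by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute Precision, Recall and F1-Score for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy usage: four minority points on the unit square (hypothetical data).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5)
metrics = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because each synthetic point lies on a line segment between two existing minority points, this kind of interpolation never leaves the convex hull of the observed minority data, which is one way to read the abstract's remark that SMOTE does not model the underlying distribution.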

Description

A research report submitted in partial fulfillment of the requirements for the degree of Master of Science in the field of e-Science in the School of Computer Science and Applied Mathematics, University of the Witwatersrand

URI

https://hdl.handle.net/10539/33015

Collections

ETD Collection
