Credit card fraud detection system using extreme gradient boosting machine and isolation forest
Date
2023
Authors
Serongwa, Tselahale Lloyd
Abstract
An extreme gradient boosting machine (a supervised technique) and an isolation forest model (an unsupervised technique) are compared in the development of a credit card fraud detection system based on the misuse of credit cards. Simulated data from the Sparkov Data Generation tool was used, comprising 1 296 675 transactions, 99.4% of which were non-fraudulent. ADASYN was used to resample the dataset, and weight of evidence (WoE) encoding was used to bin classes and reduce dimensionality. XGBoost produced the superior model, as the isolation forest (iForest) predictors failed to match any variation of the XGBoost predictors. The top eleven predictive variables derive from five original variables: time, age, job, amount, and merchant. Together these carried a combined prediction power of 48.2%. This work established that temporal information is a key aspect of the dataset: features engineered from the date and timestamp contributed 18% of total predictive power, the highest of any variable in the optimal model. An aggregation of amount and age was also among the top predictors, at 8.5% predictive power. Supervised techniques were shown to yield more reliable and powerful predictive models than their unsupervised counterparts in dynamic, multi-dimensional, mixed-data-type problems. The complex, dynamic nature of a fraud detection dataset is better suited to a model that learns from labels, as many fraudulent transactions closely mimic legitimate ones. It was found, however, that reducing the number of features improved the performance of an iForest model on a multi-dimensional dataset. The optimal XGBoost model used a learning rate of 0.1 and built 40 decision trees 10 levels deep. On the testing dataset the model achieved an average F1-score of 98%, a Cohen's kappa of 0.96, and a Gini coefficient of 96%, compared with an F1-score of 98%, a Cohen's kappa of 0.973, and a Gini coefficient of 97.3% on the training dataset. The optimal model is stable and robust, as no single variable contributes more than 18% of total predictive power.
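The weight of evidence (WoE) binning named in the abstract can be sketched as below. For each bin, WoE = ln(share of non-fraud events / share of fraud events), and the information value (IV) sums (share difference) × WoE over all bins. This is a minimal illustration of the standard WoE formula only; the bin names and counts are hypothetical and are not taken from the thesis data.

```python
import math

def weight_of_evidence(bins):
    """Compute WoE per bin and the overall information value (IV).

    bins maps a bin name to a (non_fraud_count, fraud_count) pair.
    WoE for a bin is ln(pct_non_fraud / pct_fraud); IV accumulates
    (pct_non_fraud - pct_fraud) * WoE across bins.
    """
    total_good = sum(good for good, bad in bins.values())
    total_bad = sum(bad for good, bad in bins.values())
    woe, iv = {}, 0.0
    for name, (good, bad) in bins.items():
        pct_good = good / total_good  # share of non-fraud in this bin
        pct_bad = bad / total_bad     # share of fraud in this bin
        w = math.log(pct_good / pct_bad)
        woe[name] = w
        iv += (pct_good - pct_bad) * w
    return woe, iv

# Hypothetical transaction-amount bins: (non-fraud count, fraud count)
bins = {"low": (900, 2), "mid": (80, 3), "high": (20, 5)}
woe, iv = weight_of_evidence(bins)
```

A positive WoE marks a bin dominated by legitimate transactions, a negative WoE a bin where fraud is over-represented; encoding a variable by its bin WoE replaces many categorical levels with a single numeric column, which is the dimensionality-reduction effect the abstract describes.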
Description
A research report submitted in partial fulfillment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023