Credit card fraud detection system using extreme gradient boosting machine and isolation forest
Date
2023
Authors
Serongwa, Tselahale Lloyd
Abstract
An extreme gradient boosting machine (a supervised technique) and an isolation forest model (an unsupervised technique) are compared in the development of a credit card fraud detection system based on the misuse of credit cards. Simulated data from the Sparkov Data Generation tool was used, comprising 1 296 675 transactions, 99.4% of which were non-fraudulent. ADASYN was used to resample the dataset, and weight of evidence (WoE) encoding was used to bin classes and reduce dimensionality. XGBoost produced the superior model, as the isolation forest (iForest) predictors failed to match any variation of the XGBoost predictors. The top eleven predictive variables derive from five original variables: time, age, job, amount, and merchant. Together these carried a combined prediction power of 48.2%. This work established that temporal information is a key aspect of the dataset: features engineered from the date and timestamp contributed 18% of total predictive power, the highest of any variable in the optimal model. An aggregation of amount and age was also among the top predictors, at 8.5% predictive power. Supervised techniques were shown to yield more reliable and powerful predictive models than their unsupervised counterparts in dynamic, multi-dimensional, mixed-data-type problems. The complex, dynamic nature of a fraud detection dataset is better suited to a model that learns from labels, as many fraudulent transactions closely mimic legitimate ones. It was found, however, that reducing the number of features improved the performance of an iForest model on a multi-dimensional dataset. The optimal XGBoost model used a learning rate of 0.1 and built 40 decision trees 10 levels deep. On the testing dataset the model achieved an average F1-score of 98%, a Cohen's kappa of 0.96, and a Gini coefficient of 96%, compared with an F1-score of 98%, a Cohen's kappa of 0.973, and a Gini coefficient of 97.3% on the training dataset. The optimal model is stable and robust, as no single variable contributes more than 18% of total predictive power.
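The weight of evidence (WoE) binning named in the abstract can be sketched as below. For each bin, WoE = ln(share of non-fraud events / share of fraud events), and the information value (IV) sums (share difference) × WoE over all bins. This is a minimal illustration of the standard WoE formula only; the bin names and counts are hypothetical and are not taken from the thesis data.

```python
import math

def weight_of_evidence(bins):
    """Compute WoE per bin and the overall information value (IV).

    bins maps a bin name to a (non_fraud_count, fraud_count) pair.
    WoE for a bin is ln(pct_non_fraud / pct_fraud); IV accumulates
    (pct_non_fraud - pct_fraud) * WoE across bins.
    """
    total_good = sum(good for good, bad in bins.values())
    total_bad = sum(bad for good, bad in bins.values())
    woe, iv = {}, 0.0
    for name, (good, bad) in bins.items():
        pct_good = good / total_good  # share of non-fraud in this bin
        pct_bad = bad / total_bad     # share of fraud in this bin
        w = math.log(pct_good / pct_bad)
        woe[name] = w
        iv += (pct_good - pct_bad) * w
    return woe, iv

# Hypothetical transaction-amount bins: (non-fraud count, fraud count)
bins = {"low": (900, 2), "mid": (80, 3), "high": (20, 5)}
woe, iv = weight_of_evidence(bins)
```

A positive WoE marks a bin dominated by legitimate transactions, a negative WoE a bin where fraud is over-represented; encoding a variable by its bin WoE replaces many categorical levels with a single numeric column, which is the dimensionality-reduction effect the abstract describes.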
Description
A research report submitted in partial fulfillment of the requirements for the degree of Master of Science to the Faculty of Science, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, 2023